
Semantic Collaborative Learning for Cross-Modal Moment Localization

Published: 07 November 2023

Abstract

Localizing a desired moment within an untrimmed video via a given natural language query, i.e., cross-modal moment localization, has recently attracted widespread research attention. The task is challenging because it requires not only accurately understanding intra-modal semantic information, but also explicitly capturing inter-modal semantic correlations (both consistency and complementarity). Existing efforts focus mainly on intra-modal semantic understanding and inter-modal semantic alignment, while overlooking the necessary semantic supplement. We therefore present a cross-modal semantic perception network for more effective intra-modal semantic understanding and inter-modal semantic collaboration. Concretely, we design a dual-path representation network for intra-modal semantic modeling, and develop a semantic collaborative network to achieve multi-granularity semantic alignment and hierarchical semantic supplement. Effective moment localization can thereby be achieved through sufficient semantic collaborative learning. Extensive comparison experiments demonstrate the promising performance of our model against existing state-of-the-art competitors.
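The abstract's notion of multi-granularity inter-modal alignment can be illustrated with a minimal sketch. Note that nothing below is taken from the paper itself: the cosine-similarity scoring, the feature dimensions, and all function names are illustrative assumptions about how clip-level (fine) and video-level (coarse) alignment between visual features and a query embedding is commonly scored in moment localization pipelines.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two equal-length vectors (small epsilon avoids division by zero).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def multi_granularity_alignment(clip_feats, query_feat):
    """Score a query against a video at two granularities (illustrative only).

    clip_feats : (T, D) array of per-clip visual features.
    query_feat : (D,) sentence-level query embedding.
    Returns (per-clip scores, video-level score).
    """
    per_clip = np.array([cosine(c, query_feat) for c in clip_feats])  # fine-grained alignment
    video_feat = clip_feats.mean(axis=0)                              # coarse, video-level summary
    return per_clip, cosine(video_feat, query_feat)

# Toy usage: a video of 5 clips with 4-dimensional features.
rng = np.random.default_rng(0)
clips = rng.normal(size=(5, 4))
query = rng.normal(size=4)
scores, global_score = multi_granularity_alignment(clips, query)
best_clip = int(np.argmax(scores))  # candidate moment under this toy scoring
```

In a real model the two granularities would be fused (e.g., learned attention over clips conditioned on the video-level score) rather than returned separately; this sketch only shows why both levels carry complementary signal.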


Cited By

  • (2024) Breaking barriers of system heterogeneity. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence. 3789–3797. DOI: 10.24963/ijcai.2024/419
  • (2024) Unsupervised Video Moment Retrieval with Knowledge-Based Pseudo-Supervision Construction. ACM Transactions on Information Systems 43, 1 (2024), 1–26. DOI: 10.1145/3701229
  • (2024) Graph Convolutional Metric Learning for Recommender Systems in Smart Cities. IEEE Transactions on Consumer Electronics 70, 3 (2024), 5929–5941. DOI: 10.1109/TCE.2024.3411704
  • (2024) Gazing After Glancing: Edge Information Guided Perception Network for Video Moment Retrieval. IEEE Signal Processing Letters 31 (2024), 1535–1539. DOI: 10.1109/LSP.2024.3403533

Published In

ACM Transactions on Information Systems, Volume 42, Issue 2
March 2024
897 pages
EISSN: 1558-2868
DOI: 10.1145/3618075

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 November 2023
Online AM: 07 September 2023
Accepted: 19 August 2023
Revised: 25 June 2023
Received: 16 May 2022
Published in TOIS Volume 42, Issue 2

Author Tags

  1. Cross-modal moment localization
  2. intra-modal semantic understanding
  3. inter-modal semantic collaboration

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation (NSF) of China
  • NSF of Shandong Province
  • Key R&D Program of Shandong (Major scientific and technological innovation projects)
  • Alibaba Group through Alibaba Innovative Research Program
