Abstract
Open-vocabulary 3D object detection (OV-3DDet) aims to localize and recognize both seen and previously unseen object categories in any new 3D scene. While language and vision foundation models have succeeded on a variety of open-vocabulary tasks thanks to abundant training data, OV-3DDet faces a significant challenge because 3D training data is scarce. Although pioneering efforts have integrated vision-language model (VLM) knowledge into OV-3DDet learning, the full potential of these foundation models has yet to be exploited. In this paper, we unlock textual and visual wisdom to tackle open-vocabulary 3D detection by leveraging language and vision foundation models. Specifically, we use an object-detection vision foundation model for the zero-shot discovery of objects in images; these detections serve as initial seeds and filtering guidance for identifying novel 3D objects. Additionally, to align the 3D space with the powerful vision-language space, we introduce a hierarchical alignment approach, in which the 3D feature space is aligned with the vision-language feature space of a pretrained VLM at the instance, category, and scene levels. Through extensive experiments, we demonstrate significant improvements in accuracy and generalization, highlighting the potential of foundation models for advancing open-vocabulary 3D object detection in real-world scenarios.
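To make the hierarchical alignment idea concrete, the following is a minimal sketch, not the authors' exact formulation: it aligns per-instance 3D features with VLM features of matched 2D crops (instance level), per-category mean 3D features with category text embeddings (category level), and a pooled scene feature with a scene text embedding (scene level), each via a cosine-distance loss. The shapes, the mean-pooling choices, and the unweighted sum are all assumptions introduced for illustration.

```python
import numpy as np


def cosine_align_loss(a: np.ndarray, b: np.ndarray) -> float:
    """Mean (1 - cosine similarity) between corresponding rows of a and b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * b, axis=-1)))


def hierarchical_alignment_loss(
    feats_3d: np.ndarray,    # (N, D) per-instance 3D features
    feats_img: np.ndarray,   # (N, D) VLM features of the matched 2D crops
    labels: np.ndarray,      # (N,) category index of each instance
    text_embeds: np.ndarray, # (C, D) VLM text embeddings, one per category
    scene_text: np.ndarray,  # (D,) VLM embedding of a scene-level description
) -> float:
    # Instance level: each 3D feature is pulled toward its 2D crop feature.
    l_inst = cosine_align_loss(feats_3d, feats_img)

    # Category level: the mean 3D feature of each category present in the
    # scene is pulled toward that category's text embedding.
    cats = np.unique(labels)
    cat_means = np.stack([feats_3d[labels == c].mean(axis=0) for c in cats])
    l_cat = cosine_align_loss(cat_means, text_embeds[cats])

    # Scene level: the pooled scene feature is pulled toward the scene text.
    l_scene = cosine_align_loss(feats_3d.mean(axis=0)[None], scene_text[None])

    return l_inst + l_cat + l_scene
```

In a real system the three terms would typically be weighted and computed on normalized features from a trainable 3D backbone against frozen VLM embeddings, so that gradients move only the 3D feature space.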
Acknowledgments
This research work is partially supported by the Shanghai Science and Technology Program (Project No. 21JC1400600), and the Agency for Science, Technology and Research (A*STAR) under its MTC Programmatic Funds (Grant No. M23L7b0021).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Jiao, P., Zhao, N., Chen, J., Jiang, YG. (2025). Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15106. Springer, Cham. https://doi.org/10.1007/978-3-031-73195-2_22
Print ISBN: 978-3-031-73194-5
Online ISBN: 978-3-031-73195-2