Guiding Instruction-based Image Editing via Multimodal Large Language Models

Fu, Tsu-Jui; Hu, Wenze; Du, Xianzhi; Wang, William Yang; Yang, Yinfei; Gan, Zhe

arXiv:2309.17102 (cs)

[Submitted on 29 Sep 2023 (v1), last revised 5 Feb 2024 (this version, v2)]

Abstract: Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs. We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance. The editing model jointly captures this visual imagination and performs manipulation through end-to-end training. We evaluate various aspects of Photoshop-style modification, global photo optimization, and local editing. Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and our MGIE can lead to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency.

Comments:	ICLR'24 (Spotlight) ; Project at this https URL ; Code at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2309.17102 [cs.CV]
	(or arXiv:2309.17102v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2309.17102

Submission history

From: Tsu-Jui Fu
[v1] Fri, 29 Sep 2023 10:01:50 UTC (22,015 KB)
[v2] Mon, 5 Feb 2024 05:04:53 UTC (36,246 KB)
