Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

Feng, Weixi; He, Xuehai; Fu, Tsu-Jui; Jampani, Varun; Akula, Arjun; Narayana, Pradyumna; Basu, Sugato; Wang, Xin Eric; Wang, William Yang

arXiv:2212.05032 (cs)

[Submitted on 9 Dec 2022 (v1), last revised 28 Feb 2023 (this version, v3)]

Abstract: Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality and creative images, we observe that attribute binding and compositional capabilities remain major challenges, especially when multiple objects are involved. In this work, we improve the compositional skills of T2I models, specifically by achieving more accurate attribute binding and better image compositions. To do this, we incorporate linguistic structures into the diffusion guidance process, building on the controllability that manipulating cross-attention layers offers in diffusion-based T2I models. We observe that keys and values in cross-attention layers carry strong semantic meaning associated with object layouts and content. Therefore, we can better preserve the compositional semantics in the generated image by manipulating the cross-attention representations based on linguistic insights. Built upon Stable Diffusion, a SOTA T2I model, our structured cross-attention design is efficient and requires no additional training samples. We achieve better compositional skills in qualitative and quantitative results, leading to a 5-8% advantage in head-to-head user comparison studies. Lastly, we conduct an in-depth analysis to reveal potential causes of incorrect image compositions and justify the properties of cross-attention layers in the generation process.
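
The abstract does not spell out how the cross-attention representations are manipulated, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation. It assumes that one attention map is computed from the queries and a single key matrix, and that this map is then applied to several value matrices (for example, one per separately encoded phrase of the prompt) whose outputs are averaged. All function names, shapes, and the averaging rule are hypothetical and chosen for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_map(q, k):
    """Standard scaled dot-product attention weights: softmax(q k^T / sqrt(d))."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]


def structured_cross_attention(q, k, values):
    """Hypothetical variant: share one attention map across several value
    matrices and average the resulting outputs (an assumption made here,
    not a detail stated in the abstract)."""
    attn = attention_map(q, k)
    outputs = [matmul(attn, v) for v in values]
    n = len(outputs)
    rows, cols = len(outputs[0]), len(outputs[0][0])
    return [[sum(o[i][j] for o in outputs) / n for j in range(cols)]
            for i in range(rows)]


# Toy usage: one image query, two text tokens, two candidate value matrices.
q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v_full = [[1.0], [0.0]]    # stands in for values from the full prompt
v_phrase = [[0.0], [1.0]]  # stands in for values from one parsed phrase
out = structured_cross_attention(q, k, [v_full, v_phrase])
```

With a single value matrix the function reduces to ordinary cross-attention, which is why a change of this kind needs no retraining: only the inputs to existing layers differ.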

Comments:	ICLR 2023 Camera Ready version
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2212.05032 [cs.CV]
	(or arXiv:2212.05032v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2212.05032

Submission history

From: Weixi Feng
[v1] Fri, 9 Dec 2022 18:30:24 UTC (16,763 KB)
[v2] Sun, 29 Jan 2023 06:09:29 UTC (9,476 KB)
[v3] Tue, 28 Feb 2023 23:46:24 UTC (7,870 KB)
