
EasyInv: Toward Fast and Better DDIM Inversion

Ziyue Zhang¹, Mingbao Lin², Shuicheng Yan², Rongrong Ji¹

¹Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China
²Skywork AI, Singapore
zhangziyue@foxmail.com, linmb001@outlook.com, shuicheng.yan@kunlun-inc.com, rrji@xmu.edu.cn

arXiv:2408.05159v1 [cs.CV] 9 Aug 2024

Abstract

This paper introduces EasyInv, an easy yet novel approach that significantly advances the field of DDIM Inversion by addressing the inherent inefficiencies and performance limitations of traditional iterative optimization methods. At the core of our EasyInv is a refined strategy for approximating inversion noise, which is pivotal for enhancing the accuracy and reliability of the inversion process. By prioritizing the initial latent state, which encapsulates rich information about the original image, EasyInv steers clear of iterative refinement of noise items. Instead, we introduce a methodical aggregation of the latent state from the preceding time step with the current state, effectively increasing the influence of the initial latent state and mitigating the impact of noise. We illustrate that EasyInv is capable of delivering results that are either on par with or exceed those of the conventional DDIM Inversion approach, especially under conditions where the model's precision is limited or computational resources are scarce. Concurrently, our EasyInv offers an approximately threefold improvement in inference efficiency over off-the-shelf iterative optimization techniques.

[Figure 1: Performance comparison of inversion methods including vanilla DDIM Inversion (Couairon et al. 2023), Fixed-Point Iteration (Pan et al. 2023), ReNoise (Garibi et al. 2024) and our EasyInv. The proposed EasyInv performs well upon different diffusion models of SD-V1-4 and SD-XL.]

Introduction

Diffusion models have become a major focus of research in recent years, mostly renowned for their ability to generate high-quality images that closely match given prompts. Among the many diffusion models introduced in the community, Stable Diffusion (SD) (Rombach et al. 2022) stands out as one of the most widely utilized in scientific research, largely due to its open-source nature. Another contemporary diffusion model gaining popularity is DALL-E 3 (Betker et al. 2023), which offers users access to its API and the ability to interact with it through platforms like ChatGPT (OpenAI 2024). These models have significantly transformed the visual arts industry and have attracted substantial attention from the research community.

While renowned generative diffusion models have made significant strides, a prevalent limitation is their reliance on textual prompts for input. This approach becomes restrictive when users seek to iteratively refine an image, as the sole reliance on prompts hinders flexibility. Although solutions such as ObjectAdd (Zhang, Lin, and Ji 2024) and P2P (Hertz et al. 2022) have been proposed to address image editing challenges, they are still confined to the realm of prompted image manipulation. Given that diffusion models generate images from noise inputs, a potential breakthrough lies in identifying the corresponding noise for any given image. This would enable the diffusion model to initiate the generation process from a known starting point, thereby allowing precise control over the final output. The recent innovation of DDIM Inversion (Couairon et al. 2023) aims to overcome this challenge by reversing the denoising process to introduce noise. This technique effectively retrieves the initial noise configuration after a series of reference steps, thereby preserving the integrity of the original image while affording the user the ability to manipulate the output by adjusting the denoising parameters. With DDIM Inversion, the generative process becomes more adaptable, facilitating the creation and subsequent editing of images with greater precision and control. For example, the MasaCtrl method (Cao et al. 2023) first transforms a real image into a noise representation and then identifies the arrangement of objects during the denoising phase. Portrait Diffusion (Liu et al. 2023) simultaneously inverts both the source and target images, and then merges their respective Q, K and V values for mixups.

Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Considering the reliance on inversion techniques to preserve the integrity of the input image, the quality of the inversion process is paramount, as it profoundly influences subsequent tasks. As depicted in Figure 1(a), the performance of DDIM Inversion has been found to be less than satisfactory due to the discrepancy between the noise estimated during the inversion process and the noise expected in the sampling process. Consequently, numerous studies have been conducted to enhance its efficacy. In Null-Text Inversion (Mokady et al. 2023), researchers observed that, using a null prompt as input, the diffusion model could generate optimal results during inversion, suggesting that improvements to inversion might be better achieved in the reconstruction branch. Ju et al.'s work (Ju et al. 2023) exemplifies this approach by calculating the distance between latents at the current step and the previous step. PTI (Dong et al. 2023) opts to update the conditional vector at each step to guide the reconstruction branch for improving consistency.

[Figure 2: Visualization of the latent states midway through all denoising steps for various inversion methods. Our EasyInv shows its enhanced convergence by closely approximating the original image.]
ReNoise (Garibi et al. 2024) focuses on refining the inversion process itself. This method iteratively adds and then denoises noise at each time step, using the denoised noise as input for the subsequent iteration. However, as shown in Figure 1(b), it can produce a black image output when dealing with certain special inputs, which will be discussed in detail in the qualitative results. Pan et al. (Pan et al. 2023), while maintaining the iterative updating process, also amalgamated noise from previous steps with the current step's noise. However, this method's performance is limited on less effective models, as displayed in Figure 1(c). For instance, it performs well on SD-XL (Podell et al. 2023) but fails to yield proper results on SD-V1-4 (Rombach et al. 2022). We attribute this to the method's sole focus on optimizing noise; when the noise is highly inaccurate, such simple optimization strategies encounter difficulties. Additionally, the iterative updating of noise is time-consuming, as Pan et al.'s method requires multiple model inferences per time step.

In this paper, we conduct an in-depth analysis and recognize that the foundation of any inversion process is the initial latent state derived from a real image. Errors introduced at each step of the inversion process can accumulate, leading to a suboptimal reconstruction. Current methodologies, which focus on optimizing the transition between successive steps, may not be adequate to address this issue holistically. To tackle this, we propose a novel approach that considers the inversion process as a whole, underscoring the significance of the initial latent state throughout the process. Our approach, named EasyInv, incorporates a straightforward mechanism to periodically reinforce the influence of the initial latent state during the inversion. This is realized by blending the current latent state with the previous one at strategically selected intervals, thereby increasing the weight of the initial latent state and diminishing the noise's impact. As a result, EasyInv ensures a reconstructed version that remains closer to the original image, as illustrated in Figure 1(d). Furthermore, by building upon the traditional DDIM Inversion framework (Couairon et al. 2023), EasyInv does not depend on iterative optimization between adjacent steps, thus enhancing computational efficiency. In Figure 2, we present a visualization of the latent states at the midpoint of the total denoising steps for various inversion methods. It is evident that the outcomes of our EasyInv are more closely aligned with the original image compared to all other methods, demonstrating that EasyInv achieves faster convergence.

Related Works

Diffusion Model. In recent years, there has been significant progress in the field of generative models, with diffusion models emerging as a particularly popular approach. The seminal denoising diffusion probabilistic models (DDPM) (Ho, Jain, and Abbeel 2020) introduced a practical framework for image generation based on the diffusion process. This method stands out from its predecessors, such as generative adversarial networks (GANs), due to its iterative nature. During the data preparation phase, Gaussian noise is incrementally added to a real image until it transitions into a state that is indistinguishable from raw Gaussian noise. A model can then be trained to predict the noise added at each step, enabling users to input any Gaussian noise and obtain a high-quality image as a result. Ho et al. (Ho, Jain, and Abbeel 2020) provided a robust theoretical foundation for their model, which has facilitated further advancements. The generative process in DDPM is both time-consuming and inherently stochastic due to the random noise introduced at each step. To address these limitations, the denoising diffusion implicit models (DDIM) were developed (Song, Meng, and Ermon 2020). By reformulating DDPM, DDIM successfully reduces the amount of random noise added at each step, resulting in a more deterministic denoising process. Furthermore, the absence of random noise allows for the aggregation of several denoising steps, thereby significantly reducing the overall computation time required to generate an image.

Image Inversion. Converting a real image into noise is a pivotal first step in the realm of real-image editing with diffusion models. The precision of this process has a profound impact on the final edit, with the critical element being the accurate identification of the noise added at each step. Couairon et al. (Couairon et al. 2023) ingeniously swapped the roles of independent and implicit variables within the denoising function of the DDIM model, enabling it to predict the noise that should be introduced to the current latents. However, it is essential to recognize that the denoising step in a diffusion model is inherently an approximation, and when this approximation is utilized inversely, discrepancies between the model's output and the actual noise value are likely to be exacerbated. To address this issue, ReNoise (Garibi et al. 2024) iterates through each noising step multiple times. For each inversion step, they employ an iterative approach to add and subsequently reduce noise, with the noise reduced in the final iteration being carried forward to the subsequent iteration. Pan et al. (Pan et al. 2023) offered a theoretical underpinning to the ReNoise method: iterative optimization of this sort is classified under the umbrella of fixed-point iteration methods. Building upon Anderson's seminal work (Anderson 1965), Pan et al. advanced the field by proposing a novel method for optimizing noise during the inversion process.

Methodology

Preliminaries

DDIM Inversion. Let z_T denote a noise tensor with z_T ~ N(0, I). The DDIM (Couairon et al. 2023) leverages a pre-trained neural network ε_θ to perform T denoising diffusion steps. Each step aims to estimate the underlying noise and subsequently restore a less noisy version of the tensor, z_{t−1}, from its noisy counterpart z_t as:

z_{t−1} = √(α_{t−1}/α_t) z_t + (√(1/α_{t−1} − 1) − √(1/α_t − 1)) · ε_θ(z_t, t, τ_θ(y)),  (1)

where t = T → 1, and {α_t}_{t=1}^{T} constitutes a prescribed variance schedule that guides the diffusion process. Furthermore, τ_θ serves as an intermediate representation that encapsulates the textual condition y. For the convenience of the following sections, we denote:

d(z_t) = ε_θ(z_t, t, τ_θ(y)).  (2)

Re-evaluating Eq. (1), we derive the DDIM Inversion process (Couairon et al. 2023) as presented in Eq. (3). In this reformulation, we relocate an approximate z*_t to the left-hand side, resulting in the following expression:

z*_t = g(ε_θ(z*_{t−1}, t−1, τ_θ(y)))
     = √(α_t/α_{t−1}) z*_{t−1} − √(α_t/α_{t−1}) (√(1/α_{t−1} − 1) − √(1/α_t − 1)) · ε_θ(z*_{t−1}, t−1, τ_θ(y)).  (3)

Review. Given an image I*, after encoding it into the latent z*_0, we initiate T inversion steps using Eq. (3) to obtain the noise z*_T. Starting with z_T = z*_T, we proceed with the denoising process in Eq. (1) to infer an approximate reconstruction z_0 that resembles the original latent z*_0. The primary source of error in this reconstruction arises from the difference between the noise predicted during the inversion process, ε_θ(z*_{t−1}, t−1, τ_θ(y)), and the noise expected in the sampling process, ε_θ(z_t, t, τ_θ(y)), denoted as ε_t, at each iterative step. This discrepancy originates from an imprecise approximation of the time step from t to t−1. Therefore, reducing the discrepancy between the predicted noises at each step is crucial for achieving an accurate reconstruction, which is essential for the success of subsequent image editing tasks. For simplicity in the following expressions, we define:

ε*_t = ε_θ(z*_{t−1}, t−1, τ_θ(y)),   ε_t = ε_θ(z_t, t, τ_θ(y)).  (4)

Fixed-Point Iteration

The vanilla DDIM Inversion method, as discussed, involves an approximation that is not entirely precise for ε*_t. To address this, researchers have sought a more accurate approximation of ε*_t, thereby ensuring that the desired condition is optimally met. This refinement aims to enhance the precision of the method, leading to more reliable results in application:

ε*_t = ε_t.  (5)

For clarity, let us first restate Eq. (3) as:

z*_t = g(ε*_t),  (6)

which represents the introduction of noise to the latent state z*_{t−1}. Under the assumption of Eq. (5), it should be the case that:

z_t = z*_t.  (7)

Subsequently, by employing the noise estimation function from Eq. (2), we obtain:

d(z_t) = d(g(ε*_t)).  (8)

Given that d(z_t) = ε_t and considering Eq. (5), we can deduce that:

ε*_t = d(g(ε*_t)).  (9)

This formulation presents a fixed-point problem, which pertains to a value that remains unchanged under a specific transformation (Bauschke et al. 2011). In the context of functions, a fixed point is an element that is invariant under the application of the function. In this paper, we seek an ε*_t that, when transformed by g and then by d, maps back to itself, signifying an optimal solution as per Eq. (5).

Fixed-point iteration is a computational technique designed to identify the fixed points of a function. It operates through an iterative process, as delineated below:

(ε*_t)^n = d(g((ε*_t)^{n−1})),  (10)

where n denotes the iteration count. This iterative process can be enhanced through acceleration techniques such as Anderson acceleration (Anderson 1965). However, calculating a complex ε*_t can be quite onerous. An empirical acceleration method (Pan et al. 2023) introduces a refinement for ε*_t by setting (ε*_t)^n = ε_θ((z*_t)^n, t−1, τ_θ(y)) and (ε*_t)^{n−1} = ε_θ((z*_t)^{n−1}, t−1, τ_θ(y)). They finally reach:

(z*_t)^{n+1} = g(0.5 · (ε*_t)^n + 0.5 · (ε*_t)^{n−1}),  (11)

where the term 0.5·(ε*_t)^n + 0.5·(ε*_t)^{n−1} represents the refinement of ε*_t suggested by Pan et al. If we were to apply the function d to both sides of Eq. (11), it would align perfectly with the form of Eq. (10). Their experiments have demonstrated that this approach is more effective in inversion tasks than both Anderson's method (Anderson 1965) and other techniques.

Despite the progress made, this paper acknowledges inherent limitations in the practical implementation of this inversion technique. (1) Inversion Efficiency: While the method outlined in Eq. (11) has shown improvements over traditional fixed-point iteration, it still relies on iterative optimization. The need for multiple forward passes through the diffusion model is computationally demanding and can result in inefficiencies in downstream applications. (2) Inversion Performance: The theoretical improvements presented assume that ε*_t = ε_t. However, iterative optimization does not guarantee the exact fulfillment of Eq. (7) for every time step t. Therefore, while the method may theoretically offer superior performance, cumulative errors can sometimes lead to practical outcomes that are less satisfactory than those achieved with the standard DDIM Inversion method, as shown in Figure 1.

EasyInv

To facilitate our subsequent analysis, we introduce the notation ᾱ_t to represent √(α_t/α_{t−1}) and β̄_t to denote √(α_t/α_{t−1}) (√(1/α_t − 1) − √(1/α_{t−1} − 1)). With these notations, we can reframe Eq. (3) as:

z*_t = ᾱ_t z*_{t−1} + β̄_t ε*_t.  (12)

Similarly, we can express z*_{t−1} as:

z*_{t−1} = ᾱ_{t−1} z*_{t−2} + β̄_{t−1} ε*_{t−1}.  (13)

By combining these two formulas, we derive:

z*_t = ᾱ_t ᾱ_{t−1} z*_{t−2} + ᾱ_t β̄_{t−1} ε*_{t−1} + β̄_t ε*_t.  (14)

This can be further generalized to:

z*_t = (∏_{i=1}^{t} ᾱ_i) z*_0 + ∑_{i=1}^{t} (β̄_i ∏_{j=i+1}^{t} ᾱ_j) ε*_i.  (15)

From Eq. (15), it is evident that z*_t is a weighted sum of z*_0 and a series of noise terms ε*_i. The denoising process of Eq. (1) aims to iteratively reduce the impact of these noise terms. In prior research, the crux of inversion is to introduce the appropriate noise ε*_i at each step so as to identify a suitable z*_t, which allows the model to obtain z_0 as the final output after the denoising process. However, iteratively updating ε*_i can be time-consuming, and when the model lacks high precision, achieving satisfactory results within a reasonable number of iterations may be challenging.

To address this, we propose an alternative perspective. During inversion, rather than searching for better noise, we aggregate the latent state from the last time step, z*_{t̄−1}, with the current latent state z*_{t̄} at specific time steps t̄, as illustrated in the following formula:

z*_{t̄} = η z*_{t̄} + (1 − η) z*_{t̄−1},  (16)

where η is a trade-off parameter, typically set to η ≥ 0.7. The selection of t̄ is discussed in the Experimentation section. This approach effectively increases the weight of z*_0 in z*_{t̄}, since:

z*_{t̄−1} − z*_{t̄} = (∏_{i=1}^{t̄−1} ᾱ_i)(1 − ᾱ_{t̄}) z*_0 + (1 − ᾱ_{t̄}) ∑_{i=1}^{t̄−1} (β̄_i ∏_{j=i+1}^{t̄−1} ᾱ_j) ε*_i + (−β̄_{t̄}) ε*_{t̄}.  (17)

Given that 0 < ᾱ_i < 1 for i = 1 → t̄, it follows that (∏_{i=1}^{t̄−1} ᾱ_i)(1 − ᾱ_{t̄}) > 0. Consequently, in comparison to z*_{t̄}, z*_{t̄−1} carries a higher proportion of z*_0 and is, therefore, less susceptible to the influence of noise. Our approach thus accentuates, within z*_{t̄}, the significance of the initial latent state z*_0, which encapsulates the most comprehensive information about the original image.

[Figure 3: A visual assessment of various inversion techniques utilizing the SD-XL model.]

Experimentation

We compare our EasyInv against the vanilla DDIM Inversion (Couairon et al. 2023), ReNoise (Garibi et al. 2024), and Pan et al.'s method (Pan et al. 2023) (referred to as Fixed-Point Iteration), using SD-V1-4 and SD-XL on one NVIDIA GTX 3090 GPU.

For Fixed-Point Iteration (Pan et al. 2023), we re-implemented it using the settings from the paper, as the source code is unavailable. We set the data type of all methods to float16 by default to improve efficiency. The inversion and denoising steps are T = 50, except for Fixed-Point Iteration, which recommends T = 20. For our EasyInv, we use 0.85·T < t̄ < 0.95·T and η = 0.8 with the SD-XL framework, and 0.05·T < t̄ < 0.25·T and η = 0.5 with SD-V1-4, due to the varying capacities of the two models.

For quantitative comparison, we use three major metrics: the LPIPS index (Zhang et al. 2018), SSIM (Wang et al. 2004), and PSNR. The LPIPS index uses a pre-trained VGG16 (Simonyan and Zisserman 2015) to compare image pairs; SSIM and PSNR measure image similarity. We also report inference time. We randomly sample 2,298 images from the COCO 2017 test and validation sets (Lin et al. 2014). With the well-trained SD-XL model, error accumulation is minimal, making all methods perform similarly; we therefore display results using the SD-V1-4 model.

Quantitative Results

Table 1 presents the quantitative results of the different methods. EasyInv achieves a competitive LPIPS score of 0.321, close to ReNoise (0.316) and better than DDIM Inversion (0.328) and Fixed-Point Iteration (0.373), indicating close perceptual similarity to the original image. For SSIM, EasyInv achieves the highest score of 0.646, showing superior structural similarity, which is crucial for maintaining image coherence.
Table 1: A comparative analysis of quantitative outcomes utilizing the SD-V1-4 model.

Method                 | LPIPS (↓) | SSIM (↑) | PSNR (↑) | Time (↓)
DDIM Inversion         | 0.328     | 0.621    | 29.717   | 5s
ReNoise                | 0.316     | 0.641    | 31.025   | 16s
Fixed-Point Iteration  | 0.373     | 0.563    | 29.107   | 14s
EasyInv (Ours)         | 0.321     | 0.646    | 30.189   | 5s

Table 2: A comparative analysis of half- and full-precision EasyInv utilizing SD-V1-4.

Precision       | LPIPS (↓) | SSIM (↑) | PSNR (↑) | Time (↓)
Full Precision  | 0.321     | 0.646    | 30.184   | 9s
Half Precision  | 0.321     | 0.646    | 30.189   | 5s
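The 5-second runtime in Table 1 reflects that EasyInv adds only the convex blend of Eq. (16) to the vanilla DDIM Inversion loop, with no extra network passes. A schematic sketch on a 1-D toy latent, with the blending window set per the reported SD-XL range 0.85·T < t̄ < 0.95·T (the schedule and the linear `eps_theta` are illustrative assumptions, not the actual SD components):

```python
import math

T = 50
alpha = [1.0 - 0.019 * t for t in range(T + 1)]  # toy decreasing schedule

def eps_theta(z: float, t: int) -> float:
    """Toy noise predictor standing in for the diffusion U-Net."""
    return 0.2 * z + 0.005 * t

def invert_easyinv(z0: float, eta: float = 0.8,
                   blend_lo: int = 43, blend_hi: int = 47) -> float:
    """Vanilla DDIM Inversion (Eq. (12)) plus the Eq. (16) blend
    on the selected steps t-bar in [blend_lo, blend_hi]."""
    z = z0
    for t in range(1, T + 1):
        abar = math.sqrt(alpha[t] / alpha[t - 1])
        bbar = abar * (math.sqrt(1.0 / alpha[t] - 1.0)
                       - math.sqrt(1.0 / alpha[t - 1] - 1.0))
        z_next = abar * z + bbar * eps_theta(z, t - 1)   # Eq. (12)
        if blend_lo <= t <= blend_hi:
            # Eq. (16): pull the current latent toward the previous one,
            # which carries a higher proportion of z*_0.
            z_next = eta * z_next + (1.0 - eta) * z
        z = z_next
    return z  # toy stand-in for the inverted noise z*_T
```

Setting `eta = 1.0` disables the blend and recovers plain DDIM Inversion, which makes the single-line nature of the modification explicit.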

EasyInv's PSNR of 30.189 is close to ReNoise's highest score of 31.025, indicating high image fidelity. EasyInv also completes the inversion process in the fastest time of 5 seconds, matching DDIM Inversion and significantly quicker than ReNoise (16 seconds) and Fixed-Point Iteration (14 seconds), highlighting its efficiency without compromising quality. In summary, EasyInv performs strongly across all metrics, with the highest SSIM score indicating effective preservation of image structure. Its efficient inversion makes it highly suitable for real-world applications where both quality and speed are crucial.

Table 2 compares EasyInv's performance in half-precision (float16) and full-precision (float32) formats. Both achieve the same LPIPS score of 0.321, indicating consistent perceptual similarity to the original image. Similarly, both achieve an SSIM score of 0.646, showing preserved structural integrity. For PSNR, half precision slightly outperforms full precision, with scores of 30.189 and 30.184. This slight advantage for half precision is noteworthy given its much-reduced computation time. The most significant difference is observed in the time metric, where half precision completes the inversion process in 5 seconds, approximately 44% faster than full precision, which takes 9 seconds. This efficiency gain highlights that EasyInv is well suited to half precision, offering faster speeds and reduced resource use without compromising output quality.

[Figure 4: A visual assessment of various inversion techniques utilizing the SD-V1-4 model.]

Qualitative Results

We visually evaluate all methods using SD-XL and SD-V1-4. Figure 3 presents a comparison of several examples across all methods utilizing SD-XL. ReNoise struggles with images containing significant white areas, producing black images. The other two methods also perform poorly, as is especially evident in the clock example. Figure 4 displays the results obtained from SD-V1-4 using images sourced from the internet. These images also feature large areas of white color. ReNoise consistently produces black images with these inputs, indicating an issue inherent to the method rather than the model. Fixed-Point Iteration and DDIM Inversion also fail to generate satisfactory results in such cases, suggesting these images pose challenges for inversion methods in general. Our method, shown in the figure, effectively addresses these challenges, demonstrating robustness and enhanced performance in handling special scenarios. These findings underscore the efficacy of our approach, particularly on challenging cases that are less common in the COCO dataset.

Figure 5 presents more visual results of our method, with original images exclusively obtained from the COCO dataset (Lin et al. 2014). The results are unequivocal: our approach consistently generates images that closely resemble their originals after inversion and reconstruction. The variety of categories represented in these images underscores the broad applicability and consistent performance of our method. In aggregate, these findings affirm that our technique is not merely efficient but also remarkably robust, adeptly reconstructing images with a high level of precision and clarity.

Downstream Image Editing

To showcase the practical utility of our EasyInv, we have employed various inversion techniques within the realm of consistent image synthesis and editing. We have seamlessly integrated these inversion methods into MasaCtrl (Cao et al. 2023), a widely adopted image editing approach that extracts correlated local content and textures from source images to ensure consistency. For demonstrative purposes, we present an image of a "peach" alongside the prompt "A football."
[Figure 5: More visual results of our EasyInv utilizing the SD-V1-4 model.]

The impact of inversion quality is depicted in Figure 6. In these instances, we utilize the inverted latents of the "peach" image, as shown in Figure 4, as the input for MasaCtrl (Cao et al. 2023). Our goal is to generate an image of a football that retains the distinctive features of the "peach" image. As evident from Figure 6, our EasyInv achieves superior texture quality and a shape most closely resembling that of a football. From our perspective, images with extensive white areas constitute a significant category in actual image editing, given that they are a prevalent characteristic of conventional photography. However, such features often prove detrimental to the ReNoise method. Thus, for authentic image editing scenarios, our approach stands out as a preferable alternative, not to mention its commendable efficiency.

[Figure 6: Results of MasaCtrl (Cao et al. 2023) with prompt "A football", using inverted latents generated by different methods as input.]

Conclusion

Our EasyInv presents a significant advancement in the field of DDIM Inversion by addressing the inefficiencies and performance limitations of traditional iterative optimization methods. By emphasizing the importance of the initial latent state and introducing a refined strategy for approximating inversion noise, EasyInv enhances both the accuracy and efficiency of the inversion process. Our method strategically reinforces the initial latent state's influence, mitigating the impact of noise and ensuring a reconstruction closer to the original image. This approach not only matches but often surpasses the performance of existing DDIM Inversion methods, especially in scenarios with limited model precision or computational resources. EasyInv also demonstrates a remarkable improvement in inference efficiency, achieving approximately three times faster processing than standard iterative techniques. Through extensive evaluations, we have shown that EasyInv consistently delivers high-quality results, making it a robust and efficient solution for image inversion tasks. The simplicity and effectiveness of EasyInv underscore its potential for broader applications, promoting greater accessibility and advancement in the field of diffusion models.

Limitations

One potential risk associated with our approach is the phenomenon known as "over-denoising," which occurs when there is a disproportionate focus on achieving a pristine final-step latent state. This can occasionally result in overly smooth image outputs, as exemplified by the "peach" figure in Figure 4. In the context of most real-world image editing tasks, this is not typically an issue, as these tasks often involve style migration, which inherently alters the details of the original image. However, in specific applications, such as using diffusion models for creating advertisements, this could pose a challenge. Nonetheless, our experimental results highlight that the method's two key benefits significantly outweigh this minor shortcoming. First, it is capable of delivering satisfactory outcomes even with models on which other methods under-perform, as shown in the above experiments. Second, it enhances inversion efficiency by reverting to the original DDIM Inversion baseline (Couairon et al. 2023), thereby eliminating the need for iterative optimizations. This strategy not only simplifies the process but also maintains high-quality outputs, marking a noteworthy advancement over current methodologies.

In conclusion, our research has made significant strides with the introduction of EasyInv. Looking ahead, our future research agenda will focus on the continued enhancement and optimization of the techniques in this paper, with the ultimate goal of ensuring that our methodology is not only robust and efficient but also highly adaptable to the diverse and ever-evolving needs of industrial applications.

References

Anderson, D. G. 1965. Iterative procedures for nonlinear integral equations. Journal of the ACM.

Bauschke, H. H.; Burachik, R. S.; Combettes, P. L.; Elser, V.; Luke, D. R.; and Wolkowicz, H. 2011. Fixed-point algorithms for inverse problems in science and engineering, volume 49. Springer Science & Business Media.

Betker, J.; Goh, G.; Jing, L.; Brooks, T.; Wang, J.; Li, L.; Ouyang, L.; Zhuang, J.; Lee, J.; Guo, Y.; et al. 2023. Improving image generation with better captions. Computer Science.

Cao, M.; Wang, X.; Qi, Z.; Shan, Y.; Qie, X.; and Zheng, Y. 2023. MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing. In International Conference on Computer Vision.

Couairon, G.; Verbeek, J.; Schwenk, H.; and Cord, M. 2023. DiffEdit: Diffusion-based Semantic Image Editing with Mask Guidance. In International Conference on Learning Representations.

Dong, W.; Xue, S.; Duan, X.; and Han, S. 2023. Prompt tuning inversion for text-driven image editing using diffusion models. In International Conference on Computer Vision.

Garibi, D.; Patashnik, O.; Voynov, A.; Averbuch-Elor, H.; and Cohen-Or, D. 2024. ReNoise: Real Image Inversion Through Iterative Noising. arXiv preprint arXiv:2403.14602.

Hertz, A.; Mokady, R.; Tenenbaum, J.; Aberman, K.; Pritch, Y.; and Cohen-Or, D. 2022. Prompt-to-Prompt Image Editing with Cross-Attention Control. In International Conference on Learning Representations.

Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems.
Ju, X.; Zeng, A.; Bian, Y.; Liu, S.; and Xu, Q. 2023. Direct
Inversion: Boosting Diffusion-based Editing with 3 Lines of
Code. In International Conference on Learning Representa-
tions.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ra-
manan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft
COCO: Common Objects in Context. In European Confer-
ence on Computer Vision.
Liu, J.; Huang, H.; Jin, C.; and He, R. 2023. Portrait Diffu-
sion: Training-free Face Stylization with Chain-of-Painting.
arXiv preprint arXiv:2312.02212.
Mokady, R.; Hertz, A.; Aberman, K.; Pritch, Y.; and Cohen-
Or, D. 2023. Null-text inversion for editing real images us-
ing guided diffusion models. In Computer Vision and Pat-
tern Recognition.
OpenAI. 2024. ChatGPT. https://chat.openai.com/. Last accessed on 2024-2-27.
Pan, Z.; Gherardi, R.; Xie, X.; and Huang, S. 2023. Effec-
tive Real Image Editing with Accelerated Iterative Diffusion
Inversion. In International Conference on Computer Vision.
Podell, D.; English, Z.; Lacey, K.; Blattmann, A.; Dockhorn,
T.; Müller, J.; Penna, J.; and Rombach, R. 2023. SDXL: Im-
proving Latent Diffusion Models for High-Resolution Image
Synthesis. In International Conference on Learning Repre-
sentations.
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Om-
mer, B. 2022. High-resolution image synthesis with latent
diffusion models. In Computer Vision and Pattern Recogni-
tion.
Simonyan, K.; and Zisserman, A. 2015. Very deep convolu-
tional networks for large-scale image recognition. In Inter-
national Conference on Learning Representations.
Song, J.; Meng, C.; and Ermon, S. 2020. Denoising Diffu-
sion Implicit Models. In International Conference on Learn-
ing Representations.
Wang, Z.; Bovik, A. C.; Sheikh, H. R.; and Simoncelli, E. P.
2004. Image quality assessment: from error visibility to
structural similarity. IEEE transactions on image process-
ing.
Zhang, R.; Isola, P.; Efros, A. A.; Shechtman, E.; and Wang,
O. 2018. The unreasonable effectiveness of deep features as
a perceptual metric. In Computer Vision and Pattern Recog-
nition.
Zhang, Z.; Lin, M.; and Ji, R. 2024. ObjectAdd: Adding Ob-
jects into Image via a Training-Free Diffusion Modification
Fashion. arXiv preprint arXiv:2404.17230.
