Fisher Information in Conditional Diffusion
{songky7, laihanj3}@[Link]
Abstract
Training-free conditional diffusion models have received great attention in conditional image generation
tasks. However, they require a computationally expensive conditional score estimator to steer the
intermediate results of each step in the reverse process toward the condition, which causes slow
conditional generation. In this paper, we propose a novel Fisher information-based conditional
diffusion (FICD) model to generate high-quality samples according to the condition. In particular,
we explore the conditional term from the perspective of Fisher information and show that Fisher
information can act as a weight that measures the informativeness of the condition in each generation
step. From this perspective, we can control and gain more information along the conditional direction
in the generation space. We therefore use the upper bound of the Fisher information to reformulate the
conditional term, which increases the information gain and decreases the time cost. Experimental
results demonstrate that the proposed FICD can offer up to 2x speed-ups under the same sampling steps
as most baselines. Meanwhile, FICD can improve the generation quality in various tasks compared to the
baselines with a low computation cost.
1 Introduction
Recently, unconditional diffusion models (DMs) [1, 2, 3] have shown great success in image generation tasks.
However, users often want to generate images with specific properties. Conditional diffusion models [4], which
incorporate conditions to generate the desired properties, have come to play a crucial role in various generation
tasks, e.g., text-to-image generation [5, 6, 7], style-driven generation [8], and image edit-driven generation [9, 4, 10].
Many researchers [11, 9] have proposed to reuse or fine-tune mature unconditional diffusion models to improve
conditional generation tasks.
From unconditional to conditional diffusion models [12], the main problem is how to incorporate conditional
information into the diffusion model to guide the sample generation. A new conditional score estimator [13, 5] should
be introduced into the diffusion model. However, defining the conditional score estimator is not trivial, because
the diffusion model generates samples over multiple steps, and in each step the intermediate results are sampled
from a different noisy distribution. This hinders us from directly measuring the distance between the condition and
the intermediate results to generate the desired images. Training-based methods [5, 10] were first proposed to solve
this problem. For all time steps, they retrain a time-dependent conditional score estimator [4] to measure the distance
between the condition and each intermediate result. However, retraining the conditional score estimator requires a
large amount of computation.
Another line of research is training-free conditional diffusion models [12, 14, 9, 15, 16], which reuse the unconditional
score function for the conditional score estimator and do not require retraining. One way to make the conditional score
estimator both training-free and independent of the time step is posterior sampling [12]. Specifically, given the
t-th step intermediate result $x_t$ and the condition $c$, it removes the time dependence of the conditional term in
two steps: 1) Posterior mean. It first approximates the posterior mean $\hat{x}_0$ from $x_t$ by using the unconditional
score function; and 2) Measurement. It then uses the energy function [9] or other measures [14] to quantify the
relationship between the time-independent $\hat{x}_0$ and the condition $c$. Both steps are training-free: the first
reuses the unconditional score function, and the second uses the same measure for all time steps because $\hat{x}_0$ is
time-independent. In this way, the conditional term guides the intermediate results toward the condition. However,
the iterative generation process still causes slow conditional generation.
Previous methods alleviated the time cost by introducing additional hypotheses. For example, Cardoso et al. [16]
redefined the conditional term based on the Bayesian framework. RED [14] fine-tuned the start point of the reverse
process based on variational inference. Further, MPGD [15] introduced a linear manifold hypothesis to
decouple the dependency between the conditional term and the diffusion model.
In this paper, we offer a novel view based on the Fisher information [17] for the conditional score estimator, and a
novel Fisher information-based conditional diffusion (FICD) is proposed to reduce the time cost while maintaining
high-quality generation.
Concretely, following the two steps of posterior sampling in training-free methods, i.e., the posterior mean and the
measurement, we divide the conditional score estimator into 1) the posterior part (more details in Sec. 4) and
2) the measurement part (more details in Eq. 5). We then find that the posterior part can be redefined through the
Fisher information, which directly reflects the information gain. Based on this, the posterior part can be regarded
as a weight function for the measurement part that measures how much information about the condition is gained in
each time step. Following this view, we also find that the weight function may sometimes hinder the reverse process
from approaching the condition. This motivates us to use the upper bound of the Fisher information to increase the
overall information gain. With more information gained along the condition, the reverse process can generate images
closer to the condition. Meanwhile, the upper bound also removes the need to compute the diffusion model's gradients.
Therefore, FICD can accelerate conditional generation while maintaining high quality. To sum up, the main
contributions of this paper are:
• We propose a novel FICD that decreases the computation cost while increasing the generation quality. It draws
on information theory and uses the upper bound of the Fisher information to approximate the posterior
part.
• The proposed method provides a new perspective to understand and improve the existing training-free con-
ditional diffusion methods, where the key is to accumulate enough information gain for the condition in the
reverse process.
• The experimental results show that FICD accelerates the generation speed while maintaining high quality
compared with SOTA methods.
2 Related Work
Training-based conditional diffusion models. These methods aim to fine-tune the parameters of the score estimator
for different downstream tasks. For example, DreamBooth [11] directly fine-tuned the UNet of the diffusion model for
the condition. ControlNet [18] introduced an additional UNet and fine-tuned it while freezing the original one. Stable
Diffusion [5] introduced transformer layers that are retrained. Training-based methods can generate high-quality images
according to the condition, but the cost is high since the model must be fine-tuned again for each downstream
task.
Training-free conditional diffusion methods. Some training-free methods focus on utilizing the structure and
potential of the UNet. For example, Tumanyan et al. [8] leveraged the attention maps from attention layers. Jeong et
al. [19] found the “h-space” within the UNet of the diffusion model. Wu et al. [20] directly changed the inputs of the
UNet. Despite their success, these methods need a special design for each downstream task, which limits their
generalization.
The other methods [12, 21, 9, 22] focus on sampling from the posterior distribution. For example, DPS [12] first
used the distance norm as the metric for inverse problems. FreeDom [9] further used the energy function
as the metric, which extended posterior sampling from inverse problems to a wider range of downstream tasks such as
face ID generation [23]. Park et al. [24] leveraged the bias vector discovered in the latent space to guide the
diffusion models. For inverse problems, RED [14] and Cardoso et al. [16] fine-tuned the start point and rebuilt the
reverse process based on the Bayesian framework, respectively, to improve the generation quality and decrease the time
cost. DSG [25] alleviated the estimation bias of the condition in the reverse process to improve FreeDom, mainly for
inverse problems.
Recently, MPGD [15] has been proposed to cancel the posterior part based on the manifold assumption. Compared
to MPGD, FICD offers an additional view based on information theory to accelerate the sampling process
while maintaining high-quality generation. MPGD introduced a manifold hypothesis based on a linear assumption to
decouple the dependency, while our method makes no additional assumption. Meanwhile, we find that dropping
the score function's gradient inevitably loses useful information, a finding that differs from MPGD and that may lead
to unstable generation for some tasks.
Information theory with the diffusion models. In light of the SDE framework [2], which introduced the score function
to explain the behavior of DMs, information theory [17] shows potential for further improving DMs. Both
information-theoretic diffusion [17] and InfoDiffusion [26] leveraged mutual information to interpret the correlation
between the observed and hidden variables and thereby improve the generation quality. Interpretable diffusion [27]
further explained which parts of the input are most important for DMs, similar to CAM [28].
3 Preliminary
Training-free conditional generation. Training-free methods aim to solve conditional generation tasks without
retraining the unconditional diffusion model. Suppose that we have an unconditional diffusion model whose
unconditional score function at the t-th timestep is $\nabla_{x_t} \log p(x_t)$ [1]. To keep consistency with the SDE
formulation [2], we also denote $s_\theta(x_t, t) \approx \nabla_{x_t} \log p(x_t)$, where $s_\theta(x_t, t)$ is the
score function given by a neural network with parameters $\theta$. Now, given the condition $c$, the conditional score
function can be formulated as $\nabla_{x_t} \log p(x_t \mid c)$. The problem becomes how to define this conditional
score estimator.
By Bayesian rule [12], we have:
$$\nabla_{x_t} \log p(x_t \mid c) = \underbrace{\nabla_{x_t} \log p(x_t)}_{\text{Unconditional term}} + \underbrace{\nabla_{x_t} \log p(c \mid x_t)}_{\text{Conditional term}}. \tag{1}$$
The unconditional term is the unconditional score function $s_\theta(x_t, t)$. To estimate the conditional term, a
differentiable metric $\varepsilon(x_t, c)$ (also called the energy function [9]) is proposed to measure the distance
between $x_t$ and $c$:
$$p(c \mid x_t) = \frac{\exp(-\lambda\, \varepsilon(x_t, c))}{Z}, \tag{2}$$
where $\lambda$ is a temperature coefficient and $Z > 0$ is a normalizing constant.
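As a minimal sketch of how Eq. 2 turns an energy into a guidance direction, the following Python snippet uses a squared Euclidean distance as a stand-in energy. The function names, the choice of energy, and the values of `lam` and `log_z` are illustrative assumptions, not the metrics used in the experiments. It shows that the normalizing constant Z does not affect the gradient.

```python
import numpy as np

def energy(x, c):
    """Hypothetical energy: squared Euclidean distance between x and c."""
    return float(np.sum((x - c) ** 2))

def energy_grad(x, c):
    """Analytic gradient of the squared-distance energy with respect to x."""
    return 2.0 * (x - c)

def log_likelihood(x, c, lam=1.0, log_z=0.0):
    """log p(c | x) = -lam * energy(x, c) - log Z, following Eq. 2."""
    return -lam * energy(x, c) - log_z

def conditional_score(x, c, lam=1.0):
    """Gradient of log p(c | x) with respect to x; the constant log Z drops out."""
    return -lam * energy_grad(x, c)

x = np.array([1.0, -2.0])
c = np.array([0.0, 0.0])
score = conditional_score(x, c, lam=0.5)
print(score)
```

The score points from x toward c, so following it decreases the energy regardless of the value of Z.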
Following posterior sampling [9], we calculate the posterior mean $\hat{x}_{0|t}$ of $x_t$:
$$\hat{x}_{0|t} \approx \frac{1}{\sqrt{\hat{\alpha}_t}} \left( x_t + (1 - \hat{\alpha}_t)\, s_\theta(x_t, t) \right), \tag{3}$$
where $\hat{\alpha}_t$ is a known parameter of the noise schedule [1] at the t-th timestep. We then use $\hat{x}_{0|t}$
to measure the distance to the condition $c$ in the data domain instead of the noise domain [12]. Based on Eq. 2
and [12], we have:
$$\log p(c \mid x_t) \approx \log p(c \mid \hat{x}_{0|t}) \propto \varepsilon(\hat{x}_{0|t}, c). \tag{4}$$
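The posterior mean in Eq. 3 can be checked on a toy case where the score is known in closed form. The sketch below assumes the DDPM-style forward process $x_t = \sqrt{\hat{\alpha}_t}\, x_0 + \sqrt{1 - \hat{\alpha}_t}\, \epsilon$ with $x_0 \sim \mathcal{N}(0, I)$, so that $x_t \sim \mathcal{N}(0, I)$ and the exact score is $-x_t$. The toy data distribution and the helper names are assumptions made only for illustration.

```python
import numpy as np

def posterior_mean(x_t, score, alpha_hat_t):
    """Eq. 3: estimate x_0 from x_t using the unconditional score."""
    return (x_t + (1.0 - alpha_hat_t) * score) / np.sqrt(alpha_hat_t)

def toy_score(x_t):
    """Exact score of N(0, I); valid only under the toy assumption above."""
    return -x_t

alpha_hat_t = 0.25  # hypothetical noise-schedule value
x_t = np.array([2.0, -4.0])
x0_hat = posterior_mean(x_t, toy_score(x_t), alpha_hat_t)

# For this Gaussian toy, the exact conditional mean is sqrt(alpha_hat_t) * x_t.
exact = np.sqrt(alpha_hat_t) * x_t
print(x0_hat, exact)
```

In this toy setting Eq. 3 reproduces the exact conditional mean; with a learned score network it is only an approximation.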
Motivation. Based on Eq. 1-Eq. 4, we take the derivative of $\log p(c \mid x_t)$ with respect to $x_t$ and have
$$\nabla_{x_t} \log p(c \mid \hat{x}_{0|t}) = \frac{\partial \hat{x}_{0|t}}{\partial x_t} \frac{\partial \varepsilon(\hat{x}_{0|t}, c)}{\partial \hat{x}_{0|t}}. \tag{5}$$
According to Eq. 5, the derivative of the conditional term with respect to $x_t$ in training-free methods can be further
divided into two parts. The first part is the derivative of the posterior mean $\hat{x}_{0|t}$ with respect to $x_t$,
$\frac{\partial \hat{x}_{0|t}}{\partial x_t}$, which we call the posterior part. This part does not contain the
condition $c$. The second part is the derivative of the measurement function with respect to the posterior mean,
$\frac{\partial \varepsilon(\hat{x}_{0|t}, c)}{\partial \hat{x}_{0|t}}$. We refer to it as the measurement part, which
contains the condition $c$.
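The split in Eq. 5 can be illustrated numerically. The sketch below reuses a toy setting that is an assumption for illustration only: the score is $-x_t$ (data distributed as $\mathcal{N}(0, I)$) and the energy is a squared distance. It computes the posterior part and the measurement part separately and compares their product with a finite-difference gradient.

```python
import numpy as np

ALPHA_HAT_T = 0.25  # hypothetical noise-schedule value

def toy_score(x_t):
    """Exact score when x_t ~ N(0, I) (toy assumption)."""
    return -x_t

def posterior_mean(x_t):
    """Eq. 3 evaluated with the toy score."""
    return (x_t + (1.0 - ALPHA_HAT_T) * toy_score(x_t)) / np.sqrt(ALPHA_HAT_T)

def energy(x0_hat, c):
    """Hypothetical energy: squared distance to the condition."""
    return float(np.sum((x0_hat - c) ** 2))

def posterior_part(x_t):
    """Jacobian of the posterior mean w.r.t. x_t; sqrt(alpha_hat) * I for the toy score."""
    return np.sqrt(ALPHA_HAT_T) * np.eye(x_t.size)

def measurement_part(x0_hat, c):
    """Gradient of the energy with respect to the posterior mean."""
    return 2.0 * (x0_hat - c)

def chain_rule_grad(x_t, c):
    """Right-hand side of Eq. 5: posterior part applied to the measurement part."""
    return posterior_part(x_t).T @ measurement_part(posterior_mean(x_t), c)

def finite_difference_grad(x_t, c, h=1e-6):
    """Central-difference gradient of energy(posterior_mean(x_t), c) w.r.t. x_t."""
    grad = np.zeros_like(x_t)
    for i in range(x_t.size):
        step = np.zeros_like(x_t)
        step[i] = h
        grad[i] = (energy(posterior_mean(x_t + step), c)
                   - energy(posterior_mean(x_t - step), c)) / (2.0 * h)
    return grad

x_t = np.array([2.0, -4.0])
c = np.array([0.5, 0.5])
analytic = chain_rule_grad(x_t, c)
numeric = finite_difference_grad(x_t, c)
print(analytic, numeric)
```

With a neural score network the posterior part has no closed form and is typically obtained by backpropagating through the network, which is the expensive step that the rest of the paper aims to avoid.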
In this paper, we show that the Fisher information can explain the posterior part as information gain. Under this
view, the posterior part acts as a weight function for the measurement part and controls the accumulated
information about how close $x_t$ is to the condition $c$. We further show an interesting finding: increasing the
information gain of the posterior part increases the generation quality. To achieve this, we propose using
the upper bound of the Fisher information as the new posterior part, which also reduces the time
cost.
4 Methodology
In our work, we introduce the Fisher information to cancel the posterior part $\frac{\partial \hat{x}_{0|t}}{\partial x_t}$.
Specifically, we leverage the upper bound of the Fisher information to redefine the posterior part. We also provide a
view from information theory to explain how the Fisher information improves training-free conditional generation.
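Algorithm 1 The overall algorithm for FICD
Input: $c$, $T$, $s_\theta$, the noise schedule parameters $\hat{\alpha}_t$, $\beta_t$, the differentiable metric
function $\varepsilon$, and the hyperparameter $\rho_t$.
Output: $x_0$ ▷ The generated image based on $c$
1: $x_T \sim \mathcal{N}(0, 1)$
2: for $t$ in $[T - 1, ..., 1]$ do
3:   $\epsilon \sim \mathcal{N}(0, 1)$
4:   $x_{t-1} = (1 + 0.5\beta_t)\, x_t + \beta_t \nabla_{x_t} \log p(x_t) + \sqrt{\beta_t}\, \epsilon$
5:   $\hat{x}_{0|t} = \frac{1}{\sqrt{\hat{\alpha}_t}} (x_t + (1 - \hat{\alpha}_t)\, s_\theta(x_t, t))$ ▷ The MMSE estimation
6:   $g_t = \frac{2}{\sqrt{\hat{\alpha}_t}} \nabla_{\hat{x}_{0|t}} \varepsilon(\hat{x}_{0|t}, c)$
7:   $x_{t-1} = x_{t-1} - \rho_t g_t$
8: end for
Return: $x_0$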
[Figure: four panels showing $\|\nabla_x \log p(x)\|_2$; y-axis: Norm Value, x-axis: Time Steps.]
Cramér-Rao bound estimation. Interestingly, we can leverage the upper bound of $I(x_t)$, the Cramér-Rao bound
estimation, to cancel the posterior part:
Theorem 1. Given the sequence $\{x_T, x_{T-1}, ..., x_t, ..., x_1\}$, where $t \in [T, 0)$ and $x_T$ is the initial state
of the reverse process, $I(x_t)$ is bounded by the Cramér-Rao bound:
$$I(x_t) < \frac{1}{1 - \hat{\alpha}_t}. \tag{8}$$
We replace $I(x_t)$ by the Cramér-Rao bound directly. In the end, Eq. 6 can be estimated as:
$$\frac{\partial \hat{x}_{0|t}}{\partial x_t} \approx \frac{2}{\sqrt{\hat{\alpha}_t}}. \tag{9}$$
Eq. 9 shows that the upper bound of $I(x_t)$ cancels the posterior part, leaving a factor that depends only on the
noise-schedule parameters.
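One way to read the step from Eq. 8 to Eq. 9 is sketched below. The intermediate equations are not reproduced in this section, so the middle line is an assumption: it supposes that the score-Jacobian term obtained by differentiating Eq. 3 is replaced, in magnitude, by $I(x_t)$ and then by its bound.

```latex
\begin{aligned}
\frac{\partial \hat{x}_{0|t}}{\partial x_t}
  &\approx \frac{1}{\sqrt{\hat{\alpha}_t}}
     \left( 1 + (1 - \hat{\alpha}_t)\, \frac{\partial s_\theta(x_t, t)}{\partial x_t} \right)
     && \text{(differentiating Eq. 3)} \\
  &\rightarrow \frac{1}{\sqrt{\hat{\alpha}_t}}
     \left( 1 + (1 - \hat{\alpha}_t)\, I(x_t) \right)
     && \text{(assumed substitution)} \\
  &\rightarrow \frac{1}{\sqrt{\hat{\alpha}_t}}
     \left( 1 + (1 - \hat{\alpha}_t) \cdot \frac{1}{1 - \hat{\alpha}_t} \right)
   = \frac{2}{\sqrt{\hat{\alpha}_t}}
     && \text{(bound of Eq. 8)}
\end{aligned}
```

Under this reading, the factor $(1 - \hat{\alpha}_t)$ cancels against the bound, which is why only $\hat{\alpha}_t$ remains in Eq. 9.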
[Figure 2 image: rows Condition, FICD, MPGD, FreeDom; text labels shown: blonde, Young, beauty, man.]
Figure 2: Qualitative examples of human face image generation with a single condition. The included conditions are
(a) text, (b) face parsing maps, and (c) sketches. We compare the results with those of three baselines. MPGD fails
on these tasks because they violate its linear hypothesis, while FICD performs well.
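Sampling process for FICD. In the end, we can derive a new sampling process for FICD. By Eq. 5 and Eq. 9, we
have:
$$\begin{aligned}
\nabla_{x_t} \log p(c \mid \hat{x}_{0|t}) &\approx \frac{2}{\sqrt{\hat{\alpha}_t}} \frac{\partial \varepsilon(\hat{x}_{0|t}, c)}{\partial \hat{x}_{0|t}} \\
&= \frac{2}{\sqrt{\hat{\alpha}_t}} \nabla_{\hat{x}_{0|t}} \log p(c \mid \hat{x}_{0|t}).
\end{aligned} \tag{10}$$
Then by Eq. 1 and Eq. 10, the sampling process of FICD is:
$$\begin{aligned}
x_{t-1} &= \hat{m}_{t-1} - \frac{2\rho_t}{\sqrt{\hat{\alpha}_t}} \nabla_{\hat{x}_{0|t}} \varepsilon(\hat{x}_{0|t}, c), \\
\hat{m}_{t-1} &= (1 + 0.5\beta_t)\, x_t + \beta_t \nabla_{x_t} \log p(x_t) + \sqrt{\beta_t}\, \epsilon,
\end{aligned} \tag{11}$$
where $\rho_t$ is the hyperparameter regarded as the learning rate to control the strength of guidance. The detailed
algorithm is shown in Algorithm 1.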
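The update in Eq. 11 and Algorithm 1 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions rather than the authors' implementation: the score function is the exact score of $\mathcal{N}(0, I)$, the energy is a squared distance, $\hat{\alpha}_t$ is taken as the cumulative product of $1 - \beta_t$ (the DDPM convention), the linear schedule is hypothetical, and $\rho_t$ is held constant.

```python
import numpy as np

def ficd_sample(score_fn, energy_grad_fn, c, betas, rho, dim, rng):
    """Toy sampler that follows the steps of Algorithm 1."""
    alpha_hat = np.cumprod(1.0 - betas)  # assumed DDPM-style cumulative product
    num_steps = len(betas)
    x = rng.standard_normal(dim)  # x_T ~ N(0, I)
    for t in range(num_steps - 1, 0, -1):
        noise = rng.standard_normal(dim)
        score = score_fn(x, t)
        # Unconditional update m_{t-1} (second line of Eq. 11).
        m = (1.0 + 0.5 * betas[t]) * x + betas[t] * score + np.sqrt(betas[t]) * noise
        # Posterior mean (Eq. 3); no gradient is taken through score_fn.
        x0_hat = (x + (1.0 - alpha_hat[t]) * score) / np.sqrt(alpha_hat[t])
        # Guidance with the closed-form weight 2 / sqrt(alpha_hat_t) (Eq. 9).
        g = (2.0 / np.sqrt(alpha_hat[t])) * energy_grad_fn(x0_hat, c)
        x = m - rho * g
    return x

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 2e-2, 1000)  # hypothetical linear schedule
c = np.array([1.0, -0.5])              # hypothetical condition

x0 = ficd_sample(
    score_fn=lambda x, t: -x,                                  # toy score of N(0, I)
    energy_grad_fn=lambda x0_hat, cond: 2.0 * (x0_hat - cond),  # squared-distance energy
    c=c, betas=betas, rho=0.1, dim=2, rng=rng,
)
print(np.linalg.norm(x0 - c))
```

The only gradient evaluated in the loop is that of the energy with respect to the posterior mean, so the score function is called once per step and never differentiated. In this toy setting the final sample lands close to the condition.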
We offer an additional explanation of why our method can perform better than the existing training-free
methods, based on an information perspective. Definition 1 illustrates that the posterior part can
be regarded as information gain, which helps us formulate an information-based view.
Concretely, the measurement part provides the direction toward the condition. The overall product
$\frac{\partial \hat{x}_{0|t}}{\partial x_t} \frac{\partial \varepsilon(\hat{x}_{0|t}, c)}{\partial \hat{x}_{0|t}}$ can
then be regarded as the information accumulated along the direction of the condition. Its norm
$\|\nabla_{x_t} \log p(c \mid \hat{x}_{0|t})\|_2$ reflects the amount of information gained at time step $t$: a larger
norm means more information.
First, for the overall conditional term $\nabla_{x_t} \log p(c \mid \hat{x}_{0|t})$, recent work [9] has shown that the
early and late phases can be skipped and that the critical phase for the conditional term is the middle phase. To
explore this further, we conduct an empirical study on the style-generation task, shown in Fig. 3, by plotting
$\|\nabla_{x_t} \log p(c \mid \hat{x}_{0|t})\|_2$ in different phases. Two observations can be drawn from Fig. 3: 1) There
is less information in the early and late phases since $\|\nabla_{x_t} \log p(c \mid \hat{x}_{0|t})\|_2$ is smaller. A
small norm yields a small weight for the conditional term when optimizing Eq. 2. 2) The key phase for conditional
generation is the middle phase, since $\|\nabla_{x_t} \log p(c \mid \hat{x}_{0|t})\|_2$ stays at a suitable level.
Different numbers of timesteps $T$ show a similar trend. From this information-based perspective, in the early
and late phases the existing methods [12] do not steer samples well toward the condition, so many time steps
are wasted.
[Figure 3: four panels plotting the norm of $\nabla_{x_t} \log p(c \mid \hat{x}_{0|t})$; y-axis: Gradient Norm, x-axis: Timesteps, legend entry: FICD.]
Intuitively, more information should be accumulated in both the early and late phases, especially in the early phase,
where the distance between $x_t$ and $c$ is very large. However, the above empirical results show that the
existing methods do not provide enough information about the condition. In contrast, the upper bound used by the
proposed method increases the information about the condition, especially in the early phase. Note that after
reformulating the posterior part by the upper bound, $\hat{\alpha}_t$ tends to be small in the early phase, so
$\frac{2}{\sqrt{\hat{\alpha}_t}}$ tends to be large. In this case, more information can be gained in the early phase,
since the posterior part gives the measurement part a significant weight, and the early phase contributes more to
optimizing Eq. 2. In this way, FICD can generate the conditional sample in the early phase, which reduces the time
cost. The empirical study in Fig. 3 also shows that the upper bound increases the weight of the measurement part in
the early phase, which supports our explanation.
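To make the behavior of the weight concrete, the following sketch evaluates $2/\sqrt{\hat{\alpha}_t}$ over a noise schedule. The linear schedule and the definition of $\hat{\alpha}_t$ as a cumulative product of $1 - \beta_t$ are assumptions taken from the DDPM convention; other schedules would give different magnitudes.

```python
import numpy as np

def guidance_weights(betas):
    """Closed-form posterior-part weight 2 / sqrt(alpha_hat_t) for each timestep."""
    alpha_hat = np.cumprod(1.0 - betas)
    return 2.0 / np.sqrt(alpha_hat)

betas = np.linspace(1e-4, 2e-2, 1000)  # hypothetical linear schedule
weights = guidance_weights(betas)

early_phase = weights[-1]  # t close to T, where the reverse process starts
late_phase = weights[0]    # t close to 0, where the reverse process ends
print(early_phase, late_phase)
```

Under this schedule the weight increases monotonically with $t$: it is close to 2 at the end of the reverse process and much larger at its start, which is the early-phase amplification described above.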
5 Experiment
To demonstrate the potential of FICD, we focus on tasks whose differentiable metrics are nonlinear, using
various open-source diffusion models. The tasks in this paper mainly comprise face-related tasks, style-guided
generation with Stable Diffusion, and ControlNet-related generation with multiple guidance.
For the face-related tasks, we introduce text, segmentation maps, and sketches as the condition to guide the generation
process following the FreeDom [9]. For the style-guided generation, we introduce a style image as the condition to
guide the generation process. For the ControlNet-related generations, we introduce complex multi-guidance. The
detailed settings, including the measurement for these tasks and the hyperparameters we have used, are listed in the
supplementary.
We run all experiments on a single RTX 4090 GPU. The baselines are three SOTA
methods: FreeDom, MPGD, and LGD [30]. We use the time-travel strategy [9]. For a fair comparison, all the
pre-trained diffusion models we use are the same as those in FreeDom.
Face-related tasks. We first show the results of FICD on face-related tasks, including text, face parsing
maps, and sketches, as shown in Fig. 2. FICD generates high-quality images that match the condition better than
MPGD, and it generates images similar to those of FreeDom. This supports our theoretical analysis, in which FICD
increases the information to generate high-quality images. We also report the qualitative results in the
supplementary.
Style-guided generation with Stable Diffusion. To further show the potential of FICD, we change the task
to style transfer based on the diffusion model. This task is more complicated than the face-related tasks since there
are two guides: 1) the text prompt and 2) the style image. We report qualitative examples in Fig. 4 and
quantitative results in Table 1. FICD achieves SOTA performance. Under T = 100, FICD achieves a Style score of
225.83 while maintaining a CLIP score of 29.59, which is a SOTA trade-off compared to the other baselines.
Meanwhile, when the number of time steps decreases from T = 50 to T = 30, FICD still achieves a better Style score
than the baselines at T = 50. These results show that improving the information gain could enhance efficiency while
verifying that the conflict may lead to insufficient accumulated information.
[Figure 4 image: prompt “a cat wearing glasses”; columns: Style, FreeDom (T = 100), FreeDom (T = 50), MPGD (T = 100), MPGD (T = 50), FICD (T = 50), FICD (T = 30).]
Figure 4: Qualitative examples of style-guided generation with Stable Diffusion, comparing FICD
with the three baselines.
ControlNet-related generations with multiple guidance. To further show the effectiveness of FICD, following
FreeDom, we apply it to multiple-guidance tasks based on ControlNet, which include two tasks: 1) face guidance
and 2) style. Concretely, for the face guidance task, we use text prompts and pose maps
as the input of ControlNet to generate images with similar poses that satisfy the text prompts. On top of
this, we add face ID images as an independent condition to guide ControlNet toward generating similar faces
and poses that satisfy both the text prompt and the face ID. We report the qualitative examples and results
in Fig. 5. Since MPGD produces obvious mismatches in the qualitative examples, we
further compare FICD with FreeDom in Table 2. FICD improves the pose distance metric from
54.21 to 40.01 compared to FreeDom, and the time cost is reduced from 193s to 42s. This illustrates the
efficiency of FICD and further supports the validity of the Fisher information view.
For the style task, we change the condition of ControlNet to a sketch and use the style image as independent
guidance to show the feasibility of FICD. We report qualitative examples in Fig. 6. FreeDom works well compared
to MPGD, so we make a further comparison between FreeDom and FICD in Table 3. FICD improves the CLIP metric from
69.71 to 70.27 and the Style metric from 282.25 to 244.50 (lower is better), and it reduces the time cost from
299s to 32s. These results show the efficiency of FICD. They also indicate that the conflict can cause insufficient
accumulated information and that improving the information gain can enhance the generation quality.
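The reported time costs can be turned into speed-up ratios with simple arithmetic. The short sketch below uses only the FreeDom and FICD timings stated in Tables 1 to 3; the helper name and dictionary labels are illustrative.

```python
def speedup(baseline_seconds, method_seconds):
    """Ratio of baseline runtime to method runtime."""
    return baseline_seconds / method_seconds

# (FreeDom time, FICD time) in seconds, as reported in the tables.
reported_times = {
    "Table 1, T = 100": (63, 27),
    "Table 2": (193, 42),
    "Table 3": (299, 32),
}

speedups = {name: speedup(base, ours) for name, (base, ours) in reported_times.items()}
for name, value in speedups.items():
    print(f"{name}: {value:.2f}x")
```

These ratios come to roughly 2.3x for the Stable Diffusion style task at T = 100, 4.6x for the face guidance task, and 9.3x for the ControlNet style task.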
Table 1: Quantitative results of the style-guided generation experiment with Stable Diffusion. We compare FICD with
three baselines, FreeDom, MPGD, and LGD. FICD obtains an overall improvement in the Style score [9], the CLIP score,
and the time cost, since it can generate high-quality images with few sampling steps. NA means that the precise time
cost is unavailable: LGD-MC requires more than 24GB of GPU VRAM, so we had to use shared memory.
Method              Style↓   CLIP↑   Time (s)
LGD-MC (T = 100)    247.39   29.96   NA
FreeDom (T = 100)   226.26   28.07   63
MPGD (T = 100)      351.16   25.08   27
FICD (T = 100)      225.83   29.59   27
MPGD (T = 50)       408.28   23.23   17
FreeDom (T = 50)    377.96   30.01   33
FICD (T = 50)       245.51   30.32   17
FICD (T = 30)       281.67   29.07   11
Table 2: Quantitative results of the face-related tasks in the ControlNet experiments, where Pose Distance is the norm
between the pose maps and the generated images.
Method    Pose Distance↓   Time (s)
FreeDom   54.21            193
FICD      40.01            42
Ablation study for $\rho_t$. In the supplementary, we study the effect of $\rho_t$ from small to large values, starting
from $\rho_t = 1$. FICD is scalable, and the user can set different values according to the requirements.
6 Conclusion
We proposed Fisher information conditional diffusion (FICD) to achieve training-free conditional generation.
FICD first shows that the conditional term in training-free conditional generation can be split into the measurement
and posterior parts. The posterior part can then be modeled using the Fisher information. In this view, the
function of the conditional term is to accumulate enough information along the measurement
part, and the posterior part controls how much information can be accumulated. The key
insight is to use the upper bound of the Fisher information to approximate the posterior part, which increases
the overall information gain in the generation process. An interesting finding emerges here: increasing the
information gain enables FICD to achieve conditional generation with fewer steps, which reduces the time cost.
It also improves the generation quality since more information can be accumulated during the overall generation
process. To support our theory, we offer an information theory-based explanation of why FICD works well.
Finally, the experimental results show that FICD increases the generation quality while reducing the time cost.
Limitation. However, there are some limitations. We have shown that dropping the gradient causes some
information to be lost. No theory identifies whether such information always has a negative influence, so dropping
it poses a potential risk. For example, based on further exploration, we find that dropping the gradient
inevitably aggravates conflict among various conditions. This is obvious in the ControlNet-related tasks. This is
another reason why our upper bound could work better than only increasing the factor of the condition (detailed in
the supplementary). In future work, we will further study how to improve FICD.
Table 3: Quantitative results of style image guidance in the ControlNet experiments, where CLIP is the cosine
similarity between the CLIP embeddings of the generated images and the sketch, and Style is the distance norm between
the style images and the generated images.
Method    CLIP↑   Style↓   Time (s)
FreeDom   69.71   282.25   299
FICD      70.27   244.50   32
[Figure image: text condition “Young Man Realistic Photo”.]
References
[1] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. CoRR, abs/2006.11239,
2020.
[2] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-
based generative modeling through stochastic differential equations. In International Conference on Learning
Representations, 2021.
[3] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models, 2022.
[4] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models,
2023.
[5] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image
synthesis with latent diffusion models, 2021.
[6] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever,
and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models,
2022.
[7] Bram Wallace, Akash Gokul, Stefano Ermon, and Nikhil Naik. End-to-end diffusion latent optimization improves
classifier guidance, 2023.
[8] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven
image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition (CVPR), pages 1921–1930, June 2023.
[9] Jiwen Yu, Yinhuai Wang, Chen Zhao, Bernard Ghanem, and Jian Zhang. Freedom: Training-free energy-
guided conditional diffusion model. Proceedings of the IEEE/CVF International Conference on Computer Vision
(ICCV), 2023.
[10] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit:
Guided image synthesis and editing with stochastic differential equations, 2022.
[11] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth:
Fine tuning text-to-image diffusion models for subject-driven generation. In IEEE/CVF Conference on Computer
Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 22500–22510,
2023.
[12] Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye. Diffusion
posterior sampling for general noisy inverse problems. In The Eleventh International Conference on Learning
Representations, 2023.
[13] Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. CoRR, abs/2105.05233,
2021.
[14] Morteza Mardani, Jiaming Song, Jan Kautz, and Arash Vahdat. A variational perspective on solving inverse
problems with diffusion models, 2023.
[15] Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao,
Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov, and Stefano Ermon. Manifold preserving guided diffusion,
2023.
[16] Gabriel Cardoso, Yazid Janati El Idrissi, Sylvain Le Corff, and Eric Moulines. Monte carlo guided diffusion for
bayesian linear inverse problems, 2023.
[17] Xianghao Kong, Rob Brekelmans, and Greg Ver Steeg. Information-theoretic diffusion. In International Con-
ference on Learning Representations, 2023.
[18] Ming Li, Taojiannan Yang, Huafeng Kuang, Jie Wu, Zhaoning Wang, Xuefeng Xiao, and Chen Chen. Control-
net++: Improving conditional controls with efficient consistency feedback, 2024.
[19] Jaeseok Jeong, Mingi Kwon, and Youngjung Uh. Training-free content injection using h-space in diffusion
models, 2024.
[20] Chen Henry Wu and Fernando De la Torre. A latent space of stochastic diffusion models for zero-shot image
editing and guidance. In ICCV, 2023.
[21] Jiaming Song, Arash Vahdat, Morteza Mardani, and Jan Kautz. Pseudoinverse-guided diffusion models for
inverse problems. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali,
Rwanda, May 1-5, 2023, 2023.
[22] Youcan Xu, Zhen Wang, Jun Xiao, Wei Liu, and Long Chen. Freetuner: Any subject in any style with training-
free diffusion, 2024.
[23] Zalan Fabian, Berk Tinaz, and Mahdi Soltanolkotabi. Adapt and diffuse: Sample-adaptive reconstruction via
latent diffusion models, 2023.
[24] Yong-Hyun Park, Mingi Kwon, Jaewoong Choi, Junghyo Jo, and Youngjung Uh. Understanding the latent space
of diffusion models through the lens of riemannian geometry. In Proceedings of the Advances in Neural Infor-
mation Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS
2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
[25] Lingxiao Yang, Shutong Ding, Yifan Cai, Jingyi Yu, Jingya Wang, and Ye Shi. Guidance with spherical gaussian
constraint for conditional diffusion, 2024.
[26] Yingheng Wang, Yair Schiff, Aaron Gokaslan, Weishen Pan, Fei Wang, Christopher De Sa, and Volodymyr
Kuleshov. InfoDiffusion: Representation learning using information maximizing diffusion models. In Proceed-
ings of the 40th International Conference on Machine Learning, pages 36336–36354, 2023.
[27] Xianghao Kong, Ollie Liu, Han Li, Dani Yogatama, and Greg Ver Steeg. Interpretable diffusion via information
decomposition, 2023.
[28] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv
Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. International Journal
of Computer Vision, 128(2):336–359, October 2019.
[29] Andrew R. Barron. Entropy and the central limit theorem. The Annals of Probability, 14(1):336–342, 1986.
[30] Jiaming Song, Qinsheng Zhang, Hongxu Yin, Morteza Mardani, Ming-Yu Liu, Jan Kautz, Yongxin Chen, and
Arash Vahdat. Loss-guided diffusion models for plug-and-play controllable generation. In International Confer-
ence on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, pages 32483–32498, 2023.