On Discrete Prompt Optimization for Diffusion Models
Abstract

This paper introduces the first gradient-based framework for prompt optimization in text-to-image diffusion models. We formulate prompt engineering as a discrete optimization problem over the language space. Two major challenges arise in efficiently finding a solution to this problem: (1) Enormous Domain Space: Setting the domain to the entire language space poses significant difficulty to the optimization process. (2) Text Gradient: Efficiently computing the text gradient is challenging, as it requires backpropagating through the inference steps of the diffusion model and a non-differentiable embedding lookup table. Beyond the problem formulation, our main technical contributions lie in solving the above challenges. First, we design a family of dynamically generated compact subspaces comprised of only the most relevant words to user input, substantially restricting the domain space. Second, we introduce "Shortcut Text Gradient" — an effective replacement for the text gradient that can be obtained with constant memory and runtime. Empirical evaluation on prompts collected from diverse sources (DiffusionDB, ChatGPT, COCO) suggests that our method can discover prompts that substantially improve (prompt enhancement) or destroy (adversarial attack) the faithfulness of images generated by the text-to-image diffusion model.

1 Google Research, 2 University of California, Los Angeles, 3 Google DeepMind. Correspondence to: Boqing Gong <[email protected]>, Ruochen Wang <[email protected]>.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

Large-scale text-based generative models exhibit a remarkable ability to generate novel content conditioned on user input (Rombach et al., 2022; Ramesh et al., 2022; Saharia et al., 2022; Ho et al., 2022; Yu et al., 2022; Chang et al., 2023). Despite being trained with huge corpora, there still exists a substantial gap between user intention and what the model interprets (Zhou et al., 2022; Feng et al., 2022; Rombach et al., 2022; Radford et al., 2021; Lian et al., 2023; Ouyang et al., 2022; Ramesh et al., 2022). The misalignment is even more severe in text-to-image generative models, partially since they often rely on much smaller and less capable text encoders (Radford et al., 2021; Cherti et al., 2023; Raffel et al., 2020) than large language models (LLMs). As a result, instructing a large model to produce intended content often requires laborious human efforts in crafting the prompt through trials and errors (a.k.a. Prompt Engineering) (Art, Year; Wang et al., 2022; Witteveen & Andrews, 2022; Liu & Chilton, 2022; Zhou et al., 2022; Hao et al., 2022). To automate this process for language generation, several recent attempts have shown tremendous potential in utilizing LLMs to enhance prompts (Pryzant et al., 2023; Zhou et al., 2022; Chen et al., 2023; Guo et al., 2023; Yang et al., 2023; Hao et al., 2022). However, efforts on text-to-image generative models remain scarce and preliminary, probably due to the challenges faced by these models' relatively small text encoders in understanding subtle language cues.

DPO-Diff. This paper presents a systematic study of prompt optimization for text-to-image diffusion models. We introduce a novel optimization framework based on the following key observations. 1) Prompt engineering for diffusion models can be formulated as a Discrete Prompt Optimization (DPO-Diff) problem over the space of natural languages. Moreover, the framework can be used to find prompts that either improve (prompt enhancement) or destroy (adversarial attack) the generation process, by simply reversing the sign of the objective function. 2) We show that for diffusion models with classifier-free guidance (Ho & Salimans, 2022), improving the image generation process is more effective when optimizing "negative prompts" (Andrew, 2023; Woolf, 2022) than positive prompts. Beyond the problem formulation of DPO-Diff, where "Diff" highlights our focus on text-to-image diffusion models, the main technical contributions of this paper lie in efficient methods for solving this optimization problem, including the design of compact domain spaces and a gradient-based algorithm.
Figure 1: Computational procedure of Shortcut Text Gradient (bottom) vs. Full Gradient (top) on text. The figure depicts the compact search spaces for positive and negative prompts (per-word synonym and antonym substitutes generated by an LLM (ChatGPT) for the user prompt "Amidst the bustling city, neon lights illuminate the vibrant streets."), the Gumbel Softmax relaxation over word choices, and the shortcut gradient path that avoids unrolling the sampling steps, with the loss computed by CLIP.
Compact domain spaces. DPO-Diff's domain space is a discrete search space at the word level to represent prompts. While this space is generic enough to cover any sentence, it is excessively large due to the dominance of words irrelevant to the user input. To alleviate this issue, we design a family of dynamically generated compact search spaces based on relevant word substitutions, for both positive and negative prompts. These subspaces enable efficient search for both prompt enhancement and adversarial attack tasks.

Shortcut Text Gradients for DPO-Diff. Solving DPO-Diff with a gradient-based algorithm requires computing the text gradient, i.e., backpropagating from the generated image, through all inference steps of a diffusion model, and finally to the discrete text. Two challenges arise in obtaining this gradient: 1) This process incurs compound memory-runtime complexity over the number of backward passes through the denoising step, making it prohibitive to run on large-scale diffusion models (e.g., a 870M-parameter Stable Diffusion v1 requires ∼750G memory to run backpropagation through 50 inference steps (Rombach et al., 2022)). 2) The embedding lookup tables in text encoders are non-differentiable. To reduce the computational cost in 1), we provide a generic replacement for the text gradient that bypasses the need to unroll the inference steps in a backward pass, allowing it to be computed with constant memory and runtime. To backpropagate through the discrete embedding lookup table, we continuously relax the categorical word choices to a learnable smooth distribution over the vocabulary, using the Gumbel Softmax trick (Guo et al., 2021; Jang et al., 2016; Dong & Yang, 2019). The gradient obtained by this method, termed Shortcut Text Gradient, enables us to efficiently solve DPO-Diff regardless of the number of inference steps of a diffusion model.

To evaluate our prompt optimization method for the diffusion model, we collect and filter a set of challenging prompts from diverse sources including DiffusionDB (Wang et al., 2022), COCO (Lin et al., 2014), and ChatGPT (Ouyang et al., 2022). Empirical results suggest that DPO-Diff can effectively discover prompts that improve (or destroy for adversarial attack) the faithfulness of text-to-image diffusion models, surpassing human-engineered prompts and prior baselines by a large margin. We summarize our primary contributions as follows:

• DPO-Diff: A generic framework for prompt optimization as a discrete optimization problem over the space of natural languages, of arbitrary metrics.

• Compact domain spaces: A family of dynamic compact search spaces, over which a gradient-based algorithm enables efficient solution finding for the prompt optimization problem.

• Shortcut Text Gradients: The first novel computation method to enable backpropagation through the diffusion models' lengthy sampling steps with constant memory-runtime complexity, enabling gradient-based search algorithms.

• Negative prompt optimization: The first empirical result demonstrating the effectiveness of optimizing negative prompts for diffusion models.

2. Related Work

Text-to-image diffusion models. Diffusion models trained on a large corpus of image-text datasets significantly advanced the state of text-guided image generation (Rombach et al., 2022; Ramesh et al., 2022; Saharia et al., 2022; Chang et al., 2023; Yu et al., 2022). Despite the success, these models can sometimes generate images with poor quality. While some preliminary observations suggest that negative prompts can be used to improve image quality (Andrew, 2023; Woolf, 2022), there exists no principled way to find negative prompts.
Moreover, several studies have shown that large-scale text-to-image diffusion models face significant challenges in understanding language cues in user input during image generation; particularly, diffusion models often generate images with missing objects and incorrectly bounded attribute-object pairs, resulting in poor "faithfulness" or "relevance" (Hao et al., 2022; Feng et al., 2022; Lian et al., 2023; Liu et al., 2022). Existing solutions to this problem include compositional generation (Liu et al., 2022), augmenting the diffusion model with large language models (Yang et al., 2023), and manipulating attention masks (Feng et al., 2022). As a method orthogonal to them, our work reveals that negative prompt optimization can also alleviate this issue.

Prompt optimization for text-based generative models. Aligning a pretrained large language model (LLM) with human intentions is a crucial step toward unlocking the potential of large-scale text-based generative models (Ouyang et al., 2022; Rombach et al., 2022). An effective line of training-free alignment methods is prompt optimization (PO) (Zhou et al., 2022). PO originated from in-context learning (Dale, 2021), which is mainly concerned with various arrangements of task demonstrations. It later evolved into automatic prompt engineering, where powerful language models are utilized to refine prompts for certain tasks (Zhou et al., 2022; Pryzant et al., 2023; Yang et al., 2023; Hao et al., 2022). While PO has been widely explored for LLMs, efforts on diffusion models remain scarce. The most relevant prior work to ours is Promptist (Hao et al., 2022), which finetunes an LLM via reinforcement learning from human feedback (Ouyang et al., 2022) to augment user prompts with artistic modifiers (e.g., high-resolution, 4K) (Art, Year), resulting in aesthetically pleasing images. However, the lack of paired context-aware data significantly limits its ability to follow the user intention (Figure 3).

Textual Inversion. Optimizing texts in pretrained diffusion models has also been explored under the "Textual Inversion" task (Gal et al., 2022; Wen et al., 2023; Mokady et al., 2023). Textual Inversion involves adapting a frozen model to generate novel visual concepts based on a set of user-provided images. It achieves this by distilling these images into soft or hard text prompts, enabling the model to replicate the visual features of the user images. Since the source images are provided, the training process mirrors that of typical diffusion model training. While some Textual Inversion papers also use the term "prompt optimization", it is distinct from the Prompt Optimization considered by Promptist (Hao et al., 2022) and our work. Our objective is to enhance a model's ability to follow text prompts. Here, the primary input is the user prompt, and improvement is achieved by optimizing this prompt to enhance the resulting image. Since the score function is applied to the final generated image, the optimization process necessitates backpropagation through all inference steps. Despite using similar terminologies, these methodologies are fundamentally distinct and not interchangeable. Table 3 further summarizes the key differences in taxonomy.

Efficient backpropagation through diffusion sampling steps. Text-to-image diffusion models generate images via a progressive denoising process, making multiple passes through the same network (Ho et al., 2020). When a loss is applied to the output image, computing the gradient w.r.t. any model component (text, weight, sampler, etc.) requires backpropagating through all the sampling steps. This process incurs compound complexity over the number of backward passes in both memory and runtime, making it infeasible to run on regular commercial devices. Existing efforts achieve constant memory via gradient checkpointing (Watson et al., 2021) or solving an augmented SDE problem (Nie et al., 2022), at the expense of even higher runtime.

3. Preliminaries on diffusion models

Denoising diffusion probabilistic models. On a high level, diffusion models (Ho et al., 2020) are a type of hierarchical Variational Autoencoder (Sønderby et al., 2016) that generates samples by reversing (backward) a progressive noisification process (forward). Let x_0, ..., x_T be a series of intermediate samples of increasing noise levels; the forward process progressively adds Gaussian noise to the original image x_0:

q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t I),    (1)

where \beta_t is a scheduling variable. Using the reparameterization trick, any x_t, t = 1, ..., T, can be computed from x_0 in one step:

x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,  \quad \text{where } \alpha_t = 1 - \beta_t \text{ and } \bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i,    (2), (3)

where \epsilon is a standard Gaussian error. The reverse process starts with a standard Gaussian noise, x_T \sim \mathcal{N}(0, I), and progressively denoises it using the following joint distribution:

p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \quad \text{where } p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma).

While the mean function \mu_\theta(x_t, t) can be parameterized by a neural network (e.g., UNet (Rombach et al., 2022; Ronneberger et al., 2015)) directly, prior studies found that modeling the residual error \epsilon(x_t, t) instead works better empirically (Ho et al., 2020). The two strategies are mathematically equivalent, as \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon(x_t, t) \right).
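To make the notation concrete, the closed-form forward sample of Eq. (2) and the ε-parameterized mean can be written in a few lines of PyTorch. This is an illustrative sketch, not code from the paper; the schedule tensors `alpha` and `alpha_bar` are assumed to be precomputed.

```python
import torch

def q_sample(x0, t, alpha_bar, noise=None):
    """Draw x_t ~ q(x_t | x_0) in one step, Eq. (2)."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)   # cumulative product at step t
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

def mu_from_eps(xt, t, eps_pred, alpha, alpha_bar):
    """Reverse-process mean parameterized by the predicted error (Ho et al., 2020)."""
    a = alpha[t].view(-1, 1, 1, 1)
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)
    return (xt - (1 - a) / (1 - a_bar).sqrt() * eps_pred) / a.sqrt()
```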
Conditional generation and negative prompts. The above formulation can be easily extended to conditional generation via classifier-free guidance (Ho & Salimans, 2022), widely adopted in contemporary diffusion models. At each sampling step, the predicted error \tilde{\epsilon} is obtained by subtracting the unconditional signal (c("")) from the conditional signal (c(s)), up to a scaling factor w:

\tilde{\epsilon}_\theta(x_t, c(s), t) = (1 + w)\, \epsilon_\theta(x_t, c(s), t) - w\, \epsilon_\theta(x_t, c(\text{""}), t).    (4)

If we replace this empty string with an actual text, then it becomes a Negative Prompt (Andrew, 2023; Woolf, 2022), instructing the model what to exclude from the generated image.
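As a concrete illustration, one guidance step with a negative prompt could look like the sketch below. It assumes a diffusers-style UNet whose forward call accepts `encoder_hidden_states` and returns an object with a `.sample` field; those interface details are assumptions for illustration, not something the paper specifies.

```python
import torch

@torch.no_grad()
def guided_eps(unet, xt, t, cond_emb, neg_emb, w):
    """Classifier-free guidance, Eq. (4). Passing the embedding of a negative
    prompt as `neg_emb` (instead of the empty string "") tells the model what
    to exclude from the generated image."""
    eps_cond = unet(xt, t, encoder_hidden_states=cond_emb).sample
    eps_neg = unet(xt, t, encoder_hidden_states=neg_emb).sample
    return (1 + w) * eps_cond - w * eps_neg
```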
4. DPO-Diff Framework

Formulation. Our main insight is that prompt engineering can be formulated as a discrete optimization problem in the language space. Concretely, we represent the problem domain S as a sequence of M words w_i from a predefined vocabulary V: S = \{w_1, w_2, \dots, w_M \mid \forall i,\ w_i \in V\}. This space is generic enough to cover all possible sentences of lengths less than M (when the empty string is present). Let G(s) denote a text-to-image generative model, and let s_{user} and s denote the user input and optimized prompt, respectively. The optimization problem can be written as

\min_{s \in S} \mathcal{L}(G(s), s_{user}),    (5)

where \mathcal{L} can be any objective function that measures the effectiveness of the learned prompt when used to generate images. Following previous works (Hao et al., 2022), we use the clip loss CLIP(I, s_{user}) (Crumb, 2022) to measure the instruction-following ability of the diffusion model.

Application. The DPO-Diff framework is versatile for handling not only prompt enhancement but also adversarial attack tasks. Figure 1 illustrates the taxonomy of those two applications. Adversarial attacks for text-to-image generative models can be defined as follows:

Definition 4.1. Given a user input s_{user}, the attacker aims at slightly perturbing s_{user} to disrupt the prompt-following ability of image generation, i.e., the resulting generated image is no longer describable by s_{user}.

To modify (5) into the adversarial attack, we can simply add a negative sign to the objective function (\mathcal{L}) and restrict the distance between an adversarial prompt (s) and the user input (s_{user}). Mathematically, this can be written as the following:

\min_{s \in S} -\mathcal{L}(G(s), s_{user}) \quad \text{s.t.} \quad d(s, s_{user}) \le \lambda,    (6)

where d(s, s_{user}) is a distance measure that forces the perturbed prompt (s) to be semantically similar to the user input (s_{user}).
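Since the two tasks differ only by the sign of the objective, a single scoring routine can serve both. The sketch below treats the generator and the clip loss as black-box callables (`generate`, `clip_loss`); these are placeholder names rather than the paper's actual interfaces.

```python
def dpo_diff_objective(candidate, user_prompt, generate, clip_loss, adversarial=False):
    """Evaluate one candidate prompt under Eq. (5) (enhancement) or Eq. (6) (attack).

    `generate` maps a prompt string to an image (the diffusion model G), and
    `clip_loss` measures how poorly that image matches the *user* prompt.
    For the adversarial task, the semantic-similarity constraint d(s, s_user)
    is enforced separately by how the Synonym Space is constructed."""
    loss = clip_loss(generate(candidate), user_prompt)
    return -loss if adversarial else loss
```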
5. Compact search spaces for efficient prompt discovery

While the entire language space facilitates maximal generality, it is also unnecessarily inefficient, as it is populated with words irrelevant to the task. We propose a family of compact search spaces that dynamically extract a subset of words relevant to the user input.

5.1. Application 1: Discovering adversarial prompts for model diagnosis

Synonym Space for adversarial attack. In light of the constraint on semantic similarity in (6), we build a search space for the adversarial prompts by substituting each word in the user input s_{user} with its synonyms (Alzantot et al., 2018), preserving the meaning of the original sentence. The synonyms can be found by either dictionary lookup or querying ChatGPT (Appendix F.2).

5.2. Application 2: Discovering enhanced prompts for image generation

While the Synonym Space is suitable for attacking diffusion models, we found that it performs poorly on finding improved prompts. This is in contradiction to LLMs, where rephrasing user prompts can often lead to substantial gains (Zhou et al., 2022). One plausible reason is that contemporary diffusion models often rely on small-scale text encoders (Radford et al., 2021; Cherti et al., 2023; Raffel et al., 2020) that are much weaker than LLMs, with many known limitations in understanding subtle language cues (Feng et al., 2022; Liu et al., 2022; Yang et al., 2023).

Antonym Space for negative prompt optimization. Inspired by these observations, we propose a novel solution to optimize for negative prompts instead — a unique concept that arises from classifier-free guidance (Ho & Salimans, 2022) used in diffusion models (Section 3). Recall that negative prompts instruct the diffusion model to remove contents in generated images, opposite to the positive prompt; intuitively, the model's output image can safely exclude the content with the opposite meaning to the words in the user input, thereby amplifying the concepts presented in the positive prompt. We thereby build the space of negative prompts from the antonyms of each word in the user prompt. The antonyms of words can also be obtained either via dictionary lookup or by querying ChatGPT. However, unlike the Synonym Space, we concatenate the antonyms directly in comma-separated format, mirroring the practical usage of negative prompts. To the best of our knowledge, this is the first exploratory work on automated negative prompt optimization.
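A sketch of how these spaces could be assembled is given below. `get_synonyms` and `get_antonyms` are stand-ins for a dictionary lookup or the ChatGPT queries of Appendix F.2; they are hypothetical helpers used only for illustration.

```python
def build_search_spaces(user_prompt, get_synonyms, get_antonyms):
    """Per-word substitution spaces for DPO-Diff."""
    words = user_prompt.split()
    # Synonym Space (adversarial attack): keep the original word as one choice.
    synonym_space = [[w] + get_synonyms(w) for w in words]
    # Antonym Space (negative prompts): include the empty string so a slot can be skipped.
    antonym_space = [[""] + get_antonyms(w) for w in words]
    return synonym_space, antonym_space

def render_negative_prompt(choices):
    """Concatenate selected antonyms in comma-separated format, mirroring
    the practical usage of negative prompts."""
    return ", ".join(c for c in choices if c)
```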
6. A Gradient-based solver for DPO-Diff

Due to the query efficiency of white-box algorithms leveraging gradient information, we also explore a gradient-based method to solve (5) and (6). However, obtaining the text gradient is non-trivial due to two major challenges. 1) Backpropagating through the sampling steps of the diffusion inference process incurs high complexity w.r.t. memory and runtime, making it prohibitively expensive to obtain gradients (Watson et al., 2021; Nie et al., 2022). For samplers with 50 inference steps (e.g., DDIM (Song et al., 2020)), it raises the runtime and memory cost by 50 times compared to a single diffusion training step. 2) To further compute the gradient on text, the backpropagation needs to pass through a non-differentiable embedding lookup table. To alleviate these issues, we propose Shortcut Text Gradient, an efficient replacement for the text gradient that can be obtained with constant memory and runtime. Our solutions to (1) and (2) are discussed in Section 6.1.1 and Section 6.1.2, respectively. Moreover, Section 6.2 discusses how to sample from the learned text distribution via evolutionary search.

6.1. Shortcut Text Gradient

6.1.1. Backpropagating through diffusion sampling steps

To efficiently backpropagate the loss from the final image to the intermediate feature at an arbitrary step, our key idea is to trim the computation graph down to only a few steps from both ends, resulting in a constant number of backward passes (Figure 1). To achieve this, three operations are required through the image generation process:

(1) Sampling without gradient from step T (noise) to t. We disable gradients up to step t, thereby eliminating the need for backpropagation from T to t.

(2) Enabling gradient from t to t − K. The backward computation graph is enabled for the K steps starting at t.

(3) Estimating x_0 directly from x_{t−K}. To bypass the final t − K steps of the UNet, a naive solution is to directly decode and feed the noisy image x_{t−K} to the loss function. However, due to distribution shifts, these intermediate images often cannot be properly interpreted by downstream modules such as the VAE decoder (Rombach et al., 2022) and CLIP (Dhariwal & Nichol, 2021). Instead, we propose to use the following closed-form estimation of the final image \hat{x}_0 (Song et al., 2020) to bridge the gap:

\hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_{t-K}}} \left( x_{t-K} - \sqrt{1 - \bar{\alpha}_{t-K}}\, \hat{\epsilon}_\theta(x_{t-K}, t - K) \right).

This way, the Jacobian of \hat{x}_0 w.r.t. x_{t−K} can be computed analytically, with complexity independent of t. Note that the above estimation of x_0 is not a trick — it directly comes from a mathematically equivalent interpretation of the diffusion model, where each inference step can be viewed as computing \hat{x}_0 and plugging it into q(x_{t−K} \mid x_t, \hat{x}_0) to obtain the transitional probability (see Appendix C for the derivation).

Remark 1: Complexity Analysis. With Shortcut Text Gradient, the computational cost of backpropagating through the inference process can be reduced to K backward passes of the UNet. When we set t = T and K = T, it becomes the full text gradient; when K = 1, the computation cost reduces to a single backward pass.

Remark 2: Connection to ReFL (Xu et al., 2024). ReFL is a post-hoc alignment method for finetuning diffusion models. It also adopts the estimation of x_0 when optimizing the diffusion model against a scorer, which is mathematically equivalent to the case when K = 1.
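Putting the three operations together, one Shortcut Text Gradient evaluation can be sketched as below. It assumes diffusers-style components (a UNet taking `encoder_hidden_states`, a scheduler exposing `timesteps`, `step(...)`, and `alphas_cumprod`, and a latent-space `decode`); these interface details and the latent shape are assumptions for illustration, not the paper's released implementation.

```python
import torch

def shortcut_text_gradient_loss(unet, scheduler, decode, clip_loss,
                                prompt_emb, user_prompt, T=50, t=20, K=1):
    """Backpropagate to the (relaxed) prompt embedding through only a few UNet steps."""
    latents = torch.randn(1, 4, 64, 64, device=prompt_emb.device)  # SD v1-sized latent

    # (1) Sample without gradient from step T (pure noise) down to step t.
    with torch.no_grad():
        for step in scheduler.timesteps[: T - t]:
            eps = unet(latents, step, encoder_hidden_states=prompt_emb).sample
            latents = scheduler.step(eps, step, latents).prev_sample

    # (2) Enable the backward graph for the K steps starting at t.
    for step in scheduler.timesteps[T - t : T - t + K]:
        eps = unet(latents, step, encoder_hidden_states=prompt_emb).sample
        latents = scheduler.step(eps, step, latents).prev_sample

    # (3) Closed-form estimate of x0 from x_{t-K}, then decode and score it.
    step = scheduler.timesteps[T - t + K]
    eps = unet(latents, step, encoder_hidden_states=prompt_emb).sample
    a_bar = scheduler.alphas_cumprod[step]
    x0_hat = (latents - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
    # The backward pass now unrolls only a constant number of UNet calls.
    return clip_loss(decode(x0_hat), user_prompt)
```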
6.1.2. Backpropagating through the embedding lookup table

In diffusion models, a tokenizer transforms text input into indices, which are used to query a lookup table for the corresponding word embeddings. To allow further propagating gradients through this non-differentiable indexing operation, we relax the categorical choice of words into a continuous probability over words and learn a distribution over them. We parameterize the distribution using Gumbel Softmax (Jang et al., 2016) with uniform temperature (\eta = 1):

\tilde{e} = \sum_{i=1}^{|V|} e_i \cdot \frac{\exp((\log \alpha_i + g_i)/\eta)}{\sum_{j=1}^{|V|} \exp((\log \alpha_j + g_j)/\eta)},    (7)

where \alpha (a |V|-dimensional vector) denotes the learnable parameters, g denotes the Gumbel random variables, e_i is the embedding of word i, and \tilde{e} is the output mixed embedding.
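In code, the relaxation of a single word slot can be expressed with the standard Gumbel-Softmax primitive. The snippet below is a minimal sketch; the candidate-word embedding table is assumed to come from the text encoder.

```python
import torch
import torch.nn.functional as F

def mixed_word_embedding(log_alpha, embedding_table, tau=1.0):
    """Continuous relaxation of one word slot, Eq. (7).

    `log_alpha` holds the learnable logits over this slot's candidate words and
    `embedding_table` their embeddings (n, d); the Gumbel-Softmax sample mixes
    the embeddings so the result stays differentiable w.r.t. `log_alpha`."""
    probs = F.gumbel_softmax(log_alpha, tau=tau, hard=False)  # (n,)
    return probs @ embedding_table                            # (d,)

# Example: one slot with 3 candidate words and 768-dim embeddings.
log_alpha = torch.zeros(3, requires_grad=True)
table = torch.randn(3, 768)
e_tilde = mixed_word_embedding(log_alpha, table)
```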
6.2. Efficient sampling with Evolutionary Search

To efficiently sample candidate prompts from the learned Gumbel "distribution", we adopt evolutionary search, known for its sample efficiency (Goldberg, 1989; Wu et al., 2019). Our adaptation of the evolutionary algorithm to the prompt optimization task involves three key steps: (1) Genotype Definition: We define the genotype of each candidate prompt as the list of searched words from the compact search space, where modifications to the genotype correspond to edits to the word choices in the prompt. (2) Population Initialization: We initialize the algorithm's population with samples drawn from the learned Gumbel distribution to bias the starting candidates towards regions of high potential. (3) Evolutionary Operations: We execute a standard evolutionary search, including several rounds of crossover and mutation (Goldberg, 1989), culminating in the selection of the final candidates.
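The three steps translate into a short search loop like the one below; the genotype and space containers and helper names are illustrative, and the hyperparameters echo Appendix F.1 rather than define them.

```python
import random

def evolutionary_search(init_genotypes, space, fitness, rounds=2,
                        tournament=10, n_mut=10, n_cross=10, mut_prob=0.1):
    """Evolve word-choice genotypes sampled from the learned Gumbel distribution.

    `fitness` returns the clip loss of a candidate prompt (lower is better);
    in practice it should cache results, since each call runs the sampler."""
    def mutate(g):
        return [random.choice(space[i]) if random.random() < mut_prob else w
                for i, w in enumerate(g)]

    def crossover(a, b):
        return [random.choice(pair) for pair in zip(a, b)]

    population = list(init_genotypes)
    for _ in range(rounds):
        population.sort(key=fitness)          # keep the top candidates (tournament)
        parents = population[:tournament]
        children = [mutate(random.choice(parents)) for _ in range(n_mut)]
        children += [crossover(*random.sample(parents, 2)) for _ in range(n_cross)]
        population = parents + children
    return min(population, key=fitness)
```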
Figure 3: Example images generated by improved negative prompts from DPO-Diff vs. Promptist (more in Figure 7). Compared with Promptist, DPO-Diff was able to generate images that better capture the content in the original prompt. Example user prompts include "A dedicated gardener tending to a ... bonsai tree." and "magical ... bear with glowing magical marks ...".

Figure 4: Example images generated by adversarial prompts from DPO-Diff. While keeping the overall meaning similar to the user input, adversarial prompts completely destroy the prompt-following ability of the Stable Diffusion model (more in Figure 8). Example pairs include "A group of friends gather around a table for a meal." → "A party of friends cluster around a surface for a food" and "oil painting of a mountain landscape" → "grease picture illustrating one mountain view".
... effectively explore the text regions where Stable Diffusion fails to interpret.

(Figure example — User Prompt: A bunch of luggage that is in front of a truck.)
(Figure: clip loss vs. number of evaluations (1 eval = 50 DDIM steps), comparing Random Search, Evolutionary Prompt Optimization, Gradient Prompt Optimization, and Hybrid (GPO + EPO).)
Table 2: Quantitative evaluation of optimizing negative prompts (w/ Antonyms Space) and positive prompts (w/ Synonym Space) for Stable Diffusion.

Prompt            | DiffusionDB       | ChatGPT           | COCO
User Input        | 0.8741 ± 0.0203   | 0.8159 ± 0.0100   | 0.8606 ± 0.0096
Positive Prompt   | 0.8747 ± 0.0189   | 0.8304 ± 0.0284   | 0.8624 ± 0.0141
Negative Prompt   | 0.8579 ± 0.0242   | 0.8133 ± 0.0197   | 0.8403 ± 0.0210

... candidates with consistently lower clip loss than the Synonym Space.

9. Discussion on the Search vs. Learning Paradigms for Utilizing Computation

This section elucidates the relationship between two distinct prompt optimization approaches for diffusion models: DPO-Diff (ours) and Promptist. While Promptist represents a pioneering effort, it is important to discuss why DPO-Diff remains essential.

Limitations of Promptist. Promptist utilizes the Reinforcement Learning from Human Feedback (RLHF) (Bain & Sammut, 1995; Christiano et al., 2017; Ouyang et al., 2022) approach to fine-tune a language model to generate improved prompts. RLHF relies on paired data ⟨user prompt, improved prompt⟩, which is scarce for diffusion models and challenging to curate. This is primarily because generating the improved prompts requires extensive trial-and-error by human experts, essentially performing what DPO-Diff automates. In fact, the performance limit exhibited by Promptist is exactly caused by this lack of data: the data used by Promptist from DiffusionDB predominantly features aesthetic modifiers that do not alter the semantics of the prompts. This limits its effectiveness to aesthetic enhancements, not addressing the core need for semantic accuracy in prompts. Consequently, it struggles with semantic prompt adherence and lacks flexibility in modifying prompts for tasks such as adversarial attacks.

Two complementary computational paradigms. Promptist and DPO-Diff represent two major paradigms for effectively utilizing computation: learning and searching, respectively (Sutton, 2019). The learning-based approach of Promptist enhances performance through more parameters and larger datasets, whereas the search-based approach of DPO-Diff focuses on maximizing the potential of pretrained models via post-hoc optimization. Although learning-based methods require high-quality paired data, they can be efficiently deployed once trained; on the other hand, search-based methods generate high-quality prompts but are much slower to execute. Therefore, as Sutton (2019) highlights, these paradigms are complementary rather than competitive. DPO-Diff can be leveraged to generate a high-quality dataset offline, which can subsequently train Promptist to reduce inference latency effectively. Together, they pave the way for a comprehensive solution to prompt optimization for diffusion models, positioning DPO-Diff as the first search-based solution to address this problem.

10. Conclusions

This work presents DPO-Diff, the first gradient-based framework for optimizing discrete prompts. We formulate prompt optimization as a discrete optimization problem over the text space. To improve the search efficiency, we introduce a family of compact search spaces based on relevant word substitutions, as well as design a generic computational method for computing the discrete text gradient for the diffusion model's inference process. DPO-Diff is generic: we demonstrate that it can be directly applied to effectively discover both refined prompts to aid image generation and adversarial prompts for model diagnosis. We hope that the proposed framework helps open up new possibilities in developing advanced prompt optimization methods for text-based image generation tasks.

Limitations. To motivate future work, we discuss the known limitations of DPO-Diff in Appendix A.
References

Hao, Y., Chi, Z., Dong, L., and Wei, F. Optimizing prompts for text-to-image generation. arXiv preprint arXiv:2212.09611, 2022.

Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.

Ilyas, A., Engstrom, L., Athalye, A., and Lin, J. Black-box adversarial attacks with limited queries and information. In International Conference on Machine Learning, pp. 2137–2146. PMLR, 2018.

Jang, E., Gu, S., and Poole, B. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.

Lian, L., Li, B., Yala, A., and Darrell, T. LLM-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655, 2023.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer, 2014.

Liu, N., Li, S., Du, Y., Torralba, A., and Tenenbaum, J. B. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pp. 423–439. Springer, 2022.

Liu, V. and Chilton, L. B. Design guidelines for prompt engineering text-to-image generative models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pp. 1–23, 2022.

Mokady, R., Hertz, A., Aberman, K., Pritch, Y., and Cohen-Or, D. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6038–6047, 2023.

Nie, W., Guo, B., Huang, Y., Xiao, C., Vahdat, A., and Anandkumar, A. Diffusion models for adversarial purification. arXiv preprint arXiv:2205.07460, 2022.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

Pryzant, R., Iter, D., Li, J., Lee, Y. T., Zhu, C., and Zeng, M. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.

Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp. 234–241. Springer, 2015.

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.

Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. Ladder variational autoencoders. Advances in Neural Information Processing Systems, 29, 2016.

Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

Sutton, R. The bitter lesson. Incomplete Ideas (blog), 13(1):38, 2019.
Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian,
Y., Vajda, P., Jia, Y., and Keutzer, K. Fbnet: Hardware-
aware efficient convnet design via differentiable neural
architecture search. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition,
pp. 10734–10742, 2019.
Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J.,
and Dong, Y. Imagereward: Learning and evaluating hu-
man preferences for text-to-image generation. Advances
in Neural Information Processing Systems, 36, 2024.
Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q. V., Zhou, D., and
Chen, X. Large language models as optimizers. arXiv
preprint arXiv:2309.03409, 2023.
Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., Va-
sudevan, V., Ku, A., Yang, Y., Ayan, B. K., et al. Scaling
autoregressive models for content-rich text-to-image gen-
eration. arXiv preprint arXiv:2206.10789, 2(3):5, 2022.
Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis,
S., Chan, H., and Ba, J. Large language models
are human-level prompt engineers. arXiv preprint
arXiv:2211.01910, 2022.
A. Limitations

We identify the following known limitations of the proposed method:

Search cost. Our method requires multiple passes through the diffusion model to optimize a given prompt, which incurs a modest amount of search cost. One promising solution is to use DPO-Diff to generate free paired data for RLHF (e.g., Promptist), which we leave for future work to explore.

Text encoder. Moreover, while DPO-Diff improves the faithfulness of the generated image, the performance is upper-bounded by the limitations of the underlying text encoder. For example, the CLIP text encoder used in Stable Diffusion tends to discard spatial relationships in text, which in principle must be resolved by improving the model itself, such as augmenting the diffusion model with a powerful LLM (Lian et al., 2023; Liu et al., 2022; Feng et al., 2022).

Clip loss. The clip loss used in DPO-Diff might not always align with human evaluation. Automatic scoring metrics that better reflect human judgment, similar to the reward models used in instruction fine-tuning, can further aid the discovery of improved prompts.

Synonyms generated by ChatGPT. For the adversarial attack task, ChatGPT sometimes generates incorrect synonyms. Although we use reject sampling based on sentence-embedding similarity as a post-hoc fix, it is not completely accurate. This may impact the validity of adversarial prompts, as by definition they must preserve the user's original intent. We address this in human evaluation by asking the raters to consider this factor when determining the success of an attack.
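The reject-sampling step mentioned above could be implemented along the lines of the sketch below; `embed` stands in for whatever sentence-embedding model is used, and the 0.9 cutoff is an illustrative value, not necessarily the one used in our experiments.

```python
import torch.nn.functional as F

def filter_substitutes(user_prompt, candidate_prompts, embed, threshold=0.9):
    """Keep only candidate rewrites whose sentence embedding stays close to the
    user prompt, rejecting ChatGPT substitutions that drift in meaning."""
    ref = embed(user_prompt)
    kept = []
    for cand in candidate_prompts:
        sim = F.cosine_similarity(embed(cand), ref, dim=-1)
        if sim.item() >= threshold:
            kept.append(cand)
    return kept
```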
C. Derivation of the one-step estimation of x_0

Each inference step of the diffusion model can be viewed as first computing the estimate

\hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_{t-K}}} \left( x_{t-K} - \sqrt{1 - \bar{\alpha}_{t-K}}\, \hat{\epsilon}_\theta(x_{t-K}, t - K) \right),

and then using the estimate to compute the transition probability q(x_{t-K} \mid x_t, \hat{x}_0).

Proof. To avoid cluttered notation, we use t instead of t − K for the proof below. Starting from reorganizing (2) into the one-step estimation:

\hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( x_t - \sqrt{1 - \bar{\alpha}_t}\, \hat{\epsilon}_\theta(x_t, t) \right),    (8)

where \hat{\epsilon}_\theta is the predicted error at step t by the network. Intuitively, this equation means using the current predicted error to one-step estimate x_0. Using Bayes' theorem, one can show that
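For reference, applying Bayes' theorem to the Gaussian forward process yields the familiar posterior of Ho et al. (2020), which is the standard form this step produces:

```latex
q(x_{t-1} \mid x_t, x_0)
  = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t I\right),
\qquad
\tilde{\mu}_t(x_t, x_0)
  = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0
  + \frac{\sqrt{\alpha_t}\,\left(1-\bar{\alpha}_{t-1}\right)}{1-\bar{\alpha}_t}\, x_t,
\qquad
\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t .
```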
F. Implementation details
F.1. Hyperparameters
This section details the hyperparameter choices for our experiments. We use the same set of hyperparameters for all datasets
and tasks (prompt improvement and adversarial attack), unless otherwise specified.
Model We use Stable Diffusion v1-4 with a DDIM sampler for all experiments in the main paper. The guidance scale and inference steps are set to 7.5 and 50 respectively (default). We also experimented with other versions, such as Stable Diffusion v2-1 (512 x 512 resolution) and v2 (768 x 768 resolution), and found that the results are similar across different versions. We note, though, that the high-resolution version of v2 tends to produce moderately better original images than v1-4 and v2-1 in terms of clip loss, possibly due to sharper images.
Shortcut Text Gradient We set K = 1, corresponding to a 1-step Shortcut Text Gradient. This minimizes the memory
and runtime cost while empirically producing enough signal to guide the prompt optimization. Throughout the entire
optimization episode, we progressively increase t from 15 to 25 via a fixed stepwise function. This corresponds to a
coarse-to-fine learning curriculum. We note that the performance is only marginally affected by the choice of the upper and
lower bound for t (e.g. 20-30, 10-40 all produce similar results), as long as it avoids values near 0 (diminishing gradient)
and T (excessively noisy).
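One way to realize the fixed stepwise schedule is sketched below; only the 15-to-25 range comes from the text above, and the exact breakpoints are an illustrative guess.

```python
def shortcut_t(iteration, total_iterations, t_min=15, t_max=25):
    """Coarse-to-fine curriculum: step t up from t_min to t_max over the episode."""
    frac = iteration / max(total_iterations - 1, 1)
    return t_min + round(frac * (t_max - t_min))
```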
Gumbel softmax We use Gumbel Softmax with temperature 1. The learnable parameters are initialized to 1 for the original word (for positive prompts) and the empty string (for negative prompts), and 0 otherwise. To encourage exploration, we bound the learnable parameters within 0 and 3 via hard clipping. The performance remains largely insensitive to the choice of bound, as long as it is in a reasonable range (i.e., not excessively small or large).
Optimization We optimize DPO-Diff using RMSprop with a learning rate of 0.1 and momentum of 0.5 for 20 iterations.
Each iteration will produce a single Gumbel Sample (batch size = 1) to compute the gradient, which will be clipped to 1/40.
Clip loss The specific clip loss used in our experiment is the spherical clip loss, following an early online implementation of clip-guided diffusion (Crumb, 2022):

\mathrm{spherical\_clip}(x, y) = 2 \cdot \arcsin\!\left( \frac{\lVert x - y \rVert_2}{2} \right)^{2}
Note that our method does not rely on this specific choice to function; we also experimented with other distance measures, such as cosine similarity in the CLIP embedding space, and found that they produced nearly identical prompts (and thus images).
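In code, the spherical clip loss above is typically implemented as follows (a sketch in the style of the clip-guided-diffusion snippet it follows; the inputs are CLIP image and text embeddings):

```python
import torch.nn.functional as F

def spherical_clip(x, y):
    """2 * arcsin(||x_hat - y_hat||_2 / 2)^2 on unit-normalized embeddings."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    return (x - y).norm(dim=-1).div(2).arcsin().pow(2).mul(2)
```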
Evolution Search We follow a traditional evolutionary search composed of four steps: initialize population, tournament, mutation, and crossover. The specific choice of hyperparameters is population size = 20, tournament = top 10, mutation with prob = 0.1 and size = 10, and crossover with size = 10. We run the evolutionary search for two iterations for both tasks, while we note that the prompt improvement task often converges much faster (within a single iteration).
For Synonyms:

Read the next paragraph. For each word, give 5 substitution words that do not change the meaning. Use the format of "A → B".

For Antonyms:

Read the next paragraph. For each word, give 5 opposite words if it has any. Use the format of "A → B".
DiffusionDB DiffusionDB is a giant prompt database comprised of 2M highly diverse prompts for text-to-image generation. Since these prompts are web-crawled, they are highly noisy, often containing incomplete phrases, emojis, random characters, non-imagery prompts, etc. (we refer the reader to its HuggingFace repo for an overview of the entire database). Therefore, we filter prompts from DiffusionDB by (1) asking ChatGPT to determine whether the prompt is complete and describes an image, and (2) removing emoji-only prompts. We filter a total of 4,000 prompts from DiffusionDB and use those prompts to generate images via Stable Diffusion. We sample 100 prompts with clip loss above 0.85 for prompt improvement, and above 0.8 for adversarial attacks, respectively. For ChatGPT, we found that it tends to produce prompts with much lower clip loss compared with COCO and DiffusionDB. To ensure a sufficient amount of prompts from this source is included in the dataset, we lower the cutoff threshold to 0.82 when filtering its hard prompts for the prompt improvement task.
COCO We use the captions from the 2014 validation split of MS-COCO dataset as prompts. Similar to DiffusionDB,
we filter 4000 prompts, and further sample 100 prompts with clip loss above 0.85 for prompt improvement, and 0.8 for
adversarial attack respectively.
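The filtering recipe shared by DiffusionDB and COCO can be summarized as the sketch below; `generate`, `clip_loss`, and `is_valid` (the ChatGPT completeness check plus emoji removal) are placeholder callables rather than the exact pipeline.

```python
import random

def collect_hard_prompts(prompts, generate, clip_loss, is_valid,
                         n_keep=100, threshold=0.85):
    """Keep prompts that Stable Diffusion follows poorly, i.e., clip loss above
    the threshold (0.85 for prompt improvement, 0.8 for adversarial attack)."""
    hard = [p for p in prompts
            if is_valid(p) and clip_loss(generate(p), p) > threshold]
    return random.sample(hard, min(n_keep, len(hard)))
```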
ChatGPT We also query ChatGPT for descriptions, as we found that it tends to produce more vivid and poetic descriptions
compared with the former sources. We use a diverse set of instructions for this task. Below are a few example prompts we
used to query ChatGPT for image descriptions.
A majestic waterfall cascades down a rocky cliff into a clear pool below, surrounded by lush greenery.
The sun setting behind the mountains casting a warm orange glow over the tranquil lake.
A pair of bright red, shiny high heels sit on a glossy wooden floor, with a glittering disco ball above.
A farmer plowing a field with a tractor.
The vivid orange and dark monarch butterfly was flapping through the atmosphere, alighting on a
flower to sip nectar.
We empirically observe that, on average, ChatGPT produces prompts with lower clip loss when used to generate images through Stable Diffusion, compared with DiffusionDB and COCO. Therefore, for filtering challenging prompts, we reduce the threshold from 0.85 to 0.82 to allow more prompts to be selected.
Special treatment for the Adversarial Attack task. When conducting human evaluation on adversarial attack tasks, we make the following adjustments to the protocol: (1) the wins and losses are reversed; (2) there will be no "draw", as this counts as a failed attempt; (3) removing meaning-altering successes: we asked the human evaluators to identify cases where success is achieved only because the adversarial prompt changed the meaning of the user prompt. Such instances are categorized as failures. The results of our evaluation show that DPO-Diff achieved a success rate of 44%, thereby establishing itself as the only baseline for this particular task on diffusion models.
Figure 7: More images generated by user input versus improved negative prompts using Stable Diffusion v1-4. Example user prompts include "alien caught smoking cigarettes in rented house", "a spooky ghost in a graveyard by justin gerard and tony sart", "a plane flies through the air with fumes coming out the back", "A man is seated on a floor with a computer and some papers.", and "Orange and brown cat sitting on top of white shoes."
Figure 8: More images generated by user input and adversarial prompts using Stable Diffusion v1-4. Example pairs (user → adversarial):

"A painter adding the finishing touches to a vibrant canvas." → "A craftsman incorporating the finishing touches to a vivid masterpiece."

"A skillful tailor sewing a beautiful dress with intricate details." → "A skillful tailor tailoring a lovely attire with sophisticated elements."

"portrait of evil witch woman in front of sinister deep dark forest ambience" → "image of vile mage dame in front of threatening profound dim wilderness ambience"

"Amazing photorealistic digital concept art of a guardian robot in a rural setting by a barn." → "astounding photorealistic digital theory design of a defender robot in a provincial context by a stable."

"close up portrait of a young lizard as a wizard with an epic idea" → "close up snapshot of a youthful chameleon as a magician with an heroic guess"
Figure 9: Gradients near the beginning and end of the inference process are significantly less informative. We plot the average gradient norm over all words across different timesteps (inference timestep t from 50 down to 1). For each timestep, the Shortcut Text Gradient is computed over 100 Gumbel samples.
Figure 10: Images generated by user input and improved negative prompts on Stable Diffusion XL. Example user prompts include "darth vader in iron man armour", "The ash and dark pigeon was roosting on the lamppost, observing the environment.", "a very big building with a mounted clock", "The man is sitting on the bench close to the asian section.", "Two sinks stand next to a bathtub in a bathroom.", and "A woman that is standing next to a man."