On Discrete Prompt Optimization for Diffusion Models

Ruochen Wang 1 2 Ting Liu 3 Cho-Jui Hsieh 1 2 Boqing Gong 1


Google Research Google Deepmind UCLA
https://github.com/ruocwang/dpo-diffusion

1 Google Research, 2 University of California, Los Angeles, 3 Google Deepmind. Correspondence to: Boqing Gong <[email protected]>, Ruochen Wang <[email protected]>.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

arXiv:2407.01606v1 [cs.LG] 27 Jun 2024

Abstract

This paper introduces the first gradient-based framework for prompt optimization in text-to-image diffusion models. We formulate prompt engineering as a discrete optimization problem over the language space. Two major challenges arise in efficiently finding a solution to this problem: (1) Enormous Domain Space: Setting the domain to the entire language space poses significant difficulty to the optimization process. (2) Text Gradient: Efficiently computing the text gradient is challenging, as it requires backpropagating through the inference steps of the diffusion model and a non-differentiable embedding lookup table. Beyond the problem formulation, our main technical contributions lie in solving the above challenges. First, we design a family of dynamically generated compact subspaces comprised of only the most relevant words to user input, substantially restricting the domain space. Second, we introduce “Shortcut Text Gradient”, an effective replacement for the text gradient that can be obtained with constant memory and runtime. Empirical evaluation on prompts collected from diverse sources (DiffusionDB, ChatGPT, COCO) suggests that our method can discover prompts that substantially improve (prompt enhancement) or destroy (adversarial attack) the faithfulness of images generated by the text-to-image diffusion model.

1. Introduction

Large-scale text-based generative models exhibit a remarkable ability to generate novel content conditioned on user input prompts (Ouyang et al., 2022; Touvron et al., 2023; Rombach et al., 2022; Ramesh et al., 2022; Saharia et al., 2022; Ho et al., 2022; Yu et al., 2022; Chang et al., 2023). Although these models are trained on huge corpora, there still exists a substantial gap between user intention and what the model interprets (Zhou et al., 2022; Feng et al., 2022; Rombach et al., 2022; Radford et al., 2021; Lian et al., 2023; Ouyang et al., 2022; Ramesh et al., 2022). The misalignment is even more severe in text-to-image generative models, partially since they often rely on much smaller and less capable text encoders (Radford et al., 2021; Cherti et al., 2023; Raffel et al., 2020) than large language models (LLMs). As a result, instructing a large model to produce intended content often requires laborious human effort in crafting the prompt through trial and error (a.k.a. Prompt Engineering) (Art, Year; Wang et al., 2022; Witteveen & Andrews, 2022; Liu & Chilton, 2022; Zhou et al., 2022; Hao et al., 2022). To automate this process for language generation, several recent attempts have shown tremendous potential in utilizing LLMs to enhance prompts (Pryzant et al., 2023; Zhou et al., 2022; Chen et al., 2023; Guo et al., 2023; Yang et al., 2023; Hao et al., 2022). However, efforts on text-to-image generative models remain scarce and preliminary, probably due to the challenges faced by these models’ relatively small text encoders in understanding subtle language cues.

DPO-Diff. This paper presents a systematic study of prompt optimization for text-to-image diffusion models. We introduce a novel optimization framework based on the following key observations. 1) Prompt engineering for diffusion models can be formulated as a Discrete Prompt Optimization (DPO-Diff) problem over the space of natural languages. Moreover, the framework can be used to find prompts that either improve (prompt enhancement) or destroy (adversarial attack) the generation process, by simply reversing the sign of the objective function. 2) We show that for diffusion models with classifier-free guidance (Ho & Salimans, 2022), improving the image generation process is more effective when optimizing “negative prompts” (Andrew, 2023; Woolf, 2022) than positive prompts. Beyond the problem formulation of DPO-Diff, where “Diff” highlights our focus on text-to-image diffusion models, the main technical contributions of this paper lie in efficient methods for solving this optimization problem, including the design of compact domain spaces and a gradient-based algorithm.

Compact domain spaces. DPO-Diff’s domain space is a discrete search space at the word level to represent prompts. While this space is generic enough to cover any sentence, it is excessively large due to the dominance of words irrelevant to the user input. To alleviate this issue, we design a family of dynamically generated compact search spaces based on relevant word substitutions, for both positive and negative prompts. These subspaces enable efficient search for both prompt enhancement and adversarial attack tasks.

Shortcut Text Gradients for DPO-Diff. Solving DPO-Diff with a gradient-based algorithm requires computing the text gradient, i.e., backpropagating from the generated image, through all inference steps of a diffusion model, and finally to the discrete text. Two challenges arise in obtaining this gradient: 1) This process incurs compound memory-runtime complexity over the number of backward passes through the denoising step, making it prohibitive to run on large-scale diffusion models (e.g., an 870M-parameter Stable Diffusion v1 requires ∼750G memory to run backpropagation through 50 inference steps (Rombach et al., 2022)). 2) The embedding lookup tables in text encoders are non-differentiable. To reduce the computational cost in 1), we provide a generic replacement for the text gradient that bypasses the need to unroll the inference steps in a backward pass, allowing it to be computed with constant memory and runtime. To backpropagate through the discrete embedding lookup table, we continuously relax the categorical word choices to a learnable smooth distribution over the vocabulary, using the Gumbel Softmax trick (Guo et al., 2021; Jang et al., 2016; Dong & Yang, 2019). The gradient obtained by this method, termed Shortcut Text Gradient, enables us to efficiently solve DPO-Diff regardless of the number of inference steps of a diffusion model.

[Figure 1 panels: Compact Search Spaces for Positive and Negative Prompts (synonyms and antonyms from an LLM, ChatGPT); Gumbel Softmax; Shortcut Gradient (without unrolling). Example user prompt: "Amidst the bustling city, neon lights illuminate the vibrant streets." Example substitutions: Bustling -> Busy, hectic, active (synonyms); Bustling -> Calm, quiet, tranquil (antonyms).]

Figure 1: Computational procedure of Shortcut Text Gradient (Bottom) vs. Full Gradient (Top) on text.

To evaluate our prompt optimization method for the diffusion model, we collect and filter a set of challenging prompts from diverse sources including DiffusionDB (Wang et al., 2022), COCO (Lin et al., 2014), and ChatGPT (Ouyang et al., 2022). Empirical results suggest that DPO-Diff can effectively discover prompts that improve (or destroy for adversarial attack) the faithfulness of text-to-image diffusion models, surpassing human-engineered prompts and prior baselines by a large margin. We summarize our primary contributions as follows:

• DPO-Diff: A generic framework for prompt optimization as a discrete optimization problem over the space of natural languages, for arbitrary metrics.
• Compact domain spaces: A family of dynamic compact search spaces, over which a gradient-based algorithm enables efficient solution finding for the prompt optimization problem.
• Shortcut Text Gradients: A novel computation method, the first to enable backpropagation through the diffusion models’ lengthy sampling steps with constant memory-runtime complexity, enabling gradient-based search algorithms.
• Negative prompt optimization: The first empirical result demonstrating the effectiveness of optimizing negative prompts for diffusion models.

2. Related Work

Text-to-image diffusion models. Diffusion models trained on a large corpus of image-text datasets significantly advanced the state of text-guided image generation (Rombach et al., 2022; Ramesh et al., 2022; Saharia et al., 2022; Chang et al., 2023; Yu et al., 2022). Despite the success, these models can sometimes generate images with poor quality. While some preliminary observations suggest that negative prompts can be used to improve image quality (Andrew, 2023; Woolf, 2022), there exists no principled way to find negative prompts. Moreover, several studies have shown that large-scale text-to-image diffusion models face significant challenges in understanding language cues in user input during image generation. In particular, diffusion models often generate images with missing objects and incorrectly bound attribute-object pairs, resulting in poor “faithfulness” or “relevance” (Hao et al., 2022; Feng et al., 2022; Lian et al., 2023; Liu et al., 2022). Existing solutions to this problem include compositional generation (Liu et al., 2022), augmenting diffusion models with large language models (Yang et al., 2023), and manipulating attention masks (Feng et al., 2022). As a method orthogonal to them, our work reveals that negative prompt optimization can also alleviate this issue.

Prompt optimization for text-based generative models. Aligning a pretrained large language model (LLM) with human intentions is a crucial step toward unlocking the potential of large-scale text-based generative models (Ouyang et al., 2022; Rombach et al., 2022). An effective line of training-free alignment methods is prompt optimization (PO) (Zhou et al., 2022). PO originated from in-context learning (Dale, 2021), which is mainly concerned with various arrangements of task demonstrations. It later evolved into automatic prompt engineering, where powerful language models are utilized to refine prompts for certain tasks (Zhou et al., 2022; Pryzant et al., 2023; Yang et al., 2023; Hao et al., 2022). While PO has been widely explored for LLMs, efforts on diffusion models remain scarce. The most relevant prior work to ours is Promptist (Hao et al., 2022), which finetunes an LLM via reinforcement learning from human feedback (Ouyang et al., 2022) to augment user prompts with artistic modifiers (e.g., high-resolution, 4K) (Art, Year), resulting in aesthetically pleasing images. However, the lack of paired context-aware data significantly limits its ability to follow the user intention (Figure 3).

Textual Inversion. Optimizing texts in pretrained diffusion models has also been explored under the “Textual Inversion” task (Gal et al., 2022; Wen et al., 2023; Mokady et al., 2023). Textual Inversion involves adapting a frozen model to generate novel visual concepts based on a set of user-provided images. It achieves this by distilling these images into soft or hard text prompts, enabling the model to replicate the visual features of the user images. Since the source images are provided, the training process mirrors that of typical diffusion model training. While some Textual Inversion papers also use the term “prompt optimization”, it is distinct from the Prompt Optimization considered by Promptist (Hao et al., 2022) and our work. Our objective is to enhance a model’s ability to follow text prompts. Here, the primary input is the user prompt, and improvement is achieved by optimizing this prompt to enhance the resulting image. Since the score function is applied to the final generated image, the optimization process necessitates backpropagation through all inference steps. Despite using similar terminologies, these methodologies are fundamentally distinct and not interchangeable. Table 3 further summarizes the key differences in taxonomy.

Efficient Backpropagation through diffusion sampling steps. Text-to-image diffusion models generate images via a progressive denoising process, making multiple passes through the same network (Ho et al., 2020). When a loss is applied to the output image, computing the gradient w.r.t. any model component (text, weight, sampler, etc.) requires backpropagating through all the sampling steps. This process incurs compound complexity over the number of backward passes in both memory and runtime, making it infeasible to run on regular commercial devices. Existing efforts achieve constant memory via gradient checkpointing (Watson et al., 2021) or solving an augmented SDE problem (Nie et al., 2022), at the expense of even higher runtime.

3. Preliminaries on diffusion model

Denoising diffusion probabilistic models. On a high level, diffusion models (Ho et al., 2020) are a type of hierarchical Variational Autoencoder (Sønderby et al., 2016) that generates samples by reversing (backward) a progressive noisification process (forward). Let $x_0, \dots, x_T$ be a series of intermediate samples of increasing noise levels. The forward process progressively adds Gaussian noise to the original image $x_0$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big), \tag{1}$$

where $\beta_t$ is a scheduling variable. Using the reparameterization trick, each $x_t$ ($t = 1, \dots, T$) can be computed from $x_0$ in one step:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \tag{2}$$

$$\text{where } \alpha_t = 1-\beta_t \text{ and } \bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i, \tag{3}$$

where $\epsilon$ is standard Gaussian noise. The reverse process starts with standard Gaussian noise, $x_T \sim \mathcal{N}(0, I)$, and progressively denoises it using the following joint distribution:

$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \quad \text{where } p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma\big).$$

While the mean function $\mu_\theta(x_t, t)$ can be parameterized by a neural network (e.g., UNet (Rombach et al., 2022; Ronneberger et al., 2015)) directly, prior studies found that modeling the residual error $\epsilon(x_t, t)$ instead works better empirically (Ho et al., 2020). The two strategies are mathematically equivalent as

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \Big( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon(x_t, t) \Big).$$
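The equivalence between the mean and noise parameterizations can be checked in a few lines. The worked derivation below is a sketch rather than part of the original text: it assumes the standard DDPM posterior mean $\tilde{\mu}_t$ of $q(x_{t-1} \mid x_t, x_0)$ from Ho et al. (2020), which is not written out in this section.

```latex
% Assumed starting point: posterior mean of q(x_{t-1} | x_t, x_0) in standard DDPM.
\tilde{\mu}_t(x_t, x_0)
  = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0
  + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t .

% Invert the one-step forward formula, Eq. (2), to express x_0 through x_t and the noise:
x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \Big( x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon \Big).

% Substitute, using \bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1} and \beta_t = 1-\alpha_t:
\tilde{\mu}_t
  = \frac{\beta_t + \alpha_t (1-\bar{\alpha}_{t-1})}{\sqrt{\alpha_t}\,(1-\bar{\alpha}_t)}\, x_t
  - \frac{\beta_t}{\sqrt{\alpha_t}\,\sqrt{1-\bar{\alpha}_t}}\, \epsilon
  = \frac{1}{\sqrt{\alpha_t}} \Big( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon \Big).
```

The last step uses $\beta_t + \alpha_t(1-\bar{\alpha}_{t-1}) = 1-\bar{\alpha}_t$, so the coefficient of $x_t$ collapses to $1/\sqrt{\alpha_t}$. Predicting the noise therefore determines the mean, and vice versa.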

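To make the one-step forward formula in Eqs. (2) and (3) concrete, the following is a minimal sketch in plain Python. The linear beta schedule, its endpoint values, the toy three-element "image", and all function names are illustrative assumptions, not taken from the paper. The last helper inverts Eq. (2) to recover $x_0$ from $x_t$ when the noise is known.

```python
import math
import random


def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear schedule beta_1..beta_T (endpoint values are assumptions)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{i<=t} (1 - beta_i), as in Eq. (3)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, abar, rng):
    """Draw x_t from x_0 in a single step, as in Eq. (2); t is 1-indexed."""
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps


def predict_x0(xt, eps, t, abar):
    """Invert Eq. (2): x_0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)."""
    a = abar[t - 1]
    return [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a) for x, e in zip(xt, eps)]


rng = random.Random(0)
abar = alpha_bars(linear_beta_schedule(T=1000))
x0 = [0.5, -1.0, 2.0]  # toy stand-in for an image
xt, eps = q_sample(x0, t=500, abar=abar, rng=rng)
x0_rec = predict_x0(xt, eps, t=500, abar=abar)  # matches x0 up to rounding
```

In a real model the true noise is unknown at inference time, so a network's noise prediction would take the place of `eps` in `predict_x0`, and the recovered image would only be an estimate.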

Conditional generation and negative prompts. The where d(s, suser ) is a distance measure that forces the per-
above formulation can be easily extended to conditional gen- turbed prompt (s) to be semantically similar to the user input
eration via classifier-free guidance (Ho & Salimans, 2022), (suser ).
widely adopted in contemporary diffusion models. At each
sampling step, the predicted error ϵ̃ is obtained by subtract- 5. Compact search spaces for efficient prompt
ing the unconditional signal (c(“”)) from the conditional
signal (c(s)), up to a scaling factor w:
discovery
While the entire language space facilitates maximal gener-
ϵ̃θ (xt , c(s),t) = (1 + w)ϵθ (xt , c(s), t) − wϵθ (xt , c(“”), t). ality, it is also unnecessarily inefficient as it is popularized
(4) with words irrelevant to the task. We propose a family of
compact search spaces that dynamically extracts a subset of
If we replace this empty string with an actual text, then it task-relevant words to the user input.
becomes a Negative Prompt (Andrew, 2023; Woolf, 2022),
instructing the model what to exclude from the generated
image. 5.1. Application 1: Discovering adversarial prompts for
model diagnosis
4. DPO-Diff Framework Synonym Space for adversarial attack. In light of the
constraint on semantic similarity in (6), we build a search
Formulation Our main insight is that prompt engineering space for the adversarial prompts by substituting each word
can be formulated as a discrete optimization problem in the in the user input suser with its synonyms (Alzantot et al.,
language space. Concretely, we represent the problem do- 2018), preserving the meaning of the original sentence. The
main S as a sequence of M words wi from a predefined synonyms can be found by either dictionary lookup or query-
vocabulary V: S = {w1 , w2 , . . . wM |∀i, wi ∈ V}. This ing ChatGPT (Appendix F.2).
space is generic enough to cover all possible sentences of
lengths less than M (when the empty string is present). Let 5.2. Application 2: Discovering enhanced prompts for
G(s) denote a text-to-image generative model, and suser , image generation
s denote the user input and optimized prompt, respectively.
The optimization problem can be written as While the Synonym Space is suitable for attacking diffu-
sion models, we found that it performs poorly on find-
min L(G(s), suser ) (5) ing improved prompts. This is in contradiction to LLMs
s∈S
where rephrasing user prompts can often lead to substan-
where L can be any objective function that measures the tial gains (Zhou et al., 2022). One plausible reason is that
effectiveness of the learned prompt when used to generate contemporary diffusion models often rely on small-scale
images. Following previous works (Hao et al., 2022), we text encoders (Radford et al., 2021; Cherti et al., 2023;
use clip loss CLIP(I, suser ) (Crumb, 2022) to measure the Raffel et al., 2020) that are much weaker than LLMs with
instruction-following ability of the diffusion model. many known limitations in understanding subtle language
cues (Feng et al., 2022; Liu et al., 2022; Yang et al., 2023).
Antonym Space for negative prompt optimization. In-
Application DPO-Diff framework is versatile for
spired by these observations, we propose a novel solution
handling not only prompt enhancement but also adversarial
to optimize for negative prompts instead — a unique con-
attack tasks. Figure 1 illustrates the taxonomy of those
cept that rises from classifier-free guidance (Ho & Sali-
two applications. Adversarial attacks for text-to-image
mans, 2022) used in diffusion models (Section 3). Recall
generative models can be defined as follows:
that negative prompts instruct the diffusion model to re-
Definition 4.1. Given a user input suser , the attacker aims move contents in generated images, opposite to the pos-
at slightly perturbing suser to disrupt the prompt-following itive prompt; Intuitively, the model’s output image can
ability of image generation, i.e., the resulting generated safely exclude the content with the opposite meaning to
image is no longer describable by suser . the words in the user input, thereby amplifying the con-
To modify (5) into the adversarial attack, we can simply add cepts presented in the positive prompt. We thereby build
a negative sign to the objective function (L), and restrict the space of negative prompts from the antonyms of each
the distance between an adversarial prompt (s) and user word in the user prompt. The antonyms of words can
input (suser ). Mathematically, this can be written as the also be obtained either via dictionary lookup or querying
following: ChatGPT. However unlike synonyms space, we concate-
nate the antonyms directly in comma separated format,
min −L(G(s), suser ) s.t. d(s, suser ) ≤ λ, (6) mirroring the practical usage of negative prompts. To the
s∈S best of our knowledge, this is the first exploratory work on

4
On Discrete Prompt Optimization for Diffusion Models

automated negative prompt optimization. analytically, with complexity independent of t. Note that
the above estimation of x0 is not a trick — it directly comes
6. A Gradient-based solver for DPO-Diff from a mathematically equivalent interpretation of the dif-
fusion model, where each inference step can be viewed as
Due to the query efficiency of white-box algorithms leverag- computing x̂0 and plugging it into q(xt−K |xt , x̂0 ) to ob-
ing gradient information, we also explore a gradient-based tain the transitional probability (See Appendix C for the
method to solve (5) and (6). However, obtaining the text derivation).
gradient is non-trivial due to two major challenges. 1) Back-
Remark 1: Complexity Analysis With Shortcut Text Gra-
propagating through the sampling steps of the diffusion
dient, the computational cost of backpropagating through
inference process incurs high complexity w.r.t. memory and
the inference process can be reduced to K-times backward
runtime, making it prohibitively expensive to obtain gradi-
passes of UNet. When we set t = T and K = T , it becomes
ents (Watson et al., 2021; Nie et al., 2022). For samplers
the full-text gradient; When K = 1, the computation costs
with 50 inference steps (e.g., DDIM (Song et al., 2020)), it
reduce to a single backward pass. Remark 2: Connection
raises the runtime and memory cost by 50 times compared
to ReFL (Xu et al., 2024). ReFL is a post-hoc alignment
to a single diffusion training step. 2) To further compute the
method for finetuning diffusion models. It also adopts the
gradient on text, the backpropagation needs to pass through
estimation of x0 when optimizing diffusion model against a
a non-differentiable embedding lookup table. To alleviate
scorer, which is mathematically equivalent to the case when
these issues, we propose Shortcut Text Gradient, an effi-
K = 1.
cient replacement for text gradient that can be obtained with
constant memory and runtime. Our solution to (1) and
6.1.2. BACKPROPAGATING THROUGH EMBEDDINGS
(2) are discussed in Section 6.1.1 and Section 6.1.2 respec-
LOOKUP TABLE
tively. Moreover, Section 6.2 discusses how to sample from
the learned text distribution via evolutionary search. In diffusion models, a tokenizer transforms text input into
indices, which will be used to query a lookup table for cor-
6.1. Shortcut Text Gradient responding word embeddings. To allow further propagating
gradients through this non-differentiable indexing operation,
6.1.1. BACKPROPAGATING THROUGH DIFFUSION we relax the categorical choice of words into a continuous
SAMPLING STEPS
probability of words and learn a distribution over them. We
To efficiently backpropagate the loss from the final image parameterize the distribution using Gumbel Softmax (Jang
to intermediate feature at an arbitrary step, our key idea is et al., 2016) with uniform temperature (η = 1):
to trim the computation graph down to only a few steps |V|
from both ends, resulting in a constant number of back-
X exp ((log αi + gi )/η)
ẽ = ei ∗ P|V| (7)
ward passes (Figure 1. To achieve this, three operations are i=1 i=1 exp ((log αi + gi )/η)
required through the image generation process:
where α (a |V|-dimensional vector) denotes the learnable
(1) Sampling without gradient from step T (noise) to t. We parameter, g denotes the Gumbel random variable, ei is the
disable gradients up to step t, thereby eliminating the need embedding of word i, and ẽ is the output mixed embedding.
for backpropagation from T to t.
6.2. Efficient sampling with Evolutionary Search
(2) Enable gradient from t to t − K. The backward compu-
tation graph is enabled for the K step starting at t. To efficiently sample candidate prompts from the learned
(3) Estimating x0 directly from xt−K . To bypass the fi- Gumbel “distribution”, we adopt evolutionary search,
nal t − K steps of UNet, a naive solution is to directly known for its sample efficiency (Goldberg, 1989; Wu et al.,
decode and feed the noisy image xt−K to the loss function. 2019). Our adaptation of the evolutionary algorithm to
However, due to distribution shifts, these intermediate im- the prompt optimization task involves three key steps: (1)
ages often cannot be properly interpreted by downstream Genotype Definition: We define the genotype of each can-
modules such as VAE decoder (Rombach et al., 2022) and didate prompt as the list of searched words from the compact
CLIP (Dhariwal & Nichol, 2021). Instead, we propose to search space, where modifications to the genotype corre-
use the following closed-form estimation of the final image spond to edits the word choices in the prompt. (2) Popula-
x̂0 (Song et al., 2020) to bridge the gap: tion Initialization: We initialize the algorithm’s population
with samples drawn from the learned Gumbel distribution
1 p to bias the starting candidates towards regions of high poten-
x̂0 = √ (xt−K − 1 − ᾱt−K ϵ̂θ (xt−K , t − K)) tial. (3) Evolutionary Operations: We execute a standard
ᾱt−K
evolutionary search, including several rounds of crossover
This way, the Jacobian of x̂0 w.r.t. xt−K can be computed and mutation (Goldberg, 1989), culminating in the selection

5
On Discrete Prompt Optimization for Diffusion Models

Table 1: Quantitative comparison of different prompt-


ing methods. We evaluate the generated images using
both Spherical CLIP loss and Human Preference Score v2
(HPSv2) score (renormalized to 0-100) - a score trained to
mimic human preferences on images generated from text.
Our method achieves the best result on both prompt
improvement and adversarial attack among all methods,
including the previous SOTA - Promptist.
DiffusionDB COCO ChatGPT
Attack
CLIP↑ HPSv2↓ CLIP↑ HPSv2↓ CLIP↑ HPSv2↓
User 0.76 ± 0.03 75.28 ± 8.54 0.77 ± 0.03 75.28 ± 8.54 0.77 ± 0.02 73.57 ± 10.81
Figure 2: Win Rate of DPO-Diff versus Promptist on prompt DPO-Diff 0.86 ± 0.05 40.52 ± 11.88 0.94 ± 0.04 45.85 ± 10.18 0.95 ± 0.05 39.73 ± 16.73
improvement task with Human Evaluation. DPO-Diff DiffusionDB COCO ChatGPT
Improve CLIP↓ HPSv2↑ CLIP↓ HPSv2↑ CLIP↓ HPSv2↑
surpasses or matches the performance of Promptist 79% of User 0.87 ± 0.02 48.81 ± 09.71 0.87 ± 0.01 50.33 ± 4.85 0.84 ± 0.01 53.36 ± 5.17
Manual 0.89 ± 0.04 51.43 ± 10.29 - - - -
times on SD-v1 and 88% of times on SD-XL. Promptist 0.88 ± 0.02 54.39 ± 12.47 0.87 ± 0.03 50.08 ± 7.43 0.85 ± 0.02 59.32 ± 6.50
DPO-Diff 0.81 ± 0.03 62.37 ± 12.48 0.82 ± 0.02 61.26 ± 0.77 0.78 ± 0.03 67.71 ± 6.46

of the top candidate as the optimized prompt. Details of the


complete DPO-Diff algorithm, including specific hyperpa- between 0.75 and 0.85 for most inputs. The K for the
rameters, are available in Algorithm 1 of Appendix D and Shortcut Text Gradient is set to 1, as it produces effective
discussed further in Appendix F.1. supervision signals with minimal cost. To generate the
search spaces, we prompt ChatGPT (gpt-4-1106-preview)
Remark: Extending DPO-Diff to Blackbox Settings. In for at most 5 substitutes of each word in the user prompt.
cases where the model is only accessible through forward Furthermore, we use a fixed set of hyperparameters for both
API, our Evolutionary Search (ES) module can be used as prompt improvement and adversarial attacks. We include a
a stand-alone black-box optimizer, thereby expanding the detailed discussion on all the hyperparameters and search
applicability of our framework. As further ablated in Section space generation in Appendix F.
8.1, ES archives descent results with enough queries.
7.2. Application 1 - Adversarial Attack
7. Experiments Unlike RLHF-based prompt-engineering methods (e.g.
Promptist (Hao et al., 2022)) that require finetuning a prompt
7.1. Experimental Setup
generator when adapting to a new task, DPO-Diff, as a train-
Dataset preparation. To encourage semantic diver- free method, can be seamlessly applied to finding adversar-
sity, we collect a prompt dataset from three sources: ial prompts by simply reversing the sign of the objective
DiffusionDB (Wang et al., 2022), ChatGPT generated function.
prompts (Ouyang et al., 2022), and COCO (Lin et al., 2014).
In this section, we demonstrate that DPO-Diff is capable
For each source, we filter 100 “hard prompts” with a clip
of discovering adversarial prompts that destroy the prompt-
loss higher (lower for adversarial attack) than a threshold,
following ability of Stable Diffusion.
amounting to 600 prompts in total for two tasks. Due to
space limit, we include preparation details in Appendix G.1. As suggested by (6), a successful adversarial prompt must
not change the original intention of the user prompt. While
Evaluation Metrics. All methods are evaluated quantita- we specified this constraint to ChatGPT when building the
tively using the clip loss (Crowson et al., 2022) and Human Synonyms Space, occasionally ChatGPT might mistake a
Preference Score v2 (HPSv2). HPSv2 is a CLIP-based word for the synonyms. To address this, during the evolu-
model trained to predict human preferences on images gen- tionary search phase, we perform rejection sampling to
erated from text. For base models, we adopt Stable Diffusion refuse candidate prompts that have different meanings
v1-4. Each prompt is evaluated under two random seeds (shared across different methods). Besides automatic evaluation metrics, we also conduct human evaluations on the generated images, following the protocol specified in Appendix G.2.

Optimization Parameters. We use the Spherical CLIP Loss (Crumb, 2022) as the objective function, which ranges

to the user input. Concretely, we enforce their cosine similarity in embedding space to be higher than 0.9 (more on this can be found in Appendix G).

Table 1 summarizes the quantitative results. Our method is able to perturb the original prompt in adversarial directions, resulting in a substantial increase in the clip loss. Figure 4 also visualizes a set of intriguing images generated by the adversarial prompts. We can see that DPO-Diff can effectively explore the text regions that Stable Diffusion fails to interpret.

Human Evaluation. We further ask human judges to check whether the attack generated by DPO-Diff is successful or not. Since previous prompt optimization methods do not apply to this task, we only ask the evaluators to compare DPO-Diff against the original image. DPO-Diff achieves an average success rate (ASR) of 44% on SD-v1. Considering that Stable Diffusion models are trained on a large caption corpus, this success rate is fairly substantial.

Figure 3: Example images generated by improved negative prompts from DPO-Diff vs. Promptist (more in Figure 7). Compared with Promptist, DPO-Diff was able to generate images that better capture the content in the original prompt.

  User Input: The yellow sun was descending beyond the violet peaks, coloring the sky with hot shades.
  Promptist - Modifiers: by Greg Rutkowski and Raymond Swanland, ..., ultra realistic digital art
  Negative Prompts by DPO-Diff: red, soaring, red, valleys, white, floor, Plain, body, focus, surreal

  User Input: A dedicated gardener tending to a ... bonsai tree.
  Promptist - Modifiers: intricate, elegant, highly detailed, ..., sharp focus, illustration
  Negative Prompts by DPO-Diff: irresponsible, overlooking, huge, herb, ...

  User Input: magical ... bear with glowing magical marks ...
  Promptist - Modifiers: D&D, fantasy, cinematic lighting, ..., art by artgerm and greg ...
  Negative Prompts by DPO-Diff: normal, elephant, ..., heaps, tundra, advance, Boring, black, ...

Figure 4: Example images generated by adversarial prompts from DPO-Diff. While keeping the overall meaning similar to the user input, adversarial prompts completely destroy the prompt-following ability of the Stable Diffusion model (more in Figure 8).

  User Input: A vibrant sunset casting hues of orange and pink.
  Adversarial Prompt by DPO-Diff: The vibrant sundown casting tones of orange plus blush.

  User Input: A group of friends gather around a table for a meal.
  Adversarial Prompt by DPO-Diff: A party of friends cluster around a surface for a food

  User Input: oil painting of a mountain landscape
  Adversarial Prompt by DPO-Diff: grease picture illustrating one mountain view

User prompts shown in Figure 5:
  A bunch of luggage that is in front of a truck.
  There are cranes in the water and a boat in the distance.
  harry potter shrek, movie poster, movie still, ...

Figure 5: Evolution of the optimized images from DPO-Diff at iteration 0, 10, 20, 40, and 80 (left to right). Noticeable improvements can be observed as early as 10 iterations, and the progression is surprisingly interpretable.

7.3. Application 2: Prompt Improvement

In this section, we apply DPO-Diff to craft prompts that improve the prompt-following ability of the generated images. We compare our method with three baselines: (1) User Input. (2) Human Engineered Prompts (available only on DiffusionDB) (Wang et al., 2022). (3) Promptist (Hao et al., 2022), trained to mimic the human-crafted prompts provided in DiffusionDB.

Table 1 summarizes the results. Among all methods, DPO-Diff achieves the best results under both Spherical CLIP loss and Human Preference Score (HPSv2). On the other hand, our findings suggest that both human-engineered and Promptist-optimized prompts do not improve the relevance between generated images and user intention. The reason is that these methods merely add a set of aesthetic modifiers to the original prompt, which are irrelevant to the semantics of the user input. This can be further observed from the qualitative examples in Figure 3, where images generated by Promptist often do not follow the prompts well either.

Human Evaluation. We further ask human judges to rate DPO-Diff and Promptist on how well the generated images follow the user prompt. Figure 2 summarizes the win/draw/loss rate of DPO-Diff against Promptist. The result shows that DPO-Diff surpasses or matches Promptist in human ratings 79% of the time on SD-v1.

7.4. Qualitative analysis of search progression

To examine the convergence of our search algorithm qualitatively, we plot the progression of optimized images at various evaluation stages. We set the target iterations at 0 (the original image), 10, 20, 40, and 80 to illustrate the changes, and showcase the image with the highest clip loss among all evaluated candidates at each iteration.

Figure 5 illustrates some example trajectories. In most cases, the images exhibit noticeable improvement in aligning with the user's prompt as early as the 10th iteration, and continue to improve. Moreover, the progression is surprisingly interpretable. For instance, with the prompt "A bunch of luggage in front of a truck," the initial image fails to include any luggage, featuring only the truck. However, as the optimization continues, we can see that DPO-Diff incrementally adds more luggage to the scene.

8. Ablation Study

We conduct ablation studies on DPO-Diff using 30 randomly sampled prompts, 10 from each source. Each search algorithm is run under 4 random seeds.

8.1. Comparison of different search algorithms

We compare four search algorithms for DPO-Diff: Random Search (RS), Evolutionary Prompt Optimization (EPO), Gradient-based Prompt Optimization (GPO), and the full algorithm (GPO + EPO). Figure 6 shows their performance under different search budgets (number of evaluations).¹ While GPO tops EPO under low budgets, it also plateaus quicker, as randomly drawing from the learned distribution is sample-inefficient. Combining GPO with EPO achieves the best overall performance.

¹ Since the runtime of backpropagation through one-step diffusion sampling is negligible w.r.t. the full sampling process (50 steps for DDIM sampler), we count it the same as one inference step.

8.2. Negative prompt vs. positive prompt optimization

One finding in our work is that optimizing negative prompts (Antonyms Space) is more effective than optimizing positive prompts (Synonyms Space) for Stable Diffusion. To verify the strength of these spaces, we randomly sample 100 prompts for each space and compute the average clip loss of their generated images. Table 2 suggests that Antonyms Space contains


candidates with consistently lower clip loss than Synonyms Space.

Figure 6: Learning curves of different search algorithms in solving DPO-Diff. (Two panels; x-axis: Number of Evaluations (1 eval = 50 DDIM steps), from 0 to 80; y-axis: Clip loss; curves: Random Search, Evolutionary Prompt Optimization, Gradient Prompt Optimization, Hybrid (GPO + EPO).)

Table 2: Quantitative evaluation of optimizing negative prompts (w/ Antonyms Space) and positive prompts (w/ Synonyms Space) for Stable Diffusion.

Prompt            DiffusionDB        ChatGPT            COCO
User Input        0.8741 ± 0.0203    0.8159 ± 0.0100    0.8606 ± 0.0096
Positive Prompt   0.8747 ± 0.0189    0.8304 ± 0.0284    0.8624 ± 0.0141
Negative Prompt   0.8579 ± 0.0242    0.8133 ± 0.0197    0.8403 ± 0.0210

9. Discussion on the Search vs. Learning paradigms for utilizing computation

This section elucidates the relationship between two distinct prompt optimization approaches for diffusion models: DPO-Diff (ours) and Promptist. While Promptist represents a pioneering effort, it is important to discuss why DPO-Diff remains essential.

Limitations of Promptist. Promptist utilizes the Reinforcement Learning from Human Feedback (RLHF) (Bain & Sammut, 1995; Christiano et al., 2017; Ouyang et al., 2022) approach to fine-tune a language model to generate improved prompts. RLHF relies on paired data ⟨user prompt, improved prompt⟩, which is scarce for diffusion models and challenging to curate. This is primarily because generating the improved prompts requires extensive trial-and-error by human experts, essentially performing what DPO-Diff automates. In fact, the performance limit exhibited by Promptist is caused precisely by this lack of data: the data used by Promptist from DiffusionDB predominantly features aesthetic modifiers that do not alter the semantics of the prompts. This limits its effectiveness to aesthetic enhancements rather than addressing the core need for semantic accuracy in prompts. Consequently, it struggles with semantic prompt adherence and lacks flexibility in modifying prompts for tasks such as adversarial attacks.

Two complementary computational paradigms. Promptist and DPO-Diff represent two major paradigms for effectively utilizing computation: learning and searching, respectively (Sutton, 2019). The learning-based approach of Promptist enhances performance through more parameters and larger datasets, whereas the search-based approach of DPO-Diff focuses on maximizing the potential of pretrained models via post-hoc optimization. Although learning-based methods require high-quality paired data, they can be efficiently deployed once trained. On the other hand, search-based methods generate high-quality prompts, but are much slower to execute. Therefore, as Sutton (2019) highlights, these paradigms are complementary rather than competitive. DPO-Diff can be leveraged to generate high-quality datasets offline, which can subsequently be used to train Promptist and thereby reduce inference latency. Together, they pave the way for a comprehensive solution to prompt optimization for diffusion models, positioning DPO-Diff as the first search-based solution to address this problem.

10. Conclusions

This work presents DPO-Diff, the first gradient-based framework for optimizing discrete prompts. We formulate prompt optimization as a discrete optimization problem over the text space. To improve the search efficiency, we introduce a family of compact search spaces based on relevant word substitutions, and design a generic method for computing the discrete text gradient for the diffusion model's inference process. DPO-Diff is generic: we demonstrate that it can be directly applied to effectively discover both refined prompts to aid image generation and adversarial prompts for model diagnosis. We hope that the proposed framework helps open up new possibilities in developing advanced prompt optimization methods for text-based image generation tasks.

Limitations. To motivate future work, we discuss the known limitations of DPO-Diff in Appendix A.


Acknowledgements

The work is partially supported by NSF 2048280, 2331966, 2325121, 2244760, ONR N00014-23-1-2300, and finished during the primary contributor’s internship at Google. Special thanks to Liangzhe Yuan, Long Zhao, and Han Zhang for providing invaluable guidance and accommodations throughout the internship.

Impact Statement

This work makes contributions to both research and practical applications of text-to-image (T2I) generation. For the research community, we introduce a new paradigm to optimize prompts for text-to-image generation, demonstrating promising results across various prompts, models, and metrics. This approach could provide valuable insights for future studies on diffusion models. For industrial applications, our method can be easily adopted by T2I generation service providers to improve the performance of their models, or used as an offline data generator for training prompt agents.

References

Alzantot, M., Sharma, Y., Elgohary, A., Ho, B.-J., Srivastava, M., and Chang, K.-W. Generating natural language adversarial examples. arXiv preprint arXiv:1804.07998, 2018.
Andrew. How to use negative prompts?, 2023. URL https://lexica.art/.
Art, L. Lexica, Year. URL https://lexica.art/.
Bain, M. and Sammut, C. A framework for behavioural cloning. In Machine Intelligence 15, pp. 103–129, 1995.
Chang, H., Zhang, H., Barber, J., Maschinot, A., Lezama, J., Jiang, L., Yang, M.-H., Murphy, K., Freeman, W. T., Rubinstein, M., et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
Chen, L., Chen, J., Goldstein, T., Huang, H., and Zhou, T. Instructzero: Efficient instruction optimization for black-box large language models. arXiv preprint arXiv:2306.03082, 2023.
Cheng, M., Le, T., Chen, P.-Y., Yi, J., Zhang, H., and Hsieh, C.-J. Query-efficient hard-label black-box attack: An optimization-based approach. arXiv preprint arXiv:1807.04457, 2018.
Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., and Jitsev, J. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2818–2829, 2023.
Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.
Crowson, K., Biderman, S., Kornis, D., Stander, D., Hallahan, E., Castricato, L., and Raff, E. Vqgan-clip: Open domain image generation and editing with natural language guidance. In European Conference on Computer Vision, pp. 88–105. Springer, 2022.
Crumb. Clip-guided stable diffusion, 2022. URL https://crumbly.medium.com/.
Dale, R. Gpt-3: What’s it good for? Natural Language Engineering, 27(1):113–118, 2021.
Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
Dong, X. and Yang, Y. Searching for a robust neural architecture in four gpu hours. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1761–1770, 2019.
Feng, W., He, X., Fu, T.-J., Jampani, V., Akula, A., Narayana, P., Basu, S., Wang, X. E., and Wang, W. Y. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032, 2022.
Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., and Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
Goldberg, D. E. Genetic Algorithms in Search, Optimization and Machine Learning 1st Edition. Addison-Wesley Professional, 1989. ISBN 978-0201157673.
Guo, C., Sablayrolles, A., Jégou, H., and Kiela, D. Gradient-based adversarial attacks against text transformers. arXiv preprint arXiv:2104.13733, 2021.
Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., and Yang, Y. Connecting large language models with evolutionary algorithms yields powerful prompt optimizers. arXiv preprint arXiv:2309.08532, 2023.


Hao, Y., Chi, Z., Dong, L., and Wei, F. Optimizing prompts for text-to-image generation. arXiv preprint arXiv:2212.09611, 2022.
Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
Ilyas, A., Engstrom, L., Athalye, A., and Lin, J. Black-box adversarial attacks with limited queries and information. In International conference on machine learning, pp. 2137–2146. PMLR, 2018.
Jang, E., Gu, S., and Poole, B. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
Lian, L., Li, B., Yala, A., and Darrell, T. Llm-grounded diffusion: Enhancing prompt understanding of text-to-image diffusion models with large language models. arXiv preprint arXiv:2305.13655, 2023.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer, 2014.
Liu, N., Li, S., Du, Y., Torralba, A., and Tenenbaum, J. B. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pp. 423–439. Springer, 2022.
Liu, V. and Chilton, L. B. Design guidelines for prompt engineering text-to-image generative models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pp. 1–23, 2022.
Mokady, R., Hertz, A., Aberman, K., Pritch, Y., and Cohen-Or, D. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6038–6047, 2023.
Nie, W., Guo, B., Huang, Y., Xiao, C., Vahdat, A., and Anandkumar, A. Diffusion models for adversarial purification. arXiv preprint arXiv:2205.07460, 2022.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
Pryzant, R., Iter, D., Li, J., Lee, Y. T., Zhu, C., and Zeng, M. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023.
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022.
Ronneberger, O., Fischer, P., and Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp. 234–241. Springer, 2015.
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E. L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. Ladder variational autoencoders. Advances in neural information processing systems, 29, 2016.
Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
Sutton, R. The bitter lesson. Incomplete Ideas (blog), 13(1):38, 2019.


Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
Wang, Z. J., Montoya, E., Munechika, D., Yang, H., Hoover, B., and Chau, D. H. Diffusiondb: A large-scale prompt gallery dataset for text-to-image generative models. arXiv preprint arXiv:2210.14896, 2022.
Watson, D., Chan, W., Ho, J., and Norouzi, M. Learning fast samplers for diffusion models by differentiating through sample quality. In International Conference on Learning Representations, 2021.
Wen, Y., Jain, N., Kirchenbauer, J., Goldblum, M., Geiping, J., and Goldstein, T. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. arXiv preprint arXiv:2302.03668, 2023.
Witteveen, S. and Andrews, M. Investigating prompt engineering in diffusion models. arXiv preprint arXiv:2211.15462, 2022.
Woolf, M. Lexica, 2022. URL https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/.
Wu, B., Dai, X., Zhang, P., Wang, Y., Sun, F., Wu, Y., Tian, Y., Vajda, P., Jia, Y., and Keutzer, K. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10734–10742, 2019.
Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., and Dong, Y. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024.
Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q. V., Zhou, D., and Chen, X. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023.
Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., Vasudevan, V., Ku, A., Yang, Y., Ayan, B. K., et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022.
Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan, H., and Ba, J. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910, 2022.


A. Limitations
We identify the following known limitations of the proposed method:

Search cost. Our method requires multiple passes through the diffusion model to optimize a given prompt, which incurs a modest search cost. One promising solution is to use DPO-Diff to generate free paired data for RLHF (e.g. Promptist), which we leave for future work to explore.

Text encoder. While DPO-Diff improves the faithfulness of the generated image, the performance is upper-bounded by the limitations of the underlying text encoder. For example, the clip text encoder used in Stable Diffusion tends to discard spatial relationships in text, which in principle must be resolved by improving the model itself, such as augmenting the diffusion model with a powerful LLM (Lian et al., 2023; Liu et al., 2022; Feng et al., 2022).

Clip loss. The clip loss used in DPO-Diff might not always align with human evaluation. Automatic scoring metrics that better reflect human judgment, similar to the reward models used in instruction fine-tuning, can further aid the discovery of improved prompts.

Synonyms generated by ChatGPT. For the adversarial attack task, ChatGPT sometimes generates incorrect synonyms. Although we use rejection sampling based on sentence embedding similarity as a post-hoc fix, it is not completely accurate. This may impact the validity of adversarial prompts, as by definition they must preserve the user’s original intent. We address this in human evaluation by asking the raters to consider this factor when determining the success of an attack.

B. Benefit of optimizing discrete text prompts over soft prompts


Optimizing discrete text prompts offers two major advantages over tuning soft prompts: (1) Interpretability: The results of discrete prompt optimization are texts that are naturally human interpretable. This also facilitates direct use in fine-tuning RLHF-based agents like Promptist. (2) Simplified Search Space: Our preliminary attempts with continuous text embeddings revealed challenges in achieving convergence, even on toy examples. We conjecture that the reason is that the gradients backpropagated through the denoising process have a low signal-to-noise ratio, and updating a soft prompt with such gradients can be very unstable due to its huge continuous search space. In contrast, discrete prompt optimization effectively narrows the search to a finite vocabulary set, greatly reducing search complexity and improving stability.

C. Derivation for the alternative interpretation of DDPM’s modeling.


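Proposition C.1. The original parameterization of DDPM at step $t-K$,

$$\mu_\theta(x_{t-K}, t-K) = \frac{1}{\sqrt{\alpha_{t-K}}}\left(x_{t-K} - \frac{\beta_{t-K}}{\sqrt{1-\bar{\alpha}_{t-K}}}\,\epsilon_\theta(x_{t-K}, t-K)\right),$$

can be viewed as first computing an estimate of $x_0$ from the current-step error $\hat{\epsilon}_\theta(x_{t-K}, t-K)$:

$$\hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_{t-K}}}\left(x_{t-K} - \sqrt{1-\bar{\alpha}_{t-K}}\,\hat{\epsilon}_\theta(x_{t-K}, t-K)\right)$$

and then using this estimate to compute the transition probability $q(x_{t-K-1} \mid x_{t-K}, \hat{x}_0)$.

Proof. To avoid cluttered notation, we use $t$ instead of $t-K$ in the proof below. Reorganizing (3) gives the one-step estimate:

$$\hat{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,\hat{\epsilon}_\theta(x_t, t)\right) \tag{8}$$

where $\hat{\epsilon}_\theta$ is the error predicted by the network at step $t$. Intuitively, this equation uses the current predicted error to estimate $x_0$ in one step. Using Bayes' theorem, one can show that

$$q(x_{t-1} \mid x_t, \hat{x}_0) = \mathcal{N}\left(x_{t-1};\ \tilde{\mu}(x_t, \hat{x}_0),\ \tilde{\beta}_t I\right) \tag{9}$$

$$\tilde{\mu}(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t \tag{10}$$

If we plug $\hat{x}_0$ into the above equation, it becomes:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) \tag{11}$$

which is identical to the original modeling of DDPM (Ho et al., 2020).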
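The substitution of (8) into (10) can be written out step by step. The following is a worked check rather than part of the original proof; it assumes the standard DDPM identities $\bar{\alpha}_t = \alpha_t \bar{\alpha}_{t-1}$ and $\beta_t = 1 - \alpha_t$, which this appendix does not restate, and abbreviates $\hat{\epsilon}_\theta(x_t, t)$ as $\hat{\epsilon}_\theta$.

```latex
\begin{aligned}
\tilde{\mu}(x_t, \hat{x}_0)
&= \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}
   \cdot \frac{1}{\sqrt{\bar{\alpha}_t}}
   \left(x_t - \sqrt{1-\bar{\alpha}_t}\,\hat{\epsilon}_\theta\right)
   + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t \\
% use \sqrt{\bar{\alpha}_{t-1}} / \sqrt{\bar{\alpha}_t} = 1 / \sqrt{\alpha_t}
&= \frac{\beta_t + \alpha_t\,(1-\bar{\alpha}_{t-1})}{\sqrt{\alpha_t}\,(1-\bar{\alpha}_t)}\,x_t
   - \frac{\beta_t}{\sqrt{\alpha_t}\,\sqrt{1-\bar{\alpha}_t}}\,\hat{\epsilon}_\theta \\
% use \beta_t + \alpha_t - \bar{\alpha}_t = 1 - \bar{\alpha}_t
&= \frac{1}{\sqrt{\alpha_t}}
   \left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\hat{\epsilon}_\theta\right)
\end{aligned}
```

The last line matches (11) term by term, which is the claim of the proposition.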


D. The complete DPO-Diff algorithm

Algorithm 1 DPO-Diff solver: Discrete Prompt Optimization Algorithm

Require: User input $s_{user}$, diffusion model $G(\cdot)$, a loss function $L(I, s)$, learning rate $lr$.
Ensure: An optimized prompt $s^*$.
  // Building Search Space
  Query ChatGPT to generate a word-substitutes dictionary for $s_{user}$.
  Initialize Gumbel parameter $\alpha$ accordingly.
  // Gradient Prompt Optimization
  for $i$ from 1 to max_iter do
    Sample $p(w; \alpha)$ for each word $w$ from Gumbel Softmax.
    Compute mixed embedding: $\tilde{e}(\alpha) = \sum_{j=1}^{|V|} p(w = j; \alpha) \cdot e_j$
    Compute text gradient: $g = \nabla_\alpha L(G(\tilde{e}(\alpha)), s_{user})$
    Update Gumbel parameter: $\alpha \leftarrow \alpha - lr \cdot g$
  end for
  // Evolutionary Sampling
  Generate initial population $P \sim \mathrm{Gumbel}(\alpha)$.
  Find the population that minimizes $L$ using a genetic algorithm: $P^* = \mathrm{EvoSearch}(P, L)$.
  $s^* = \arg\min_{s \in P^*} L(G(s), s_{user})$
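The sampling and mixing steps inside the Gradient Prompt Optimization loop of Algorithm 1 can be illustrated with a small sketch. This is not the authors' implementation: it uses plain Python lists in place of real text-encoder embeddings, and the names `gumbel_softmax`, `mixed_embedding`, `alpha`, and `candidates` are hypothetical. It only shows how a relaxed one-hot sample over candidate words yields a mixed embedding.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Draw one relaxed one-hot sample from a categorical with the given logits."""
    # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
    # U is clamped away from 0 so the inner log is always defined.
    noisy = [
        (a - math.log(-math.log(max(rng.random(), 1e-12)))) / temperature
        for a in logits
    ]
    # Numerically stable softmax over the perturbed logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_embedding(probs, embeddings):
    """Probability-weighted sum of the candidate word embeddings."""
    dim = len(embeddings[0])
    return [sum(p * e[d] for p, e in zip(probs, embeddings)) for d in range(dim)]


# Toy example: one word position with three candidate substitutes (2-D embeddings).
alpha = [1.0, 0.0, 0.0]  # hypothetical Gumbel parameters, original word favoured
candidates = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # made-up embeddings
probs = gumbel_softmax(alpha, temperature=1.0, rng=random.Random(0))
e_tilde = mixed_embedding(probs, candidates)
```

In the full method the mixed embedding would be fed to the diffusion model and the loss differentiated with respect to `alpha`; that step needs an automatic differentiation framework and is omitted here.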

E. Taxonomy of prompt optimization vs. textual inversion

Table 3: Comparison of prompt optimization and textual inversion tasks.

Task Name: Textual Inversion
  Example Method: TI (Gal et al., 2022), NTI (Mokady et al., 2023), PEZ (Wen et al., 2023)
  Taxonomy: Generate novel visual concepts provided in user images, done by distilling image to a soft text embedding and use that for downstream tasks
  Input: user image
  Output: a text prompt that encodes the given image content
  Backpropagation: identical to regular diffusion model training

Task Name: Prompt Optimization
  Example Method: Promptist (Hao et al., 2022), DPO-Diff (ours)
  Taxonomy: Improve the user prompt into a better one so that the generated images better follow the original user intention
  Input: user text prompt
  Output: An improved version of user text prompt
  Backpropagation: through inference steps

F. Implementation details
F.1. Hyperparameters
This section details the hyperparameter choices for our experiments. We use the same set of hyperparameters for all datasets
and tasks (prompt improvement and adversarial attack), unless otherwise specified.

Model We use Stable Diffusion v1-4 with a DDIM sampler for all experiments in the main paper. The guidance scale and inference steps are set to 7.5 and 50 respectively (default). We also experimented with other versions, such as Stable Diffusion v2-1 (512 x 512 resolution) and v2 (768 x 768 resolution), and found that the results are similar across different versions. However, we note that the high-resolution version of v2 tends to produce moderately better original images than v1-4 and v2-1 in terms of clip loss, possibly due to sharper images.

Shortcut Text Gradient We set K = 1, corresponding to a 1-step Shortcut Text Gradient. This minimizes the memory
and runtime cost while empirically producing enough signal to guide the prompt optimization. Throughout the entire
optimization episode, we progressively increase t from 15 to 25 via a fixed stepwise function. This corresponds to a
coarse-to-fine learning curriculum. We note that the performance is only marginally affected by the choice of the upper and
lower bound for t (e.g. 20-30, 10-40 all produce similar results), as long as it avoids values near 0 (diminishing gradient)
and T (excessively noisy).
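The coarse-to-fine curriculum over t can be sketched as a piecewise-constant schedule. The paper only states that t increases from 15 to 25 "via a fixed stepwise function", so the number of stages and the even spacing below are assumptions made for illustration, and `stepwise_t` is a hypothetical helper name.

```python
def stepwise_t(iteration, max_iter, t_start=15, t_end=25, num_stages=6):
    """Piecewise-constant schedule moving t from t_start to t_end.

    Assumes num_stages >= 2 and 0 <= iteration < max_iter. The stage count
    and equal stage widths are illustrative choices, not taken from the paper.
    """
    # Which stage this iteration falls into (0 .. num_stages - 1).
    stage = min(iteration * num_stages // max_iter, num_stages - 1)
    # Evenly spaced values between t_start and t_end, one per stage.
    return t_start + round(stage * (t_end - t_start) / (num_stages - 1))


# With 20 optimization iterations, t stays constant within each stage.
schedule = [stepwise_t(i, max_iter=20) for i in range(20)]
```

Changing `t_start` and `t_end` reproduces the alternative ranges mentioned above (e.g. 20 to 30).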


Gumbel softmax We use Gumbel Softmax with temperature 1. The learnable parameters are initialized to 1 for the original word (for positive prompts) and empty string (for negative prompts), and 0 otherwise. To encourage exploration, we bound the learnable parameters within 0 and 3 via hard clipping. The performance remains largely insensitive to the choice of bound, as long as it is in a reasonable range (i.e. not excessively small or large).

Optimization We optimize DPO-Diff using RMSprop with a learning rate of 0.1 and momentum of 0.5 for 20 iterations.
Each iteration will produce a single Gumbel Sample (batch size = 1) to compute the gradient, which will be clipped to 1/40.
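To make the update rule concrete, the sketch below combines the pieces described in this section: gradient clipping to 1/40, an RMSprop step with learning rate 0.1 and momentum 0.5, and hard clipping of the parameters to [0, 3]. Several details are assumptions: the paper does not say whether the gradient is clipped by norm or by value (norm is used here), the smoothing constant 0.99 and epsilon 1e-8 are common RMSprop defaults rather than reported values, and the class and function names are hypothetical.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm (assumed clipping mode)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


class RMSpropMomentum:
    """Minimal RMSprop with momentum over a flat list of parameters."""

    def __init__(self, lr=0.1, momentum=0.5, decay=0.99, eps=1e-8):
        self.lr, self.momentum, self.decay, self.eps = lr, momentum, decay, eps
        self.sq_avg = None  # running average of squared gradients
        self.buf = None     # momentum buffer

    def step(self, params, grad):
        if self.sq_avg is None:
            self.sq_avg = [0.0] * len(params)
            self.buf = [0.0] * len(params)
        updated = []
        for i, (p, g) in enumerate(zip(params, grad)):
            self.sq_avg[i] = self.decay * self.sq_avg[i] + (1 - self.decay) * g * g
            scaled = g / (math.sqrt(self.sq_avg[i]) + self.eps)
            self.buf[i] = self.momentum * self.buf[i] + scaled
            updated.append(p - self.lr * self.buf[i])
        return updated


# One illustrative update on made-up numbers.
alpha = [1.0, 0.0, 0.0]
optimizer = RMSpropMomentum()
grad = clip_by_norm([0.3, -0.2, 0.1], max_norm=1 / 40)
alpha = [min(max(a, 0.0), 3.0) for a in optimizer.step(alpha, grad)]  # bound to [0, 3]
```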

Clip loss The specific clip loss used in our experiment is spherical clip loss, following an early online implementation of clip-guided diffusion (Crumb, 2022):

$$\text{spherical\_clip}(x, y) = 2 \cdot \left(\arcsin\frac{\lVert x - y\rVert_2}{2}\right)^2$$

Note that our method does not rely on this specific choice to function. We also experimented with other distance measures such as cosine similarity on the clip embedding space, and found that they produced nearly identical prompts (and thus images).
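The spherical clip loss formula can be checked numerically with a few lines of code. The sketch below assumes that x and y are L2-normalized before the distance is taken, which is what makes the arcsin argument stay in [0, 1]; the paper does not state this explicitly, and `spherical_clip_loss` is a hypothetical name operating on plain lists instead of CLIP embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def spherical_clip_loss(x, y):
    """2 * arcsin(||x - y||_2 / 2)^2 on unit-normalized inputs."""
    x, y = normalize(x), normalize(y)
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    # Guard against rounding pushing dist / 2 marginally above 1.
    return 2.0 * math.asin(min(dist / 2.0, 1.0)) ** 2


same = spherical_clip_loss([1.0, 0.0], [1.0, 0.0])        # identical directions
orthogonal = spherical_clip_loss([1.0, 0.0], [0.0, 1.0])  # 90 degrees apart
opposite = spherical_clip_loss([1.0, 0.0], [-1.0, 0.0])   # 180 degrees apart
```

For unit vectors separated by an angle θ, the chord length is 2·sin(θ/2), so the loss equals θ²/2: it is 0 for identical directions, π²/8 for orthogonal ones, and π²/2 for opposite ones.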

Evolution Search We follow a traditional evolution search composed of four steps: initialize population, tournament, mutation, and crossover. The specific choice of hyperparameters is population size = 20, tournament = top 10, mutation with prob = 0.1 and size = 10, and crossover with size = 10. We run the evolutionary search for two iterations for both tasks, while we note that the prompt improvement task often converges much faster (within a single iteration).
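A minimal sketch of such an evolution search, using the hyperparameters listed above, is given below. It is an illustration under assumptions rather than the authors' code: a candidate prompt is encoded as a list of substitute indices (one per word position), the next population is taken to be the 10 mutated plus 10 crossover children, and `evo_search`, `toy_loss`, and the toy target are hypothetical. In the real setting each loss evaluation would require generating images, so caching evaluations would matter.

```python
import random


def evo_search(population, loss_fn, num_choices, iterations=2, top_k=10,
               n_mutate=10, n_cross=10, mutate_prob=0.1, rng=None):
    """Tournament selection, mutation, and crossover over index-encoded prompts."""
    rng = rng or random.Random(0)
    best = min(population, key=loss_fn)
    for _ in range(iterations):
        # Tournament: keep the top_k candidates with the lowest loss.
        parents = sorted(population, key=loss_fn)[:top_k]
        children = []
        # Mutation: resample each position with probability mutate_prob.
        for _ in range(n_mutate):
            child = list(rng.choice(parents))
            for pos in range(len(child)):
                if rng.random() < mutate_prob:
                    child[pos] = rng.randrange(num_choices[pos])
            children.append(child)
        # Crossover: pick each position from one of two distinct parents.
        for _ in range(n_cross):
            a, b = rng.sample(parents, 2)
            children.append([rng.choice(pair) for pair in zip(a, b)])
        population = children
        best = min(population + [best], key=loss_fn)
    return best


# Toy problem: 6 word positions, 3 substitutes each, loss = mismatches to a target.
rng = random.Random(0)
num_choices = [3] * 6
target = [2, 0, 1, 1, 0, 2]


def toy_loss(candidate):
    return sum(c != t for c, t in zip(candidate, target))


init = [[rng.randrange(n) for n in num_choices] for _ in range(20)]
best = evo_search(init, toy_loss, num_choices, rng=rng)
```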

F.2. Search space construction


We construct our Synonyms and Antonyms spaces by querying ChatGPT using the following prompts. Since ChatGPT sometimes makes mistakes by producing false synonyms or antonyms, we further filter candidate prompts by thresholding the cosine similarity between adversarial prompts and user prompts in the embedding space of T5 (Raffel et al., 2020) during the evolutionary search phase. The threshold is set to 0.9 for all datasets.

For Synonyms:

Read the next paragraph. For each word, give 5 substitution words that do not change the meaning. Use the format of "A → B".

For Antonyms:

Read the next paragraph. For each word, give 5 opposite words if it has any. Use the format of "A → B".
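The similarity filter described in this section amounts to a cosine-similarity threshold on sentence embeddings. The sketch below shows only that thresholding step on made-up 2-D vectors; obtaining real T5 sentence embeddings is out of scope here, and `filter_candidates` is a hypothetical helper name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def filter_candidates(user_emb, candidate_embs, threshold=0.9):
    """Return indices of candidates whose similarity to the user prompt exceeds threshold."""
    return [
        i for i, emb in enumerate(candidate_embs)
        if cosine_similarity(user_emb, emb) > threshold
    ]


# Made-up embeddings: only the first candidate stays close to the user prompt.
user_emb = [1.0, 0.0]
candidate_embs = [[1.0, 0.1], [0.0, 1.0], [0.9, 0.5]]
kept = filter_candidates(user_emb, candidate_embs)
```

Candidates that fall below the threshold would simply be rejected during the evolutionary search, so they never reach the image-generation step.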

G. More experimental settings


G.1. Dataset Collection
The prompts used in our paper are collected from three sources, DiffusionDB, COCO, and ChatGPT.

DiffusionDB DiffusionDB is a giant prompt database comprising 2 million highly diverse prompts for text-to-image generation. Since these prompts are web-crawled, they are highly noisy, often containing incomplete phrases, emojis, random characters, non-imagery prompts, etc. (we refer the reader to its HuggingFace repo for an overview of the entire database). Therefore, we filter prompts from DiffusionDB by (1) asking ChatGPT to determine whether the prompt is complete and describes an image, and (2) removing emoji-only prompts. We filter a total of 4,000 prompts from DiffusionDB and use those prompts to generate images via Stable Diffusion. We sample 100 prompts with clip loss above 0.85 for prompt improvement, and 0.8 for adversarial attacks respectively. For ChatGPT, we found that it tends to produce prompts with much lower clip loss compared with COCO and DiffusionDB. To ensure that a sufficient number of prompts from this source is included in the dataset, we lower the cutoff threshold to 0.82 when filtering its hard prompts for the prompt improvement task.

COCO We use the captions from the 2014 validation split of the MS-COCO dataset as prompts. Similar to DiffusionDB, we filter 4,000 prompts, and further sample 100 prompts with clip loss above 0.85 for prompt improvement, and 0.8 for adversarial attacks respectively.


ChatGPT We also query ChatGPT for descriptions, as we found that it tends to produce more vivid and poetic descriptions
compared with the former sources. We use a diverse set of instructions for this task. Below are a few example prompts we
used to query ChatGPT for image descriptions.

Generate N diverse sentences describing photoes/pictures/images


Generate N diverse sentences describing images with length around 10
Generate N diverse sentences describing images with length around 20
Generate N diverse sentences describing images using simple words
Generate N diverse sentences describing images using fancy words

Below are some example prompts returned by ChatGPT:

A majestic waterfall cascades down a rocky cliff into a clear pool below, surrounded by lush greenery.
The sun setting behind the mountains casting a warm orange glow over the tranquil lake.
A pair of bright red, shiny high heels sit on a glossy wooden floor, with a glittering disco ball above.
A farmer plowing a field with a tractor.
The vivid orange and dark monarch butterfly was flapping through the atmosphere, alighting on a
flower to sip nectar.

We empirically observe that, on average, ChatGPT prompts yield lower clip loss than DiffusionDB and COCO prompts when
used to generate images through Stable Diffusion. Therefore, when filtering challenging prompts, we reduce the
threshold from 0.85 to 0.82 to allow more prompts to be selected.
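The threshold-based filtering described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `select_hard_prompts`, the `seed` argument, and the assumption that clip losses are precomputed per prompt are all hypothetical. Only the thresholds (0.85 and 0.82) and the sample size of 100 come from the text.

```python
import random

# Thresholds quoted in the text for the prompt improvement task.
IMPROVEMENT_THRESHOLDS = {"DiffusionDB": 0.85, "COCO": 0.85, "ChatGPT": 0.82}


def select_hard_prompts(prompts, clip_losses, threshold, n_samples=100, seed=0):
    """Keep prompts whose clip loss exceeds `threshold`, then sample `n_samples`.

    `clip_losses[i]` is assumed to be the precomputed clip loss of the image
    generated from `prompts[i]` (hypothetical input format).
    """
    if len(prompts) != len(clip_losses):
        raise ValueError("prompts and clip_losses must have the same length")
    hard = [p for p, loss in zip(prompts, clip_losses) if loss > threshold]
    rng = random.Random(seed)
    # Sample without replacement; return all hard prompts if fewer than n_samples pass.
    return rng.sample(hard, min(n_samples, len(hard)))
```

Lowering the threshold for one source, as done for ChatGPT, only enlarges the pool of prompts that pass the filter; the sampling step is unchanged.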

G.2. Human Evaluation


We ask 5 judges without ML background to evaluate the faithfulness of the generated images. For each prompt, we generate
two images using the same seeds across different methods. To reduce subjectivity in the evaluation, we provide the
judges with an ordered list of important key concepts for each prompt, and ask them to find the winning prompt by comparing
the hit rate. The ordered list of key concepts is provided by ChatGPT.
Since the 600 prompts used in the main experiments are filtered automatically via clip loss, they exhibit a certain
false positive rate: some images are actually faithful. Therefore, we further select the 100 most broken prompts to be
evaluated by human judges.
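The hit-rate comparison can be illustrated with a short sketch. It assumes that a judge records which key concepts are visible in each image and that all concepts are weighted equally; the text says the concept list is ordered, but does not say how the order is used, so the ordering is ignored here. The function names and the example concepts are hypothetical.

```python
def hit_rate(key_concepts, concepts_found):
    """Fraction of key concepts that a judge marked as present in an image."""
    if not key_concepts:
        raise ValueError("key_concepts must be non-empty")
    found = set(concepts_found)
    return sum(concept in found for concept in key_concepts) / len(key_concepts)


def compare(key_concepts, found_a, found_b):
    """Return 'A', 'B', or 'draw' depending on which image has the higher hit rate."""
    rate_a = hit_rate(key_concepts, found_a)
    rate_b = hit_rate(key_concepts, found_b)
    if rate_a > rate_b:
        return "A"
    if rate_b > rate_a:
        return "B"
    return "draw"


# Hypothetical key concepts for the prompt "A farmer plowing a field with a tractor."
concepts = ["farmer", "tractor", "field"]
verdict = compare(concepts, found_a=["farmer", "field"], found_b=["farmer"])
```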

Special treatment for Adversarial Attack task. When conducting human evaluation on adversarial attack tasks, we
make the following adjustments to the protocol: (1) The wins and losses are reversed. (2) There is no "draw", as a draw
counts as a failed attempt. (3) Meaning-altering successes are removed: we asked the human evaluators to identify cases
where success is achieved only because the adversarial prompt changed the meaning of the user prompt. Such instances
are categorized as failures. The results of our evaluation show that DPO-Diff achieved a success rate of 44%, thereby
establishing itself as the only baseline for this particular task on diffusion models.
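The three adjustments can be summarized as a scoring rule. The sketch below reflects one reading of the protocol: an attack succeeds only when the judge finds the image from the user prompt more faithful than the image from the adversarial prompt, and the adversarial prompt did not change the meaning. The verdict labels and function names are hypothetical.

```python
def attack_succeeded(verdict, meaning_altered):
    """Score a single adversarial example.

    verdict: 'user', 'adversarial', or 'draw', naming the prompt whose image
             the judge found more faithful (hypothetical labels).
    meaning_altered: True if the adversarial prompt changed the meaning of
             the user prompt.
    """
    if meaning_altered:
        return False  # rule (3): meaning-altering successes count as failures
    # Rules (1) and (2): the attack wins only if the adversarial image loses;
    # a draw is a failed attempt.
    return verdict == "user"


def success_rate(records):
    """Fraction of (verdict, meaning_altered) records scored as successes."""
    if not records:
        raise ValueError("records must be non-empty")
    return sum(attack_succeeded(v, m) for v, m in records) / len(records)
```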

H. Extra qualitative results


We include extra qualitative results of DPO-Diff in Figure 7 and Figure 8. Additionally, we conducted experiments with the
latest SD-XL model, as illustrated in Figure 10. The results indicate that DPO-Diff also achieves significant improvements
with more advanced diffusion models.


Figure 7: More images generated by user input versus improved negative prompts using Stable Diffusion v1-4.

User Input: The ash and dark pigeon was roosting on the lamppost, observing the environment.
Promptist - Modifiers: intricate, elegant, highly detailed, ..., illustration, by justin gerard and artgerm, 8 k
DPO-Diff - Negative Prompt: fresh, shiny, hawk, overlooking, inside, Portrait, background, faded, unreal

User Input: alien caught smoking cigarettes in rented house
Promptist - Modifiers: intricate, elegant, highly detailed, ..., art by artgerm and greg rutkowski and, 8 k
DPO-Diff - Negative Prompt: native, liberated, clear, dull, out, bought, road, Macro, Script, monochrome, rendered

User Input: a spooky ghost in a graveyard by justin gerard and tony sart
Promptist - Modifiers: greg rutkowski, zabrocki, karlkka, ..., zenith view, zenith view, pincushion lens effect
DPO-Diff - Negative Prompt: physical, house, aside, except, Grains, design, replica

User Input: a plane flies through the air with fumes coming out the back
Promptist - Modifiers: Rephrase: a plane flies through the air with fumes coming ..., trending on artstation
DPO-Diff - Negative Prompt: car, crashes, land, ..., breeze, departing, into, front, Grains, cold, monochrome, oversized

User Input: A man is seated on a floor with a computer and some papers.
Promptist - Modifiers: intricate, elegant, highly detailed, ..., illustration, by justin gerard and artger rutkowski, 8 k
DPO-Diff - Negative Prompt: female, was, standing, below, top, without, zero, ..., emails, Blurry, bad, extra, proportion

User Input: Orange and brown cat sitting on top of white shoes.
Promptist - Modifiers: Trending on Artstation, ..., 4k, 8k, unreal 5, very detailed, hyper control-realism.
DPO-Diff - Negative Prompt: purple, however, black, crawling, ..., socks, Cropped, background, inverted, shape


Figure 8: More images generated by user input and adversarial prompts using Stable Diffusion v1-4.

User Input: A cinematic scene from Berlin.
DPO-Diff - Adversarial Prompt: A cinematic shot from Metropolis.

User Input: A painter adding the finishing touches to a vibrant canvas.
DPO-Diff - Adversarial Prompt: A craftsman incorporating the finishing touches to a vivid masterpiece.

User Input: A skillful tailor sewing a beautiful dress with intricate details.
DPO-Diff - Adversarial Prompt: A skillful tailor tailoring a lovely attire with sophisticated elements.

User Input: portrait of evil witch woman in front of sinister deep dark forest ambience
DPO-Diff - Adversarial Prompt: image of vile mage dame in front of threatening profound dim wilderness ambience

User Input: Amazing photorealistic digital concept art of a guardian robot in a rural setting by a barn.
DPO-Diff - Adversarial Prompt: astounding photorealistic digital theory design of a defender robot in a provincial context by a stable.

User Input: close up portrait of a young lizard as a wizard with an epic idea
DPO-Diff - Adversarial Prompt: close up snapshot of a youthful chameleon as a magician with an heroic guess


[Plots: two panels, each showing grad norm (averaged over all words) on the y-axis against inference timestep t (from 50 down to 1) on the x-axis.]

Figure 9: Gradients near the beginning and end of the inference process are significantly less informative. We plot the
average gradient norm over all words across different timesteps. For each timestep, the Shortcut Text Gradient is computed
over 100 Gumbel samples.

I. Further discussion on Gradient-based Prompt Optimization


The computational cost of the Shortcut Text Gradient is controlled by K. Moreover, when we set t = T and K = T − 1, it
becomes the full-text gradient.
The result of Remark 2 is rather straightforward: recall that the image generation process starts with random noise x_T and
gradually denoises it to the final image x_0. Since the gradient is enabled from t to t − K in the Shortcut Text Gradient,
setting t = T and K = T means that the gradient is enabled from T to 0, which covers the entire inference process. In this case,
the Shortcut Text Gradient reduces to the full gradient on text.

J. Extra ablation study results.


J.1. Gradient norm vs. timestep
When randomly sampling t in computing the Shortcut Text Gradient, we avoid timesteps near the beginning and the end of
the image generation process, as gradients at those places are not informative. Figure 9 shows the empirical gradient norm
across different timesteps. For both adversarial attack and prompt improvement, the gradient norm is substantially smaller
near t = T and especially t = 0, compared with timesteps in the middle. The reason, we conjecture, is that the images are
almost pure noise at the beginning, and are almost finalized towards the end.
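A minimal sketch of such restricted timestep sampling is given below. The text does not state how many timesteps are skipped at either end, so the parameters `skip_start` and `skip_end`, their example values, and the function name are assumptions made purely for illustration.

```python
import random


def sample_timestep(num_steps, skip_start, skip_end, rng):
    """Sample t uniformly from the middle of the inference process.

    num_steps:  T, the total number of inference timesteps.
    skip_start: number of timesteps excluded near t = T (almost pure noise).
    skip_end:   number of timesteps excluded near t = 0 (image almost final).
    """
    low = skip_end + 1
    high = num_steps - skip_start
    if low > high:
        raise ValueError("skip_start + skip_end leaves no timestep to sample")
    return rng.randint(low, high)  # both endpoints are inclusive


# Hypothetical setting: T = 50, skipping 5 steps near t = T and 10 near t = 0.
rng = random.Random(0)
t = sample_timestep(50, skip_start=5, skip_end=10, rng=rng)
```

Skipping more steps near t = 0 than near t = T in the example mirrors the observation that the gradient norm is especially small towards the end of generation.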

J.2. Extended discussion on different search algorithms


In our experiments, we found that Gradient-based Prompt Optimization (GPO) converges faster at the early stage of the
optimization. This result is consistent with the common belief that white-box algorithms are more query efficient than
black-box algorithms, as observed in several other machine learning fields such as adversarial attack (Ilyas et al., 2018;
Cheng et al., 2018). However, given a sufficient number of queries, Evolutionary Search eventually catches up and even
outperforms GPO. The reason, we conjecture, is that GPO uses random search to draw candidates from the learned distribution,
which bottlenecks its sample efficiency at later stages. This motivates the hybrid algorithm used in our experiments: using
Evolutionary Search to sample from the learned distribution of GPO. The hybrid algorithm achieves the best overall convergence.
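The idea of the hybrid algorithm can be sketched with a toy example. This is a generic evolutionary loop under stated assumptions, not the authors' implementation: the learned distribution is represented as one categorical distribution per word position, the population is initialized and mutated by sampling from it, and `loss_fn` stands in for the clip loss. All names, the crossover and mutation rules, and the hyperparameters are hypothetical.

```python
import random


def sample_prompt(dist, rng):
    """Draw one word per position from per-position categorical distributions."""
    return [rng.choices(list(d.keys()), weights=list(d.values()), k=1)[0] for d in dist]


def mutate(prompt, dist, rng):
    """Resample one random position from the learned distribution."""
    child = list(prompt)
    i = rng.randrange(len(child))
    child[i] = sample_prompt([dist[i]], rng)[0]
    return child


def crossover(parent_a, parent_b, rng):
    """Take each position from one of the two parents at random."""
    return [rng.choice(pair) for pair in zip(parent_a, parent_b)]


def hybrid_search(dist, loss_fn, pop_size=8, generations=5, seed=0):
    """Evolutionary search whose candidates come from the learned distribution."""
    if pop_size < 4:
        raise ValueError("pop_size must be at least 4 so that two parents survive")
    rng = random.Random(seed)
    population = [sample_prompt(dist, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=loss_fn)
        parents = population[: pop_size // 2]  # keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), dist, rng))
        population = parents + children
    return min(population, key=loss_fn)


# Toy usage: two word positions and a stand-in loss that counts "wrong" words.
toy_dist = [{"blurry": 0.6, "sharp": 0.4}, {"dull": 0.7, "vivid": 0.3}]
toy_loss = lambda prompt: sum(word not in {"blurry", "dull"} for word in prompt)
best = hybrid_search(toy_dist, toy_loss)
```

Because the better half of the population is carried over unchanged, the best loss in the population never increases from one generation to the next.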

J.3. Extended discussion on negative vs. positive prompt optimization


As discussed in the main text, one of our highlighted findings is that optimizing negative prompts is more effective
than optimizing positive prompts in improving the prompt-following ability of diffusion models. This is evidenced by Table 2,
which shows that Antonym Space contains a denser population of promising prompts (lower clip loss) than positive spaces. Such
a search space also allows the search algorithm to identify an improved prompt more easily. We conjecture that this might
indicate diffusion models are more sensitive to changes in negative prompts than positive prompts, as the baseline negative
prompt is merely an empty string.


Figure 10: Images generated by user input and improved negative prompts on Stable Diffusion XL.

User Input: a brown dachshund with a black cat sitting in a canoe.
Promptist - Modifiers: highly detailed, digital painting, ..., sharp focus, illustration, art by artgerm and greg rutkowski and epao
DPO-Diff - Negative Prompt: zero, black, cat, lacking, green, horse, walking, beyond, house, Mutation, animals, error, surreal

User Input: darth vader in iron man armour
Promptist - Modifiers: highly detailed, digital painting, ..., illustration, art by greg rutkowski and alphonse mucha
DPO-Diff - Negative Prompt: yoda, outside, lightweight, exposed, Render, Script, incomplete, pieces

User Input: The ash and dark pigeon was roosting on the lamppost, observing the environment.
Promptist - Modifiers: intricate, elegant, highly detailed, digital painting, artstation, concept art, sharp focus, illustration, by justin gerard and art rutkowski, 8 k
DPO-Diff - Negative Prompt: green, clear, departing, ditch, inner, Mistake, CGI, cooked, replica

User Input: a very big building with a mounted clock
Promptist - Modifiers: greg rutkowski, zabrocki, ..., 8 k, ultra wide angle, zenith view, pincushion lens effect
DPO-Diff - Negative Prompt: mildly, tiny, detached, Logo, cityscape, inverted, stale

User Input: The man is sitting on the bench close to the asian section.
Promptist - Modifiers: greg rutkowski, zabrocki, karlkka, ..., 8 k, ultra wide angle, zenith view, pincushion lens effect
DPO-Diff - Negative Prompt: girl, standing, under, ground, distant, unto, entirety, Mistake, black, engine, poorly

User Input: Two sinks stand next to a bathtub in a bathroom.
Promptist - Modifiers: greg rutkowski, zabrocki, karlkka, jayison devadas, trending impervious
DPO-Diff - Negative Prompt: one, soars, lie, multiple, kitchen, outside, bedroom, Blurry, artificial, down, poorly

User Input: A woman that is standing next to a man.
Promptist - Modifiers: highly detailed, digital painting, artstation, ..., art by greg rutkowski and alphonse mucha
DPO-Diff - Negative Prompt: male, crawling, away, far, several, woman, Mutation, characters, folded, username
