
GLIGEN: Open-Set Grounded Text-to-Image Generation

Yuheng Li1§, Haotian Liu1§, Qingyang Wu2, Fangzhou Mu1, Jianwei Yang3, Jianfeng Gao3, Chunyuan Li3¶, Yong Jae Lee1¶
1University of Wisconsin-Madison   2Columbia University   3Microsoft
https://gligen.github.io/
arXiv:2301.07093v2 [cs.CV] 17 Apr 2023

[Figure 1 panels: (a) Caption: "A woman sitting in a restaurant with a pizza in front of her"; grounded text: table, pizza, person, wall, car, paper, chair, window, bottle, cup. (b) Caption: "A dog / bird / helmet / backpack is on the grass"; grounded image: red inset. (c) Caption: "Elon Musk and Emma Watson on a movie poster"; grounded text: Elon Musk, Emma Watson; grounded style image: blue inset. (d) Caption: "a baby girl / monkey / Homer Simpson is scratching her/its head"; grounded keypoints: plotted dots. (e) Caption: "A vibrant colorful bird sitting on tree branch"; grounded depth map. (f) Caption: "A young boy with white powder on his face looks away"; grounded HED edge map. (g) Caption: "Cars park on the snowy street"; grounded normal map. (h) Caption: "A living room filled with lots of furniture and plants"; grounded semantic map.]

Figure 1. GLIGEN enables versatile grounding capabilities for a frozen text-to-image generation model, by feeding different grounding conditions. GLIGEN supports (a) text entity + box, (b) image entity + box, (c) image style and text + box, (d) keypoints, (e) depth map, (f) edge map, (g) normal map, and (h) semantic map.

Abstract

Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN, Grounded-Language-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin.

§ Part of the work performed at Microsoft; ¶ Co-senior authors

1. Introduction

Image generation research has witnessed huge advances in recent years. Over the past couple of years, GANs [14] were the state of the art, with their latent space and conditional inputs being well-studied for controllable manipulation [48, 60] and generation [27, 29, 47, 82]. Text-conditional autoregressive [52, 74] and diffusion [51, 56] models have demonstrated astonishing image quality and concept coverage, due to their more stable learning objectives and large-scale training on web image-text paired data. These models have gained attention even among the general public due to their practical use cases (e.g., art design and creation).

Despite exciting progress, existing large-scale text-to-image generation models cannot be conditioned on other input modalities apart from text, and thus lack the ability to precisely localize concepts, use reference images, or other conditional inputs to control the generation process. The current input, i.e., natural language alone, restricts the way that information can be expressed. For example, it is difficult to describe the precise location of an object using text, whereas bounding boxes / keypoints can easily achieve this, as shown in Figure 1. While conditional diffusion models [10, 53, 55] and GANs [26, 37, 48, 71] that take in input modalities other than text for inpainting, layout2img generation, etc., do exist, they rarely combine those inputs for controllable text2img generation.

Moreover, prior generative models, regardless of the generative model family, are usually independently trained on each task-specific dataset. In contrast, in the recognition field, the long-standing paradigm has been to build recognition models [32, 42, 84] by starting from a foundation model pretrained on large-scale image data [4, 16, 17] or image-text pairs [33, 50, 75]. Since diffusion models have been trained on billions of image-text pairs [53], a natural question is: Can we build upon existing pretrained diffusion models and endow them with new conditional input modalities? In this way, analogous to the recognition literature, we may be able to achieve better performance on other generation tasks due to the vast concept knowledge that the pretrained models have, while acquiring more controllability over existing text-to-image generation models.

With the above aims, we propose a method for providing new grounding conditional inputs to pretrained text-to-image diffusion models. As shown in Figure 1, we still retain the text caption as input, but also enable other input modalities such as bounding boxes for grounding concepts, grounding reference images, grounding part keypoints, etc. The key challenge is preserving the original vast concept knowledge in the pretrained model while learning to inject the new grounding information. To prevent knowledge forgetting, we propose to freeze the original model weights and add new trainable gated Transformer layers [67] that take in the new grounding input (e.g., bounding box). During training, we gradually fuse the new grounding information into the pretrained model using a gated mechanism [1]. This design enables flexibility in the sampling process during generation for improved quality and controllability; for example, we show that using the full model (all layers) in the first half of the sampling steps and only using the original layers (without the gated Transformer layers) in the latter half can lead to generation results that accurately reflect the grounding conditions while also having high image quality.

In our experiments, we primarily study grounded text2img generation with bounding boxes, inspired by the recent scaling success of learning grounded language-image understanding models with boxes in GLIP [34]. To enable our model to ground open-world vocabulary concepts [32, 34, 76, 79], we use the same pre-trained text encoder (for encoding the caption) to encode each phrase associated with each grounded entity (i.e., one phrase per bounding box) and feed the encoded tokens into the newly inserted layers with their encoded location information. Due to the shared text space, we find that our model can generalize to unseen objects even when only trained on the COCO [41] dataset. Its generalization on LVIS [15] outperforms a strong fully-supervised baseline by a large margin. To further improve our model's grounding ability, we unify the object detection and grounding data formats for training, following GLIP [34]. With larger training data, our model's generalization is consistently improved.

Contributions. 1) We propose a new text2img generation method that endows new grounding controllability over existing text2img diffusion models. 2) By preserving the pre-trained weights and learning to gradually integrate the new localization layers, our model achieves open-world grounded text2img generation with bounding box inputs, i.e., synthesis of novel localized concepts unobserved in training. 3) Our model's zero-shot performance on layout2img tasks significantly outperforms the prior state of the art, demonstrating the power of building upon large pretrained generative models for downstream tasks.

2. Related Work

Large scale text-to-image generation models. State-of-the-art models in this space are either autoregressive [13, 52, 69, 74] or diffusion [45, 51, 53, 56, 81]. Among autoregressive models, DALL-E [52] is one of the breakthrough works that demonstrates zero-shot abilities, while Parti [74] demonstrates the feasibility of scaling up autoregressive models. Diffusion models have also shown very promising results. DALL-E 2 [51] generates images from the CLIP [50] image space, while Imagen [56] finds the benefit of using pretrained language models. The concurrent Muse [6] demonstrates that masked modeling can achieve SoTA-level generation performance with higher inference speed. However, all of these models usually only take a caption as the input, which can be difficult for conveying other information such as the precise location of an object.

Make-A-Scene [13] also incorporates semantic maps into its text-to-image generation, by training an encoder to tokenize semantic masks to condition the generation. However, it can only operate in a closed set (of 158 categories), whereas our grounded entities can be open-world. A concurrent work, eDiff-I [3], shows that by changing the attention map, one can generate objects that roughly follow a semantic map input. However, we believe our interface with boxes is simpler, and more importantly, our method allows other conditioning inputs such as keypoints, edge maps, reference images, etc., which are hard to manipulate through attention.

Image generation from layouts. Given bounding boxes labeled with object categories, the task is to generate a corresponding image [24, 39, 61–63, 72, 78], which is the reverse task of object detection. Layout2Im [78] formulated the problem and combined a VAE object encoder, an LSTM [22] object fuser, and an image decoder to generate the image, using global and object-level adversarial losses [14] to enforce realism and layout correspondence. LostGAN [61, 62] generates a mask representation which is used to normalize features, taking inspiration from StyleGAN [28]. LAMA [39] improves the intermediate mask quality for better image quality. Transformer [66] based methods [24, 72] have also been explored. Critically, existing layout2image methods are closed-set, i.e., they can only generate the limited localized visual concepts observed in the training set, such as the 80 categories in COCO. In contrast, our method represents the first work for open-set grounded image generation. A concurrent work, ReCo [73], also demonstrates open-set abilities by building upon a pretrained Stable Diffusion model [53]. However, it finetunes the original model weights, which has the potential to lead to knowledge forgetting. Furthermore, it only demonstrates box grounding results, whereas we show results on more modalities, as shown in Figure 1.

Other conditional image generation. For GANs, various forms of conditioning information have been explored, e.g., text [65, 70, 80], boxes [61, 62, 78], semantic masks [36, 47], and images [8, 38, 83]. For diffusion models, LDM [53] proposes a unified approach for conditional generation by injecting the condition via cross-attention layers. Palette [55] performs image-to-image tasks using diffusion models. These models are usually trained from scratch independently. In our work, we investigate how to build upon existing models pretrained on large-scale web data, to enable new open-set grounded image generation capabilities in a cost-effective manner.

3. Preliminaries on Latent Diffusion Models

Diffusion-based methods are one of the most effective model families for text2image tasks, among which the latent diffusion model (LDM) [53] and its successor Stable Diffusion are the most powerful models publicly available to the research community. To reduce the computational cost of vanilla diffusion model training, LDM proceeds in two stages. The first stage learns a bidirectional mapping network to obtain the latent representation z of the image x. The second stage trains a diffusion model on the latent z. Since the first-stage model produces a fixed bidirectional mapping between x and z, from here on we focus on the latent generation space of LDM for simplicity.

Training Objective. Starting from noise z_T, the model gradually produces less noisy samples z_{T-1}, z_{T-2}, ..., z_0, conditioned on the caption c at every time step t. To learn such a model f_θ parameterized by θ, for each step, the LDM training objective solves the denoising problem on latent representations z of the image x:

$\min_{\theta} \mathcal{L}_{\text{LDM}} = \mathbb{E}_{z,\, \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),\, t} \big[ \| \epsilon - f_{\theta}(z_t, t, c) \|_2^2 \big]$   (1)

where t is uniformly sampled from time steps {1, ..., T}, z_t is the step-t noisy variant of the input z, and f_θ(∗, t, c) is the (t, c)-conditioned denoising autoencoder.
Network Architecture. The core of the network architecture is how to encode the conditions, based on which a cleaner version of z is produced. (i) Denoising Autoencoder. f_θ(∗, t, c) is implemented via a UNet [54]. It takes in a noisy latent z, as well as information from the time step t and the condition c. It consists of a series of ResNet [19] and Transformer [67] blocks. (ii) Condition Encoding. In the original LDM, a BERT-like [9] network is trained from scratch to encode each caption into a sequence of text embeddings, f_text(c), which is fed into (1) to replace c. In Stable Diffusion, the caption feature is instead encoded via a fixed CLIP [50] text encoder. Time t is first mapped to a time embedding ϕ(t), then injected into the UNet. The caption feature is used in a cross-attention layer within each Transformer block. The model learns to predict the noise, following (1).

With large-scale training, the model f_θ(∗, t, c) is well trained to denoise z based on the caption information only. Though impressive language-to-image generation results have been shown with LDM by pretraining on internet-scale data, it remains challenging to synthesize images where additional grounding input can be instructed, which is thus the focus of our paper.

4. Open-set Grounded Image Generation

4.1. Grounding Instruction Input

For grounded text-to-image generation, there are a variety of ways to ground the generation process via an additional condition. We denote the semantic information of the grounding entity as e, which can be described either through text or an example image, and the grounding spatial configuration as l, which can be described with, e.g., a bounding box, a set of keypoints, or an edge map. Note that in certain cases, both semantic and spatial information can be represented with l alone (e.g., an edge map), in which case a single map can represent what objects may be present in the image and where. We define the instruction to a grounded text-to-image model as a composition of the caption and the grounded entities:

Instruction: $y = (c, e)$, with   (2)
Caption: $c = [c_1, \cdots, c_L]$   (3)
Grounding: $e = [(e_1, l_1), \cdots, (e_N, l_N)]$   (4)

where L is the caption length, and N is the number of entities to ground. In this work, we primarily study using a bounding box as the grounding spatial configuration l, because of its large availability and easy annotation for users. For the grounded entity e, we mainly focus on using text as its representation due to its simplicity. We process both the caption and the grounding entities as input tokens to the diffusion model, as described in detail below.

[Figure 2 diagram: the grounded phrases "bride", "groom", and "wedding cake", each paired with a box, pass through the text encoder to form grounding tokens; the caption "a bride and groom are about to cut their wedding cake" passes through the same text encoder to form caption tokens.]

Figure 2. Illustration of the grounding token construction process for the bounding box with text case.

Caption Tokens. The caption c is processed in the same way as in LDM. Specifically, we obtain the caption feature sequence (yellow tokens in Figure 2) using h^c = [h^c_1, ..., h^c_L] = f_text(c), where h^c_ℓ is the contextualized text feature for the ℓ-th word in the caption.

Grounding Tokens. For each grounded text entity denoted with a bounding box, we represent the location information as l = [α_min, β_min, α_max, β_max] with its top-left and bottom-right coordinates. For the text entity e, we use the same pre-trained text encoder to obtain its text feature f_text(e) (light green token in Figure 2), and then fuse it with its bounding box information to produce a grounding token (dark green token in Figure 2):

$h^e = \text{MLP}(f_{\text{text}}(e), \text{Fourier}(l))$   (5)

where Fourier is the Fourier embedding [44], and MLP(·, ·) is a multi-layer perceptron that first concatenates the two inputs across the feature dimension. The grounding token sequence is represented as h^e = [h^e_1, ..., h^e_N].
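To make Eq. (5) concrete, the following is a minimal sketch of the grounding-token construction; the module names (`fourier_embed`, `GroundingTokenMLP`), hidden sizes, and the number of Fourier frequencies are illustrative assumptions rather than the paper's exact implementation.

```python
import math
import torch
import torch.nn as nn

def fourier_embed(x, num_freqs=8):
    """Fourier features of normalized coordinates in [0, 1]: shape (..., D) -> (..., D * 2 * num_freqs)."""
    freqs = 2.0 ** torch.arange(num_freqs, device=x.device) * math.pi
    ang = x.unsqueeze(-1) * freqs                                   # (..., D, num_freqs)
    return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(-2)    # (..., D * 2 * num_freqs)

class GroundingTokenMLP(nn.Module):
    """Fuses a phrase feature f_text(e) with Fourier(l) into one grounding token, as in Eq. (5)."""
    def __init__(self, text_dim=768, num_freqs=8, out_dim=768):
        super().__init__()
        in_dim = text_dim + 4 * 2 * num_freqs                       # box has 4 coordinates
        self.mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.SiLU(), nn.Linear(out_dim, out_dim))

    def forward(self, text_feat, boxes):
        # text_feat: (N, text_dim) phrase embeddings; boxes: (N, 4) as [xmin, ymin, xmax, ymax] in [0, 1]
        fused = torch.cat([text_feat, fourier_embed(boxes)], dim=-1)  # concatenate along the feature dimension
        return self.mlp(fused)                                        # (N, out_dim) grounding tokens h^e
```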
From Closed-set to Open-set. Note that existing layout2img works only deal with a closed-set setting (e.g., COCO categories), as they typically learn a vector embedding u per entity to replace f_text(e) in (5). For a closed-set setting with K concepts, a dictionary of K embeddings is learned, U = [u_1, ..., u_K]. While this non-parametric representation works well in the closed-set setting, it has two drawbacks: (1) the conditioning is implemented as a dictionary look-up over U in the evaluation stage, and thus the model can only ground the observed entities in the generated images, lacking the ability to generalize to ground new entities; (2) no word or phrase is ever utilized in the model condition, and the semantic structure [23] of the underlying language instruction is missing. In contrast, in our open-set design, since the noun entities are processed by the same text encoder that is used to encode the caption, we find that even when the localization information is limited to the concepts in the grounding training datasets, our model can still generalize to other concepts, as we will show in our experiments.

Extensions to Other Grounding Conditions. Note that the proposed grounding instruction in Eq (4) is in a general form, though our description thus far has focused on the case of using text as the entity e and a bounding box as l (the major setting of this paper). To demonstrate the flexibility of the GLIGEN framework, we also study additional representative cases which extend the use scenario of Eq (4).

• Image Prompt. While language allows users to describe a rich set of entities in an open-vocabulary manner, sometimes more abstract and fine-grained concepts can be better characterized by example images. To this end, one may describe the entity e using an image instead of language. We use an image encoder to obtain a feature f_image(e), which is used in place of f_text(e) in Eq (5) when e is an image.

• Keypoints. As a simple parameterization method to specify the spatial configuration of an entity, bounding boxes ease the user-machine interaction interface by providing only the height and width of the object layout. One may consider richer spatial configurations such as keypoints for GLIGEN, by parameterizing l in Eq (4) with a set of keypoint coordinates. Similar to encoding boxes, the Fourier embedding [44] can be applied to each keypoint location l = [x, y].

• Spatially-aligned conditions. To enable more fine-grained controllability, spatially-aligned condition maps can be used, such as an edge map, depth map, normal map, or semantic map. In these cases, the semantic information e is already contained within each spatial coordinate l of the condition map. A network (e.g., conv layers) can be used to encode l into h × w grounding tokens. We also notice that additionally feeding l into the first conv layer of the UNet can accelerate training; specifically, the input to the UNet is CONCAT(f_l(l), z_t), where f_l is a simple downsampling network that reduces l to the same spatial resolution as z_t (see the sketch below). In this case, the first conv layer of the UNet needs to be trainable.

Figure 1 shows generated examples for these other grounding conditions. Please refer to the supp for more details.
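The sketch below illustrates the spatially-aligned conditioning described in the last bullet above (encoding l and concatenating it with z_t at the UNet input); the channel counts and the number of downsampling layers are assumptions for illustration only, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SpatialConditionEncoder(nn.Module):
    """Downsamples a spatially-aligned condition map l (e.g., a depth or edge map) to the latent
    resolution so that it can be concatenated with z_t at the (now trainable) first conv layer."""
    def __init__(self, in_channels=3, out_channels=4, width=64, num_downsamples=3):
        super().__init__()
        layers, ch = [nn.Conv2d(in_channels, width, 3, padding=1), nn.SiLU()], width
        for _ in range(num_downsamples):                       # e.g., 512x512 -> 64x64 with three stride-2 convs
            layers += [nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.SiLU()]
        layers += [nn.Conv2d(ch, out_channels, 3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, cond_map, z_t):
        f_l = self.net(cond_map)               # f_l(l), same spatial size as z_t
        return torch.cat([f_l, z_t], dim=1)    # CONCAT(f_l(l), z_t), fed to the first conv layer of the UNet
```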

[Figure 3 diagram: a Transformer block with the original self-attention and cross-attention layers (frozen) and a new gated self-attention layer (gate γ initialized as 0) inserted between them, taking visual, caption, and grounding tokens.]

Figure 3. For a pretrained text2img model, the text features are fed into each cross-attention layer. A new gated self-attention layer is inserted to take in the new conditional information.

4.2. Continual Learning for Grounded Generation

Our goal is to endow new spatial grounding capabilities to existing large language-to-image generation models. Large diffusion models have been pre-trained on web-scale image-text data to gain the knowledge required for synthesizing realistic images based on diverse and complex language instructions. Due to the high pre-training cost and excellent performance, it is important to retain such knowledge in the model weights while expanding the new capability. Hence, we lock the original model weights and gradually adapt the model by tuning new modules.

Gated Self-Attention. We denote v = [v_1, ..., v_M] as the visual feature tokens of an image. The original Transformer block of LDM consists of two attention layers: self-attention over the visual tokens, followed by cross-attention from the caption tokens. Considering the residual connections, the two layers can be written as:

$v = v + \text{SelfAttn}(v)$   (6)
$v = v + \text{CrossAttn}(v, h^c)$   (7)

We freeze these two attention layers and add a new gated self-attention layer to enable the spatial grounding ability; see Figure 3. Specifically, the attention is performed over the concatenation of the visual and grounding tokens [v, h^e]:

$v = v + \beta \cdot \tanh(\gamma) \cdot \text{TS}(\text{SelfAttn}([v, h^e]))$   (8)

where TS(·) is a token selection operation that considers visual tokens only, and γ is a learnable scalar which is initialized as 0. β is set to 1 during the entire training process and is only varied for scheduled sampling during inference (introduced below) for improved quality and controllability. Note that (8) is injected between (6) and (7). Intuitively, the gated self-attention in (8) allows visual features to leverage the conditional information, and the resulting grounded features are treated as a residual whose gate is initially set to 0 (due to γ being initialized as 0). This also enables more stable training. Note that a similar idea is used in Flamingo [1]; however, it uses gated cross-attention, which leads to worse performance in our ablation study.
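A minimal sketch of the gated self-attention layer in Eq. (8) is shown below, assuming a standard multi-head attention module; the real layer additionally includes the projection and normalization details of the pretrained backbone, which are omitted here.

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """Sketch of Eq. (8): visual tokens attend over [visual; grounding] tokens,
    and the result is added back through a zero-initialized, tanh-gated residual."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gamma = nn.Parameter(torch.zeros(1))            # learnable gate gamma, initialized to 0

    def forward(self, v, h_e, beta=1.0):
        # v: (B, M, dim) visual tokens; h_e: (B, N, dim) grounding tokens
        x = self.norm(torch.cat([v, h_e], dim=1))            # [v, h^e]
        attn_out, _ = self.attn(x, x, x)                     # SelfAttn([v, h^e])
        ts = attn_out[:, : v.shape[1]]                       # TS(.): keep the visual tokens only
        return v + beta * torch.tanh(self.gamma) * ts        # gated residual; beta is varied only at inference
```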

Learning Procedure. We adapt the pre-trained model such that grounding information can be injected while all the original components remain intact. Denoting all the new parameters as θ′, including all gated self-attention layers in Eq (8) and the MLP in Eq (5), we use the original denoising objective as in (1) for continual learning of the model, based on the grounding instruction input y:

$\min_{\theta'} \mathcal{L}_{\text{Grounding}} = \mathbb{E}_{z,\, \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),\, t} \big[ \| \epsilon - f_{\{\theta, \theta'\}}(z_t, t, y) \|_2^2 \big]$   (9)

Why should the model try to use the new grounding information? Intuitively, predicting the noise that was added to a training image in the reverse diffusion process would be easier if the model could leverage the external knowledge (e.g., each object's location). Thus, in this way, the model learns to use the additional information while retaining the pre-trained concept knowledge.

Scheduled Sampling in Inference. The standard inference scheme of GLIGEN is to set β = 1 in (8), so that the entire diffusion process is influenced by the grounding tokens. This constant-β sampling scheme provides overall good performance in terms of both generation and grounding, but sometimes generates lower-quality images compared with the original text2img models (e.g., as Stable Diffusion is finetuned on images with high aesthetic scores). To strike a better trade-off between generation and grounding for GLIGEN, we propose a scheduled sampling scheme. As we freeze the original model weights and add new layers to inject the new grounding information during training, there is flexibility during inference to schedule the diffusion process to either use both the grounding and language tokens or use only the language tokens of the original model at any time, by setting different β values in (8). Specifically, we consider a two-stage inference procedure, divided by τ ∈ [0, 1]. For a diffusion process with T steps, one can set β to 1 for the first τ∗T steps, and set β to 0 for the remaining (1−τ)∗T steps:

$\beta = \begin{cases} 1, & t \le \tau \cdot T \quad \text{(grounded inference stage)} \\ 0, & t > \tau \cdot T \quad \text{(standard inference stage)} \end{cases}$   (10)

The major benefit of scheduled sampling is improved visual quality, as the rough concept location and outline are decided in the early stages, followed by fine-grained details in the later stages. It also allows us to extend a model trained in one domain (human keypoints) to other domains (monkeys, cartoon characters), as shown in Figure 1.
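The two-stage schedule of Eq. (10) amounts to a simple step function over the sampling loop; the sketch below assumes the step index t counts denoising steps from 1 to T, and `sampler_step` is a hypothetical helper, not an actual API.

```python
def beta_schedule(t, num_steps, tau=0.2):
    """Scheduled sampling of Eq. (10): grounded inference (beta = 1) for the first tau * T steps,
    then fall back to the original layers only (beta = 0) for the remaining steps."""
    return 1.0 if t <= tau * num_steps else 0.0

# Hypothetical usage inside a sampling loop (`model`, `z`, and `sampler_step` are placeholders):
# for t in range(1, T + 1):
#     beta = beta_schedule(t, T, tau=0.2)
#     z = sampler_step(model, z, t, grounding_tokens, beta=beta)
```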
5. Experiments

We evaluate our model's box-grounded text2img generation in both the closed-set and open-set settings, and show extensions to other grounding modalities. We conduct our main quantitative experiments by building upon an LDM pretrained on LAION [57], unless stated otherwise.

5.1. Closed-set Grounded Text2Img Generation

We first evaluate the generation quality and grounding accuracy of our model in a closed-set setting. For this, we train and evaluate on the COCO2014 [41] dataset, which is a standard benchmark used in the text2img literature [51, 56, 65, 70, 82], and evaluate how the different types of grounding instructions impact our model's performance.

Grounding instructions. We use the following grounding instructions to train our model: 1) COCO2014D: Detection Data. There are no caption annotations, so we use a null caption input [21]. Detection annotations are used as noun entities. 2) COCO2014CD: Detection + Caption Data. Both caption and detection annotations are used. Note that the noun entities may not always exist in the caption. 3) COCO2014G: Grounding Data. Given the caption annotations, we use GLIP [34], which detects the caption's noun entities in the image, to get pseudo box labels. Please refer to the supp for more details about these three types of data.

Baselines. Baseline models are listed in Table 1. Among them, we also finetune an LDM [53] pretrained on LAION-400M [57] on COCO2014 with its caption annotations, which we denote as LDM*. The text2img baselines, as they cannot be conditioned on box inputs, are evaluated on COCO2014C: Caption Data.

Evaluation metrics. We use the captions and/or box annotations from 30K randomly sampled images to generate 30K images for evaluation. We use FID [20] to evaluate image quality. To evaluate grounding accuracy (i.e., the correspondence between the input bounding box and the generated entity), we use the YOLO score [40]. Specifically, we use a pretrained YOLO-v4 [5] to detect bounding boxes on the generated images and compare them with the ground-truth boxes using average precision (AP). Since prior text2img methods do not support taking box annotations as input, it is not fair to compare with them on this metric. Thus, we only report numbers for the fine-tuned LDM as a reference.

| Model | FID, fine-tuned (↓) | FID, zero-shot (↓) | YOLO AP/AP50/AP75 (↑) |
|---|---|---|---|
| CogView [11] | - | 27.10 | - |
| KNN-Diffusion [2] | - | 16.66 | - |
| DALL-E 2 [51] | - | 10.39 | - |
| Imagen [56] | - | 7.27 | - |
| Re-Imagen [7] | 5.25 | 6.88 | - |
| Parti [74] | 3.20 | 7.23 | - |
| LAFITE [82] | 8.12 | 26.94 | - |
| LAFITE2 [80] | 4.28 | 8.42 | - |
| Make-a-Scene [13] | 7.55 | 11.84 | - |
| NÜWA [69] | 12.90 | - | - |
| Frido [12] | 11.24 | - | - |
| XMC-GAN [77] | 9.33 | - | - |
| AttnGAN [70] | 35.49 | - | - |
| DF-GAN [65] | 21.42 | - | - |
| Obj-GAN [35] | 20.75 | - | - |
| LDM [53] | - | 12.63 | - |
| LDM* | 5.91 | 11.73 | 0.6 / 2.0 / 0.3 |
| GLIGEN (COCO2014CD) | 5.82 | - | 21.7 / 39.0 / 21.7 |
| GLIGEN (COCO2014D) | 5.61 | - | 24.0 / 42.2 / 24.1 |
| GLIGEN (COCO2014G) | 6.38 | - | 11.2 / 21.2 / 10.7 |

Table 1. Evaluation of image quality and correspondence to layout on the COCO2014 val set. All numbers are taken from the corresponding papers; LDM* is our COCO fine-tuned LDM baseline. Here GLIGEN is built upon LDM.

Results. Table 1 shows the results. First, we see that the image synthesis quality of our approach, as measured by FID, is better than that of most state-of-the-art baselines due to the rich visual knowledge learned in the pretraining stage. Next, we find that all three grounding instructions lead to FID comparable to that of the LDM* baseline, which is finetuned on COCO2014 with caption annotations.

[Figure 4: generated examples for the captions "A blue jay is standing on a branch in the woods near us", "a croissant is placed in a brown wooden table", and "a hello kitty is holding a laundry basket", each with grounded boxes.]

Figure 4. Our model can generalize to open-world concepts even when only trained using localization annotations from COCO.

Our model trained using detection annotation instructions (COCO2014D) has the overall best performance. However, when we evaluate this model on COCO2014CD instructions, we find that it has worse performance (FID: 8.2); its ability to understand real captions may be limited, as it is only trained with the null caption. For the model trained with GLIP grounding instructions (COCO2014G), we actually evaluate it using the COCO2014CD instructions, since we need to compute the YOLO score, which requires ground-truth detection annotations. Its slightly worse FID may be attributed to its learning from GLIP pseudo-labels. The same reason can explain its low YOLO score (i.e., the model did not see any ground-truth detection annotations during training).

Overall, this experiment shows that: 1) our model can successfully take in boxes as an additional condition while maintaining image generation quality; 2) all grounding instruction types are useful, which suggests that combining their data together can lead to complementary benefits.

Comparison to Layout2Img generation methods. Thus far, we have seen that our model correctly learns to use the grounding condition. But how accurate is it compared to methods that are specifically designed for layout2img generation? To answer this, we train our model on COCO2017D, which only has detection annotations. We use the 2017 splits (instead of 2014 as before), as it is the standard benchmark in the layout2img literature. In this experiment, we use the exact same annotations as all layout2img baselines.

| Model | FID (↓) | YOLO score (AP/AP50/AP75) (↑) |
|---|---|---|
| LostGAN-V2 [62] | 42.55 | 9.1 / 15.3 / 9.8 |
| OCGAN [64] | 41.65 | - |
| HCSS [25] | 33.68 | - |
| LAMA [40] | 31.12 | 13.40 / 19.70 / 14.90 |
| TwFA [71] | 22.15 | - / 28.20 / 20.12 |
| GLIGEN-LDM | 21.04 | 22.4 / 36.5 / 24.1 |

Table 2. Image quality and correspondence to layout compared with baselines on the COCO2017 val set.

Table 2 shows that we achieve state-of-the-art performance for both image quality and grounding accuracy. We believe the core reason is that previous methods train their models from scratch, whereas we build upon a large-scale pretrained generative model with rich visual semantics. Qualitative comparisons are in the supp. We also scale up our training data (discussed later) and pretrain a model on this dataset. Figure 5 (left) shows this model's zero-shot and finetuned results.

| Model | Training data | AP | APr | APc | APf |
|---|---|---|---|---|---|
| LAMA [40] | LVIS | 2.0 | 0.9 | 1.3 | 3.2 |
| GLIGEN-LDM | COCO2014CD | 6.4 | 5.8 | 5.8 | 7.4 |
| GLIGEN-LDM | COCO2014D | 4.4 | 2.3 | 3.3 | 6.5 |
| GLIGEN-LDM | COCO2014G | 6.0 | 4.4 | 6.1 | 6.6 |
| GLIGEN-LDM | GoldG, O365 | 10.6 | 5.8 | 9.6 | 13.8 |
| GLIGEN-LDM | GoldG, O365, SBU, CC3M | 11.1 | 9.0 | 9.8 | 13.4 |
| GLIGEN-Stable | GoldG, O365, SBU, CC3M | 10.8 | 8.8 | 9.9 | 12.6 |
| Upper bound | - | 25.2 | 19.0 | 22.2 | 31.2 |

Table 3. GLIP score on the LVIS validation set. The upper bound is provided by running GLIP on real images scaled to 256 × 256.

[Figure 5 plots: FID vs. AP (YOLO) on COCO2017 (left) and FID vs. AP (GLIP) on LVIS (right), comparing GLIGEN (zero-shot, fine-tuned, reference) against LostGAN-V2, LAMA, and LDM.]

Figure 5. Performance comparison measured by image generation and grounding quality on the COCO2017 (left) and LVIS (right) datasets. GLIGEN is built upon LDM, and continually pre-trained on the joint data of GoldG, O365, SBU, and CC3M. GLIGEN (Reference) is pre-trained on COCO/LVIS only. The circle size indicates the model size.

5.2. Open-set Grounded Text2Img Generation

COCO-trained model. We first take GLIGEN trained only with the grounding annotations of COCO (COCO2014CD), and evaluate whether it can generate grounded entities beyond the COCO categories. Figure 4 shows qualitative results, where GLIGEN can ground new concepts such as "blue jay" and "croissant", or ground object attributes such as "brown wooden table", beyond the training categories. We hypothesize this is because the gated self-attention of GLIGEN learns to re-position the visual features corresponding to the grounding entities in the caption for the ensuing cross-attention layer, and gains generalization ability due to the shared text space in these two layers.

We also quantitatively evaluate our model's zero-shot generation performance on LVIS [15], which contains 1203 long-tail object categories. We use GLIP to predict bounding boxes from the generated images and calculate AP; thus, we name it the GLIP score. We compare to a state-of-the-art model designed for the layout2img task: LAMA [40].

[Figure 6: each row shows three GLIGEN samples and one Stable Diffusion sample for the captions "Michael Jackson in a black cloth is singing into a microphone" (grounded text: Michael Jackson, black cloth, microphone), "golden hour, a pekingese is on the beach with an umbrella" (grounded text: pekingese, umbrella, sea), "a hen is hatching a huge egg" (grounded text: hen, egg), and "an apple and a same size dog" (grounded text: apple, dog).]

Figure 6. Grounded text2image generation. The baseline lacks grounding ability and can also miss objects, e.g., "umbrella" in a sentence with multiple objects, due to the CLIP text space; it also struggles to generate spatially counterfactual concepts.

We train LAMA using the official code on the LVIS training set (in a fully-supervised setting), whereas we directly evaluate our model in a zero-shot task-transfer manner, by running inference on the LVIS val set without seeing any LVIS labels. Table 3 (first 4 rows) shows the results. Surprisingly, even though our model is only trained on COCO annotations, it outperforms the supervised baseline by a large margin. This is because the baseline, which is trained from scratch, struggles to learn from limited annotations (many of the rare classes in LVIS have fewer than five training samples). In contrast, our model can take advantage of the pretrained model's vast concept knowledge.

Scaling up the training data. We next study our model's open-set capability with much larger training data. Specifically, we follow GLIP [34] and train on Objects365 [58] and GoldG [34], which combines two grounding datasets: Flickr [49] and VG [31]. We also use CC3M [59] and SBU [46] with grounding pseudo-labels generated by GLIP. Table 3 shows the data scaling results. As we scale up the training data, our model's zero-shot performance increases, especially for rare concepts. We also finetune the model pretrained on our largest dataset on LVIS and demonstrate its performance in Figure 5 (right). To demonstrate the generality of our method, we also train our model based on the Stable Diffusion model checkpoint using the largest data. We show some qualitative examples from this model in Figure 6. Our model gains grounding ability compared to vanilla Stable Diffusion. We notice that the Stable Diffusion model may overlook certain objects ("umbrella" in the second example) due to its use of the CLIP text encoder, which tends to focus on global scene properties and may ignore object-level details [3]. It also struggles to generate spatially counterfactual concepts. By explicitly injecting entity information through grounding tokens, our model improves grounding ability in two ways: the referred objects are more likely to appear in the generated images, and the objects reside in the specified spatial locations.

5.3. Beyond Text Modality Grounding

Image grounded generation. One can also use a reference image to represent a grounded entity, as discussed previously. Fig. 1 (b) shows qualitative results, which demonstrate that the visual feature can complement details that are hard to describe by language.

Text and image grounded generation. Besides using either text or an image to represent a grounded entity, one can also keep both representations in one model for more creative generation. Fig. 1 (c) shows text grounded generation with style/tone transfer. For the style reference image, we find that grounding it to an image corner or its edge is sufficient. Since the model needs to generate a harmonious style for the entire image, we hypothesize that the self-attention layers may broadcast this information to all pixels, thus leading to a consistent style for the entire image.

Keypoints grounded generation. We also demonstrate GLIGEN using keypoints for articulated object control, as shown in Fig. 1 (d). Note that this model is only trained with human keypoint annotations, but it can generalize to other humanoid objects due to the scheduled sampling technique we proposed. We also quantitatively study this grounding condition in the supp.

Spatially-aligned condition map grounded generation. Fig. 1 (e-h) demonstrates results for depth map, edge map, normal map, and semantic map grounded generation. These types of conditions allow users to have more fine-grained generation control. See the supp for more qualitative results.

[Figure 7: samples with τ = 1 vs. τ = 0.2 for the caption "a cute low poly Shiba Inu" (grounded text: Shiba Inu) and the caption "a robot is sitting on a bench" (grounded keypoints: plotted dots).]

Figure 7. Scheduled sampling. It can improve visual quality or extend a model trained in one domain (e.g., human) to others.

5.4. Scheduled Sampling

As stated in Eq. (8) and Eq. (10), we can schedule inference-time sampling by setting β to 1 (use the extra grounding information) or 0 (reduce to the original pretrained diffusion model). This lets our model exploit different knowledge at different stages.

Fig. 7 qualitatively shows the benefits of our scheduled sampling by setting τ to 0.2. The images in the same row share the same noise and conditional input. The first row shows that scheduled sampling can be used to improve image quality, as the original Stable Diffusion model is trained with high-quality images. The second row shows a generation example from our model trained with COCO human keypoint annotations. Since this model is purely trained with human keypoints, the final result is biased towards generating a human even if a different object (i.e., a robot) is specified in the caption. However, by using scheduled sampling, we can extend this model to generate other objects with a human-like shape.

6. Conclusion

We proposed GLIGEN for expanding pretrained text2img diffusion models with grounding ability, and demonstrated open-world generalization using bounding boxes as the grounding condition. Our method is simple and effective, and can be easily extended to other conditions such as keypoints, reference images, and spatially-aligned conditions (e.g., edge map, depth map, etc.). The versatility of GLIGEN makes it a promising direction for advancing the field of text-to-image synthesis and expanding the capabilities of pretrained models in various applications.

Acknowledgement. This work was supported in part by NSF CAREER IIS2150012, NASA 80NSSC21K0295, Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. 2022-0-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration; No. RS-2022-00187238, Development of Large Korean Language Model Technology for Efficient Pre-training), and an Adobe Data Science Research Award.

References

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning. ArXiv, abs/2204.14198, 2022. 2, 5, 14
[2] Oron Ashual, Shelly Sheynin, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, and Yaniv Taigman. Knn-diffusion: Image generation via large-scale retrieval. arXiv preprint arXiv:2204.02849, 2022. 6
[3] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, and Ming-Yu Liu. ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. ArXiv, abs/2211.01324, 2022. 3, 8

[4] Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021. 2
[5] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. ArXiv, abs/2004.10934, 2020. 6
[6] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023. 2
[7] Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W Cohen. Re-imagen: Retrieval-augmented text-to-image generator. arXiv preprint arXiv:2209.14491, 2022. 6
[8] Yunjey Choi, Min-Je Choi, Mun Su Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018. 3
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. 3
[10] Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. ArXiv, abs/2105.05233, 2021. 2
[11] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. Cogview: Mastering text-to-image generation via transformers, 2021. 6
[12] Wanshu Fan, Yen-Chun Chen, Dongdong Chen, Yu Cheng, Lu Yuan, and Yu-Chiang Frank Wang. Frido: Feature pyramid diffusion for complex scene image synthesis. ArXiv, abs/2208.13753, 2022. 6
[13] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. ArXiv, abs/2203.13131, 2022. 2, 3, 6
[14] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014. 2, 3
[15] Agrim Gupta, Piotr Dollár, and Ross B. Girshick. Lvis: A dataset for large vocabulary instance segmentation. CVPR, pages 5351–5359, 2019. 2, 7, 16
[16] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022. 2
[17] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020. 2
[18] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask r-cnn. 2017 IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017. 15
[19] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR, pages 770–778, 2016. 3
[20] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, 2017. 6
[21] Jonathan Ho. Classifier-free diffusion guidance. ArXiv, abs/2207.12598, 2022. 6, 14
[22] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997. 3
[23] Ray S Jackendoff. Semantic structures, volume 18. MIT Press, 1992. 4
[24] Manuel Jahn, Robin Rombach, and Björn Ommer. High-resolution complex scene synthesis with transformers. ArXiv, abs/2105.06458, 2021. 3
[25] Manuel Jahn, Robin Rombach, and Björn Ommer. High-resolution complex scene synthesis with transformers. ArXiv, abs/2105.06458, 2021. 7, 16
[26] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1219–1228, 2018. 2
[27] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. CVPR, pages 4396–4405, 2019. 2
[28] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. CVPR, pages 4396–4405, 2019. 3
[29] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8107–8116, 2020. 2
[30] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2015. 14
[31] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73, 2016. 8
[32] Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Houdong Hu, Zicheng Liu, Yong Jae Lee, and Jianfeng Gao. ELEVATER: A benchmark and toolkit for evaluating language-augmented visual models. In NeurIPS Track on Datasets and Benchmarks, 2022. 2
[33] Junnan Li, Ramprasaath R Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, and Steven Hoi. Align before fuse: Vision and language representation learning with momentum distillation. arXiv preprint arXiv:2107.07651, 2021. 2

[34] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded language-image pre-training. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 10955–10965. IEEE, 2022. 2, 6, 8
[35] Wenbo Li, Pengchuan Zhang, Lei Zhang, Qiuyuan Huang, Xiaodong He, Siwei Lyu, and Jianfeng Gao. Object-driven text-to-image synthesis via adversarial training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12174–12182, 2019. 6
[36] Yuheng Li, Yijun Li, Jingwan Lu, Eli Shechtman, Yong Jae Lee, and Krishna Kumar Singh. Collaging class-specific gans for semantic image synthesis. ICCV, pages 14398–14407, 2021. 3
[37] Yuheng Li, Yijun Li, Jingwan Lu, Eli Shechtman, Yong Jae Lee, and Krishna Kumar Singh. Contrastive learning for diverse disentangled foreground generation. ArXiv, abs/2211.02707, 2022. 2
[38] Yuheng Li, Krishna Kumar Singh, Utkarsh Ojha, and Yong Jae Lee. Mixnmatch: Multifactor disentanglement and encoding for conditional image generation. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8036–8045, 2020. 3
[39] Z. Li, Jingyu Wu, Immanuel Koh, Yongchuan Tang, and Lingyun Sun. Image synthesis from layout with locality-aware mask adaption. ICCV, pages 13799–13808, 2021. 3
[40] Z. Li, Jingyu Wu, Immanuel Koh, Yongchuan Tang, and Lingyun Sun. Image synthesis from layout with locality-aware mask adaption. ICCV, pages 13799–13808, 2021. 6, 7, 16
[41] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014. 2, 6, 14, 17
[42] Haotian Liu, Kilho Son, Jianwei Yang, Ce Liu, Jianfeng Gao, Yong Jae Lee, and Chunyuan Li. Learning customized visual models with retrieval-augmented knowledge. CVPR, 2023. 2
[43] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 13
[44] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020. 4, 13
[45] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2022. 2, 15
[46] Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2text: Describing images using 1 million captioned photographs. In NIPS, 2011. 8
[47] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. CVPR, pages 2332–2341, 2019. 2, 3
[48] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. CVPR, pages 2536–2544, 2016. 2
[49] Bryan A. Plummer, Liwei Wang, Christopher M. Cervantes, Juan C. Caicedo, J. Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. International Journal of Computer Vision, 123:74–93, 2015. 8
[50] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021. 2, 3
[51] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. ArXiv, abs/2204.06125, 2022. 2, 6
[52] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8821–8831. PMLR, 18–24 Jul 2021. 2
[53] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. CVPR, pages 10674–10685, 2022. 2, 3, 6, 13, 14, 15
[54] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), volume 9351 of LNCS, pages 234–241. Springer, 2015. (available on arXiv:1505.04597 [cs.CV]). 3, 13
[55] Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. ACM SIGGRAPH 2022 Conference Proceedings, 2022. 2, 3
[56] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, Seyedeh Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. ArXiv, abs/2205.11487, 2022. 2, 6
[57] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: open dataset of clip-filtered 400 million image-text pairs. CoRR, abs/2111.02114, 2021. 6
[58] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. ICCV, pages 8429–8438, 2019. 8
[59] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018. 8

[60] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of gans for semantic face editing. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9240–9249, 2020. 2
[61] Wei Sun and Tianfu Wu. Image synthesis from reconfigurable layout and style. ICCV, pages 10530–10539, 2019. 3
[62] Wei Sun and Tianfu Wu. Learning layout and style reconfigurable gans for controllable image synthesis. TPAMI, 44:5070–5087, 2022. 3, 7, 16
[63] Tristan Sylvain, Pengchuan Zhang, Yoshua Bengio, R. Devon Hjelm, and Shikhar Sharma. Object-centric image generation from layouts. ArXiv, abs/2003.07449, 2021. 3
[64] Tristan Sylvain, Pengchuan Zhang, Yoshua Bengio, R. Devon Hjelm, and Shikhar Sharma. Object-centric image generation from layouts. ArXiv, abs/2003.07449, 2021. 7, 16
[65] Ming Tao, Hao Tang, Songsong Wu, N. Sebe, Fei Wu, and Xiaoyuan Jing. Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis. ArXiv, abs/2008.05865, 2020. 3, 6
[66] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. 3
[67] Ashish Vaswani, Noam M. Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. ArXiv, abs/1706.03762, 2017. 2, 3
[68] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8798–8807, 2018. 15, 16
[69] Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. Nüwa: Visual synthesis pre-training for neural visual world creation. In European Conference on Computer Vision, 2022. 2, 6
[70] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1316–1324, 2018. 3, 6
[71] Zuopeng Yang, Daqing Liu, Chaoyue Wang, J. Yang, and Dacheng Tao. Modeling image composition for complex scene generation. CVPR, pages 7754–7763, 2022. 2, 7, 15, 16
[72] Zuopeng Yang, Daqing Liu, Chaoyue Wang, J. Yang, and Dacheng Tao. Modeling image composition for complex scene generation. CVPR, pages 7754–7763, 2022. 3
[73] Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Reco: Region-controlled text-to-image generation. ArXiv, abs/2211.15518, 2022. 3
[74] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Benton C. Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation. ArXiv, abs/2206.10789, 2022. 2, 6
[75] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021. 2
[76] Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and Shih-Fu Chang. Open-vocabulary object detection using captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14393–14402, 2021. 2
[77] Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-modal contrastive learning for text-to-image generation, 2021. 6
[78] Bo Zhao, Lili Meng, Weidong Yin, and Leonid Sigal. Image generation from layout. CVPR, pages 8576–8585, 2019. 3
[79] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al. RegionCLIP: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16793–16803, 2022. 2
[80] Yufan Zhou, Chunyuan Li, Changyou Chen, Jianfeng Gao, and Jinhui Xu. Lafite2: Few-shot text-to-image generation. arXiv preprint arXiv:2210.14124, 2022. 3, 6
[81] Yufan Zhou, Bingchen Liu, Yizhe Zhu, Xiao Yang, Changyou Chen, and Jinhui Xu. Shifted diffusion for text-to-image generation. arXiv preprint arXiv:2211.15388, 2022. 2
[82] Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. Towards language-free training for text-to-image generation. CVPR, 2022. 2, 6
[83] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017. 3
[84] Xueyan Zou*, Zi-Yi Dou*, Jianwei Yang*, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, and Jianfeng Gao. Generalized decoding for pixel, image and language. arXiv, 2022. 2
Appendix
In this supplemental material, we provide more implementation and training details, and then present more results and discussions.

A. Implementation and training details

We use the Stable Diffusion model [53] as the example to illustrate our implementation details.
Box Grounding Tokens with Text. Each grounded text is first fed into the text encoder to get a text embedding (e.g., the 768-dimensional CLIP text embedding in Stable Diffusion). Since Stable Diffusion uses the features of the 77 text tokens output by the transformer backbone, we choose the "EOS" token feature at this layer as our grounded text embedding. This is because in CLIP training, the "EOS" token feature is the one selected and passed through a linear transform (one FC layer) to be compared with the visual feature, so this token feature should contain the complete information about the input text description. We also tried to directly use the CLIP text embedding (after linear projection); however, we empirically observed slow convergence, probably due to the unaligned space between the grounded text embedding and the caption embeddings. Following NeRF [44], we encode the bounding box coordinates with a Fourier embedding of output dimension 64. As stated in Eq 5 of the main paper, we first concatenate these two features and feed them into a multi-layer perceptron (MLP). The MLP consists of three hidden layers with hidden dimension 512, and the output grounding token dimension is set to be the same as the text embedding dimension (e.g., 768 in the Stable Diffusion case). We set the maximum number of grounding tokens to 30 in the bounding box case.
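As a concrete illustration, below is a minimal PyTorch-style sketch of this text + box grounding-token construction (Fourier box embedding concatenated with the CLIP text feature and mapped through an MLP). The class and variable names are ours, and the exact layer composition is an assumption rather than the released implementation.

```python
import torch
import torch.nn as nn

class FourierEmbedder:
    """NeRF-style sinusoidal embedding of normalized coordinates."""
    def __init__(self, num_freqs=8, temperature=100):
        self.freq_bands = temperature ** (torch.arange(num_freqs) / num_freqs)

    def __call__(self, x):                                   # x: (..., D) coordinates in [0, 1]
        out = x.unsqueeze(-1) * self.freq_bands.to(x.device)  # (..., D, num_freqs)
        out = torch.cat([out.sin(), out.cos()], dim=-1)       # (..., D, 2 * num_freqs)
        return out.flatten(start_dim=-2)                      # (..., D * 2 * num_freqs)

class BoxTextGroundingTokens(nn.Module):
    """Sketch: map (CLIP text feature, box) pairs to grounding tokens."""
    def __init__(self, text_dim=768, fourier_dim=64, hidden=512, out_dim=768):
        super().__init__()
        # 4 box coords * 8 frequencies * (sin, cos) = 64 Fourier dims
        self.fourier = FourierEmbedder(num_freqs=fourier_dim // (4 * 2))
        self.mlp = nn.Sequential(                             # three hidden layers of width 512
            nn.Linear(text_dim + fourier_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, text_feat, boxes):   # text_feat: (B, 30, 768), boxes: (B, 30, 4) xyxy
        box_emb = self.fourier(boxes)                          # (B, 30, 64)
        return self.mlp(torch.cat([text_feat, box_emb], dim=-1))  # (B, 30, 768)
```

Padding slots (when an image has fewer than 30 grounded entities) would simply carry zeroed features in this sketch.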
Box Grounding Tokens with Image. We obtain the grounding token for an image in a similar way. We use the CLIP image encoder (ViT-L/14 for Stable Diffusion) to get an image embedding. We denote the CLIP training objective as maximizing (P_t h_t)^⊤ (P_i h_i) (normalization omitted), where h_t is the "EOS" token embedding from the text encoder, h_i is the "CLS" token embedding from the image encoder, and P_t and P_i are the linear transformations for the text and image embeddings, respectively. Since h_t lives in the text feature space used for the grounded text features, to ease training we project the image feature into the text feature space via P_t^⊤ P_i h_i and rescale it to norm 28.7, which is the average norm of h_t that we found empirically. We also set the maximum number of grounding tokens to 30; thus there are 60 tokens in total if one keeps both the image and the text as representations for a grounded entity.
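A minimal sketch of this projection is shown below, assuming an OpenAI-CLIP-style model that exposes `encode_image` (which already applies P_i) and a `text_projection` matrix (P_t); those attribute names, and the helper itself, are assumptions for illustration.

```python
import torch

@torch.no_grad()
def image_grounding_feature(clip_model, image, target_norm=28.7):
    """Project a CLIP image embedding into the CLIP text feature space (a sketch)."""
    h_i = clip_model.encode_image(image)            # (B, embed_dim), P_i already applied
    feat = h_i @ clip_model.text_projection.T       # P_t^T (P_i h_i): back to text-token space
    # rescale to the empirically observed average norm of the text "EOS" features (28.7)
    feat = feat / feat.norm(dim=-1, keepdim=True) * target_norm
    return feat                                     # used like a grounded text feature
```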
Keypoint Grounding Tokens. The grounding token for keypoint annotations is processed in the same way, except that we also learn N person token embedding vectors {p_1, . . . , p_N} to semantically link the keypoints belonging to the same person. This handles the situation in which there are multiple people in the same image that we want to generate, so that the model knows which keypoint corresponds to which person. Each keypoint semantic embedding k_e is a learnable vector; the dimension of each person token is set to the same as the keypoint embedding dimension. The grounding token is calculated by

h^e = MLP(k_e + p_j, Fourier(l)),     (11)

where l is the (x, y) location of each keypoint and p_j is the person token for the j-th person. In practice, we set N to 10, which is the maximum number of persons allowed to be generated in each image. Thus, we have 170 tokens for the COCO dataset (i.e., 10 × 17; 17 keypoint annotations per person).
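A sketch of Eq. (11) is given below, reusing the `FourierEmbedder` from the box-token sketch above; the person/keypoint counts follow the text, while the MLP depth and the random initialization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class KeypointGroundingTokens(nn.Module):
    """Sketch of Eq. (11): one grounding token per (person, keypoint) slot."""
    def __init__(self, num_persons=10, num_kpts=17, emb_dim=768, fourier_dim=64, hidden=512):
        super().__init__()
        self.person_tokens = nn.Parameter(torch.randn(num_persons, emb_dim))   # p_j
        self.kpt_embeddings = nn.Parameter(torch.randn(num_kpts, emb_dim))     # k_e
        # (x, y) -> 64 Fourier dims; FourierEmbedder is defined in the box sketch above
        self.fourier = FourierEmbedder(num_freqs=fourier_dim // (2 * 2))
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim + fourier_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, emb_dim),
        )

    def forward(self, locations):          # locations: (B, 10, 17, 2) normalized (x, y)
        B = locations.shape[0]
        sem = self.kpt_embeddings[None, None] + self.person_tokens[None, :, None]  # k_e + p_j
        sem = sem.expand(B, -1, -1, -1)                                            # (B, 10, 17, emb)
        loc = self.fourier(locations)                                              # (B, 10, 17, 64)
        tokens = self.mlp(torch.cat([sem, loc], dim=-1))
        return tokens.flatten(1, 2)                                                # (B, 170, emb)
```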
Figure 8. Additional grounding input is fed into the Unet input for spatially aligned conditions.

Tokens for Spatially Aligned Condition. This type of condition includes the edge map, depth map, semantic map, normal map, etc.; they can be represented as a C × H × W tensor. We resize the spatial size to 256 × 256 and use ConvNeXt-Tiny [43] as the backbone to output a feature map of spatial size 8 × 8, which is then flattened into 64 grounding tokens. We notice that training is faster if we also provide the grounding condition l to the Unet input. As shown in Figure 8, in this case the input is CONCAT(f_l(l), z_t), where f_l is a simple downsampling network that reduces l to the same spatial dimension as z_t, the noisy latent code at time step t (64 × 64 for Stable Diffusion). In this case, the first conv layer of the Unet needs to be trainable.
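The spatially aligned conditions could be tokenized as in the following sketch, which pairs a ConvNeXt-Tiny backbone (256 × 256 → 8 × 8 → 64 tokens) with a small downsampling network f_l whose output is concatenated with z_t at the Unet input. The condition is assumed here to be rendered as a 3-channel image, and the layer choices for f_l are ours.

```python
import torch
import torch.nn as nn
from torchvision.models import convnext_tiny

class SpatialConditionTokens(nn.Module):
    """Sketch: turn a 3 x 256 x 256 condition map into 64 grounding tokens
    plus a downsampled map concatenated with the noisy latent z_t."""
    def __init__(self, token_dim=768, latent_channels=4):
        super().__init__()
        self.backbone = convnext_tiny(weights=None).features   # 256 -> 8 x 8, 768 channels
        self.proj = nn.Linear(768, token_dim)
        self.downsample = nn.Sequential(                        # f_l: 256 -> 64 spatial size
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(16, latent_channels, 3, stride=2, padding=1),
        )

    def forward(self, cond, z_t):            # cond: (B, 3, 256, 256), z_t: (B, 4, 64, 64)
        feat = self.backbone(cond)            # (B, 768, 8, 8)
        tokens = self.proj(feat.flatten(2).transpose(1, 2))      # (B, 64, token_dim)
        unet_in = torch.cat([self.downsample(cond), z_t], dim=1)  # CONCAT(f_l(l), z_t)
        return tokens, unet_in
```

Note that the Unet's first conv layer must be widened to accept the extra channels and, as stated above, made trainable.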
Gated Self-Attention Layers. Our inserted self-attention layer is the same as the original diffusion model self-attention layer at each Transformer block, except that we add one linear projection layer that converts the grounding token to the same dimension as the visual token. For example, in the first layer of the down branch of the UNet [54], the projection layer converts grounding tokens of dimension 768 into 320 (the image feature dimension at this layer), and the visual tokens are concatenated with the grounding tokens as the input to the gated self-attention layer.
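A minimal sketch of the inserted layer is shown below; the tanh-gated residual with a zero-initialized γ follows the description (Eq 8 in the main paper), while the exact normalization and attention implementation used here are assumptions.

```python
import torch
import torch.nn as nn

class GatedSelfAttentionBlock(nn.Module):
    """Sketch of the inserted gated self-attention layer (not the released code)."""
    def __init__(self, visual_dim=320, grounding_dim=768, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(grounding_dim, visual_dim)   # grounding token -> visual dim
        self.norm = nn.LayerNorm(visual_dim)
        self.attn = nn.MultiheadAttention(visual_dim, num_heads, batch_first=True)
        self.gamma = nn.Parameter(torch.zeros(1))          # learnable gate, starts closed

    def forward(self, x, grounding):          # x: (B, HW, 320), grounding: (B, M, 768)
        g = self.proj(grounding)
        tokens = self.norm(torch.cat([x, g], dim=1))
        out, _ = self.attn(tokens, tokens, tokens)
        out = out[:, : x.shape[1]]                         # keep only the visual positions
        return x + torch.tanh(self.gamma) * out            # gated residual
```

Because γ is initialized to zero, the frozen pretrained model's behavior is unchanged at the start of training; the new layers only gradually contribute as γ is learned.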

Figure 9. Three different types of grounding data for box (panels: Grounding Data, Detection Data, Detection + Caption Data; example captions: "A bride and groom are about to cut their wedding cake", [PAD], "A living room has a glowing brick fireplace").

Figure 10. Inpainting results (columns: Real Input, DALL·E 2, Stable Diffusion, Ours). Existing text2img diffusion models may generate objects that do not tightly fit the masked box or miss an object if the same object already exists in the image.

Model       | 1%-3% | 5%-10% | 30%-50%
LDM [53]    | 25.9  | 23.4   | 14.6
GLIGEN-LDM  | 29.7  | 30.9   | 25.6
Upper-bound | 41.7  | 43.4   | 45.0

Table 4. Inpainting results (YOLO AP) for different sizes of objects.
Training Details. For all COCO-related experiments (Sec 5.1 in the main paper), we train the LDM with batch size 64 using 16 V100 GPUs for 100k iterations. In the scaling-up training data experiment (Sec 5.2 of the main paper), we train for 400k iterations for the LDM, and for 500k iterations with batch size 32 for the Stable Diffusion model. For all training, we use a learning rate of 5e-5 with Adam [30] and warm up for the first 10k iterations. We randomly drop the caption and grounding tokens with 10% probability for classifier-free guidance [21].
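The 10% condition dropping could be implemented along the lines of the sketch below; whether the caption and grounding tokens are dropped independently, and how a dropped condition is represented (a null-caption embedding versus zeroed grounding tokens), are assumptions here.

```python
import torch

def drop_conditions_for_cfg(caption_emb, grounding_tokens, p_drop=0.1, null_caption=None):
    """Sketch: randomly drop conditions during training to enable classifier-free guidance."""
    B = caption_emb.shape[0]
    drop_cap = torch.rand(B, device=caption_emb.device) < p_drop
    drop_grd = torch.rand(B, device=caption_emb.device) < p_drop

    if null_caption is not None:
        # replace dropped captions with the null-caption embedding
        caption_emb = torch.where(drop_cap[:, None, None], null_caption, caption_emb)
    # zero out the grounding tokens for dropped samples
    grounding_tokens = grounding_tokens * (~drop_grd)[:, None, None].float()
    return caption_emb, grounding_tokens
```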
Data Details. In the main paper Sec 5.1, we study three different types of data for box grounding. The training data requires both a text c and grounding entities e as the full condition. In practice, we can relax the data requirement by considering more flexible inputs, i.e., the three types of data shown in Figure 9. (i) Grounding data. Each image is associated with a caption describing the whole image; noun entities are extracted from the caption and labeled with bounding boxes. Since the noun entities are taken directly from the natural language caption, they cover a much richer vocabulary, which is beneficial for open-world vocabulary grounded generation. (ii) Detection data. Noun entities are pre-defined closed-set categories (e.g., the 80 object classes in COCO [41]). In this case, we use a null caption token, as introduced in classifier-free guidance [21], for the caption. The detection data is of larger quantity (millions) than the grounding data (thousands) and can therefore greatly increase the overall training data. (iii) Detection and caption data. Noun entities are the same as in the detection data, and the image is described separately with a text caption. In this case, the noun entities may not exactly match those in the caption. For example, in Figure 9, the caption only gives a high-level description of the living room without mentioning the objects in the scene, whereas the detection annotation provides more fine-grained object-level details.
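For concreteness, one training sample for each of the three data types could look like the following; the field names and box coordinates are purely illustrative, and only the example captions are taken from Figure 9.

```python
# Illustrative training-sample layouts for the three box-grounding data types.
grounding_sample = {
    "image": "img_0001.jpg",
    "caption": "A bride and groom are about to cut their wedding cake",
    "entities": [("wedding cake", [0.42, 0.55, 0.71, 0.88]),   # (noun entity, xyxy box)
                 ("person", [0.05, 0.10, 0.45, 0.95])],
}

detection_sample = {
    "image": "img_0002.jpg",
    "caption": None,                                            # null caption token is used
    "entities": [("couch", [0.10, 0.50, 0.60, 0.90]),
                 ("lamp", [0.70, 0.20, 0.80, 0.55])],
}

detection_caption_sample = {
    "image": "img_0003.jpg",
    "caption": "A living room has a glowing brick fireplace",   # may not mention the boxed objects
    "entities": [("couch", [0.10, 0.50, 0.60, 0.90]),
                 ("person", [0.62, 0.35, 0.78, 0.92])],
}
```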
B. Ablation Study

Ablation on gated self-attention. As shown in the main paper Figure 3 and Eq 8, our approach uses gated self-attention to absorb the grounding instruction. We can also consider gated cross-attention [1], where the query is the visual feature and the keys and values are produced from the grounding condition. We ablate this design on the COCO2014CD data using the LDM. Compared with Table 1 of the main paper, we find that it leads to a similar FID (5.8) but a worse YOLO AP of 16.6 (compared to 21.7 for self-attention in the table). This shows the necessity of information sharing among the visual tokens, which exists in self-attention but not in cross-attention.

Ablation on null caption. We choose to use the null caption when we only have detection annotations (COCO2014D). An alternative scheme is to simply combine all noun entities into a sentence; e.g., if there are two cats and a dog in an image, then the pseudo caption can be: "cat, cat, dog". In this case, the FID becomes worse, increasing to 7.40 from 5.61 (null caption; refer to Table 1 of the main paper). This is likely because the pretrained text encoder never encountered this type of unnatural caption during LDM training. A solution would be to finetune the text encoder or design a better prompt, but this is not the focus of our work.

Ablation on Fourier embedding. In Eq 5, we replace the Fourier embedding with an MLP embedding and conduct an experiment using the COCO2014CD data format (Table 1). In this case, the image quality (FID) is similar (Fourier/MLP: 5.82/5.80); however, the layout correspondence (YOLO AP) is much worse (Fourier/MLP: 21.7/3.2).

Figure 11. Layout2img comparison (rows: Input, LostGAN-v2, TwFA, Ours (LDM), Ours (Stable Diffusion)). Our model generates better quality images, especially when using Stable Diffusion. Baseline images are all copied from TwFA [71].

Figure 12. Keypoint results (columns: Real Input, pix2pixHD, Ours (w/o caption), Ours (w caption)). Our model generates higher quality images conditioned on keypoints, and it allows using a caption to specify details such as the scene or gender.

C. Grounded inpainting

C.1. Text Grounded Inpainting

Like other diffusion models, GLIGEN can also be used for the inpainting task by replacing the known region with a sample from q(z_t|z_0) after each sampling step, where z_0 is the latent representation of an image [53]. One can ground text descriptions to missing regions, as shown in Figure 10. In this setting, however, one may wonder: can we simply use a vanilla text-to-image diffusion model such as Stable Diffusion or DALL·E 2 to fill the missing region by providing the object name as the caption? What are the benefits of having extra grounding inputs in such cases? To answer this, we conduct the following experiment on the COCO dataset: for each image, we randomly mask one object and then let the model inpaint the missing region. We choose the missing object with three different size ratios with respect to the image: small (1%-3%), medium (5%-10%), and large (30%-50%). 5000 images are used for each case.

Table 4 demonstrates that our inpainted objects more tightly occupy the missing region (box) compared to the baselines. Fig. 10 provides examples to visually compare the inpainting results (we use Stable Diffusion for better quality). The first row shows that the baselines' generated objects do not follow the provided box. The second row shows that when the missing category is already present in the image, they may ignore the caption. This is understandable, as the baselines are trained to generate a whole image following the caption. Our method may be more favorable for editing applications, where a user might want to generate an object that fully fits the missing region or add an instance of a class that already exists in the image.
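A minimal sketch of the known-region replacement used for text-grounded inpainting is given below, assuming a diffusers-style scheduler that exposes an `add_noise` method; `mask` marks the region to be generated.

```python
import torch

def inpainting_step(z_t_pred, z_0_known, mask, scheduler, t):
    """Sketch: after each denoising step, overwrite the known region with a sample
    from q(z_t | z_0) so that only the masked region is actually generated."""
    noise = torch.randn_like(z_0_known)
    z_t_known = scheduler.add_noise(z_0_known, noise, t)   # sample from q(z_t | z_0)
    return mask * z_t_pred + (1.0 - mask) * z_t_known
```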
C.2. Image Grounded Inpainting

As we previously demonstrated, one can ground text to a missing region for inpainting; one can also ground reference images to missing regions. Figure 13 shows inpainting results grounded on reference images. To remove boundary artifacts, we follow GLIDE [45] and modify the first conv layer by adding 5 extra channels (4 for z_0 and 1 for the inpainting mask), making them trainable along with the newly added layers.
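The first-conv modification could be done as in the sketch below; zero-initializing the new channels, so the pretrained behavior is preserved at the start of finetuning, is our assumption.

```python
import torch
import torch.nn as nn

def expand_unet_conv_in(conv_in: nn.Conv2d, extra_channels: int = 5) -> nn.Conv2d:
    """Sketch: widen the Unet's first conv to take 5 extra channels
    (4 for the known latent z_0 and 1 for the inpainting mask)."""
    new_conv = nn.Conv2d(conv_in.in_channels + extra_channels, conv_in.out_channels,
                         kernel_size=conv_in.kernel_size, stride=conv_in.stride,
                         padding=conv_in.padding)
    with torch.no_grad():
        new_conv.weight.zero_()                                   # new channels start at zero
        new_conv.weight[:, : conv_in.in_channels] = conv_in.weight  # copy original weights
        new_conv.bias.copy_(conv_in.bias)
    return new_conv
```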
D. Study for Keypoints Grounding

Although we have thus far demonstrated results with bounding boxes, our approach is flexible in the grounding condition it can use for generation. To demonstrate this, we next evaluate our model with another type of grounding condition: human keypoints. We use the COCO2017 dataset. We compare with pix2pixHD [68], a classic image-to-image translation model. Since pix2pixHD does not take captions as input, we train two variants of our model: one uses COCO captions, the other does not. In the latter case, the null caption is used as input to the cross-attention layer for a fair comparison.

Fig. 12 shows the qualitative comparison. Clearly, our method generates much better image quality. For our model trained with captions, we can also specify other details such as the scene ("A person is skiing down a snowy hill") or the person's gender ("A woman is holding a baby"). These two inputs complement each other and can enrich a user's controllability for image creation. We measure keypoint correspondence (similar to the YOLO score for boxes) by running a MaskRCNN [18] keypoint detector on the generated images. Both of our model variants produce similar results; see Table 6.
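This keypoint-correspondence measurement could be run with an off-the-shelf detector, e.g. torchvision's Keypoint R-CNN as sketched below; the AP computation against the conditioning keypoints is omitted, and the specific detector used for our numbers is not implied by this sketch.

```python
import torch
from torchvision.models.detection import keypointrcnn_resnet50_fpn

# Sketch: detect keypoints on generated images to score keypoint correspondence.
detector = keypointrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect_keypoints(images):            # images: list of (3, H, W) tensors in [0, 1]
    outputs = detector(images)
    return [(o["keypoints"], o["scores"]) for o in outputs]
```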

Model       | Pre-training data   | Training data | FID    | AP   | APr  | APc  | APf
LAMA [40]   | –                   | LVIS          | 151.96 | 2.0  | 0.9  | 1.3  | 3.2
GLIGEN-LDM  | COCO2014CD          | –             | 22.17  | 6.4  | 5.8  | 5.8  | 7.4
GLIGEN-LDM  | COCO2014D           | –             | 31.31  | 4.4  | 2.3  | 3.3  | 6.5
GLIGEN-LDM  | COCO2014G           | –             | 13.48  | 6.0  | 4.4  | 6.1  | 6.6
GLIGEN-LDM  | GoldG,O365          | –             | 8.45   | 10.6 | 5.8  | 9.6  | 13.8
GLIGEN-LDM  | GoldG,O365,SBU,CC3M | –             | 10.28  | 11.1 | 9.0  | 9.8  | 13.4
GLIGEN-LDM  | GoldG,O365,SBU,CC3M | LVIS          | 6.25   | 14.9 | 10.1 | 12.8 | 19.3
Upper-bound | –                   | –             | –      | 25.2 | 19.0 | 22.2 | 31.2

Table 5. GLIP-score on LVIS validation set. Upper-bound is provided by running GLIP on real images scaled to 256 × 256.

Figure 13. Image grounded inpainting. One can use reference images to ground the missing regions one wants to fill in.

Model                | FID   | AP   | AP50 | AP75
pix2pixHD [68]       | 142.4 | 15.8 | 33.7 | 13.0
GLIGEN (w/o caption) | 31.02 | 31.8 | 53.5 | 31.0
GLIGEN (w caption)   | 27.34 | 31.5 | 52.9 | 31.0
Upper-bound          | –     | 62.4 | 75.0 | 65.9

Table 6. Conditioning with Human Keypoints evaluated on the COCO2017 validation set. Upper-bound is calculated on real images scaled to 256 × 256.

(AP, AP50, AP75 below report the YOLO score.)
Model                   | FID   | AP    | AP50  | AP75
LostGAN-V2 [62]         | 42.55 | 9.1   | 15.3  | 9.8
OCGAN [64]              | 41.65 | –     | –     | –
HCSS [25]               | 33.68 | –     | –     | –
LAMA [40]               | 31.12 | 13.40 | 19.70 | 14.90
TwFA [71]               | 22.15 | –     | 28.20 | 20.12
GLIGEN-LDM              | 21.04 | 22.4  | 36.5  | 24.1
After pretraining on GoldG,O365,SBU,CC3M:
GLIGEN-LDM (zero-shot)  | 27.03 | 19.1  | 30.5  | 20.8
GLIGEN-LDM (finetuned)  | 21.58 | 30.8  | 42.3  | 35.3

Table 7. Image quality and correspondence to layout are compared with baselines on the COCO2017 val-set.

E. Additional quantitative results

In this section, we show more studies with our pretrained model using our largest data (GoldG, O365, CC3M, SBU). We had reported this model's zero-shot performance on LVIS [15] in the main paper Table 3. Here we finetune this model on LVIS and report its GLIP-score in Table 5. Clearly, after finetuning, we obtain much more accurate generation results, surpassing the supervised baseline LAMA [40] by a large margin.

Similarly, we also test this model's zero-shot performance on the COCO2017 val-set; its finetuning results are in Table 7. The results show the benefits of pretraining, which can largely improve layout correspondence performance.

F. Analysis on GLIGEN

To better understand GLIGEN, we choose to study the box grounded model. Specifically, we visualize the attention maps within the gated self-attention layer and examine how the learnable γ in Eq 8 changes during training.

Figure 14. Attention maps in one gated self-attention layer (grounding tokens: teddy, bird; attention heads 0-7). The visualization results are from the sample at the first time step (i.e., Gaussian noise) in the middle layer of the Unet.

In Figure 14, we first show a generation result using two grounding tokens (teddy bear; bird). Next to it, we visualize the attention maps of our added layers between the visual features and the two grounding tokens for all 8 heads of one middle layer of the UNet. Even at the first sampling step (the input is Gaussian noise), the visual features start to attend to the grounding tokens with the correct spatial correspondence. This correspondence fades away in later sampling steps, which is aligned with our scheduled sampling technique, where we find the rough layout is decided in the early sampling steps.

We also find the attention maps of the early layers of the UNet to be less interpretable at all sampling steps. We hypothesize that this is due to the lack of a positional embedding for the visual tokens, whereas position information can leak into later layers through the zero padding of the Conv layers. This might suggest that adding a positional embedding during diffusion model pretraining (e.g., Stable Diffusion training) could benefit downstream adaptation.

Figure 15. The learnable γ in the gated self-attention layer in the middle of the Unet changes during training.

Figure 15 shows how the learned γ at this layer (Eq 8) changes during training. We empirically find that the model starts to learn the correspondence around 60-70k iterations (around the peak in the plot). We hypothesize that the model first focuses on learning spatial correspondence at the beginning of training, and then dampens the new layers' contribution so that it can focus on image quality and details, since the original weights are fixed.
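For reference, a forward hook like the one below could be used to dump the attention maps of a gated self-attention layer, assuming the layer is implemented along the lines of the sketch in Appendix A (an `attn` sub-module that returns attention weights, as nn.MultiheadAttention does); this is illustrative and not the instrumentation used for Figure 14.

```python
import torch

attention_maps = []

def save_attention(module, inputs, output):
    # nn.MultiheadAttention returns (attn_output, attn_weights); weights are
    # averaged over heads by default.
    attention_maps.append(output[1].detach().cpu())

def register_attention_hook(gated_layer):
    """Attach the hook to a GatedSelfAttentionBlock-style layer."""
    return gated_layer.attn.register_forward_hook(save_attention)
```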

G. More qualitative results

We show qualitative comparisons with layout2img baselines in Figure 11, which complements the results in Sec 5.1 of the main paper. The results show that our model has comparable image quality when built upon the LDM, and more visual appeal and detail when built upon the Stable Diffusion model.

Lastly, we show more grounded text2img results with bounding boxes in Figure 16 and other modality grounding results in Figures 17-22. Note that our keypoint model only uses the keypoint annotations from COCO [41], which are not linked to person identity, yet it can successfully utilize and combine the knowledge learned in the text2img training stage to control the keypoints of a specific person. Out of curiosity, we also tested whether the keypoint grounding information learned on humans can be transferred to other non-humanoid categories, such as a cat or a lamp, for keypoint grounded generation, but we find that our model struggles in such cases even with scheduled sampling. Compared to bounding boxes, which only specify a coarse location and size of an object in the image and thus can be shared across all object categories, keypoints (i.e., object parts) are not always shareable across different categories. Thus, while keypoints enable more fine-grained control than boxes, they are less generalizable.

Caption: “Space view of a planet and its sun”
Grounded text: planet, sun

Caption: “a photo of a hybrid between a bee and a rabbit”


Grounded text: hybrid between a bee and a rabbit, flower

Caption: “cartoon sketch of a little girl with a smile and balloons, old style, detailed, elegant, intricate”
Grounded text: girl with a smile, balloon, balloon, balloon

Caption: “Walter White in GTA v”


Grounded text: Walter White, car, bulldog

Caption: “two pirate ships on the ocean in minecraft”


Grounded text: a pirate ship, a pirate ship

Figure 16. Bounding box grounded text2image generation. Our model can ground noun entities in the caption for controllable image generation.

Caption: “Steve Jobs is working with his laptop”
Grounded keypoints: plotted dots on the left

Caption: “Barack Obama is sitting at a desk”


Grounded keypoints: plotted dots on the left

Figure 17. Results for keypoints grounded generation.

Caption: “a small church is sitting in a garden”


Grounded hed map: the left image

Caption: “fox wallpaper, digit art, colorful”


Grounded hed map: the left image

Figure 18. Results for HED map grounded generation.

Caption: “A Humanoid Robot Designed for Companionship”
Grounded canny map: the left image

Caption: “a chair and a table”


Grounded canny map: the left image

Figure 19. Results for canny map grounded generation.

Caption: “a busy street with many people”


Grounded depth map: the left image

Caption: “a butterfly, ultra details”


Grounded depth map: the left image

Figure 20. Results for depth map grounded generation.

Caption: “a long hallway with pipes on the ceiling”
Grounded normal map: the left image

Caption: “the front of a building ”


Grounded normal map: the left image

Figure 21. Results for normal map grounded generation.

Caption: “a man is drawing”


Grounded semantic map: the left image

Caption: “a photo of a bedroom”


Grounded semantic map: the left image

Figure 22. Results for semantic map grounded generation.

