
Mini DALL·E 3: Interactive Text to Image by Prompting Large Language Models

Zeqiang Lai 1, Xizhou Zhu 2,3, Jifeng Dai 2,3, Yu Qiao 2, Wenhai Wang 4
1 Beijing Institute of Technology  2 Shanghai AI Laboratory  3 Tsinghua University  4 CUHK
https://minidalle3.github.io/

arXiv:2310.07653v2 [cs.AI] 12 Oct 2023

[Figure 1: two multi-turn conversations with Mini DALL·E 3, in which the user asks "Hello! What are you?", "Can you generate a cute dog?", "Could you make some sticker?", "Can you make it run near the river?", "What would this dog's house look like?", "Don't be a sticker", "Don't be a 3D model!", and "Where is the dog?", and the agent replies with interleaved text and images.]

Figure 1. Examples of two interactive text-to-image conversations produced by Mini DALL·E 3. In these cases, people can ask the agent to generate images via natural language and request an edit if the results are unsatisfactory. The generation and editing can be completed in a multi-turn dialog with recognition of the conversation context.

Abstract

The revolution of artificial intelligence content generation has been rapidly accelerated by the booming text-to-image (T2I) diffusion models. Within just two years of development, state-of-the-art models can generate images of unprecedented quality, diversity, and creativity. However, a prevalent limitation persists in communicating effectively with these popular T2I models, such as Stable Diffusion, using natural language descriptions, which typically makes an engaging image hard to obtain without expertise in prompt engineering with complex word compositions, magic tags, and annotations. Inspired by the recently released DALL·E 3, a T2I model built directly into ChatGPT that speaks human language, we revisit existing T2I systems that endeavor to align with human intent and introduce a new task, interactive text to image (iT2I), where people can interact with an LLM for interleaved high-quality image generation/editing/refinement and question answering with stronger image-text correspondence, using natural language. To address the iT2I problem, we present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models. We evaluate our approach for iT2I in a variety of commonly used scenarios under different LLMs, e.g., ChatGPT, LLAMA, Baichuan, and InternLM. We demonstrate that our approach can be a convenient and low-cost way to introduce the iT2I ability to any existing LLM and any text-to-image model without any training, while bringing little degradation of the LLMs' inherent capabilities in, e.g., question answering and code generation. We hope this work can draw broader attention and provide inspiration for boosting user experience in human-machine interactions, alongside the image quality, of next-generation T2I systems.

Preliminary version. Work in Progress.

Figure 2. The evolution of image generation systems: 1) unconditioned (StyleGAN); 2) class conditioned (CVAE, cGAN); 3) text conditioned (Stable Diffusion, GigaGAN, DALL·E 2, ...); 4) tab/node/click-based UX (Stable Diffusion WebUI, ComfyUI, InvokeAI, Midjourney); 5-I) point/drag-based (DragGAN, DragDiffusion); 5-II) interactive T2I (DALL·E 3, Mini DALL·E 3).

1. Introduction

The evolution of artificial intelligence content generation has been significantly accelerated by the proliferation of text-to-image (T2I) diffusion models [18, 20, 41, 43]. Within just two years of rapid development since 2021, state-of-the-art T2I models [4, 13, 39-41, 43, 55] can generate images of unprecedented quality, diversity, and creativity. For the first time, "talk to paint" is no longer a daydream, and complex surrealistic art can be generated via textual descriptions, with stronger expressive ability than previous unconditional and class-conditional image generation systems, as shown in Fig. 2.

However, most of the existing T2I models, such as Stable Diffusion [41], are still limited in understanding natural language. In other words, people have to learn to write complex text prompts to obtain the best results; such prompts fit the underlying models but are not necessarily user-friendly or straightforward for humans, as illustrated by Fig. 3. As a result, an engaging image is typically hard to obtain without expertise in prompt engineering, with proper word compositions and sometimes odd phrase organizations. Besides, there are dozens of textual and numerical configurations in a diffusion-based T2I pipeline, such as the CFG scale, word weighting, negative prompts, and style keywords, which are also complicated for non-professional users.

To make it easier for users to utilize T2I models, Stable Diffusion (SD) WebUI [2] was first created to provide a user-friendly web UI for accessing the latest techniques without any coding. However, a typical workflow for generating a satisfactory image usually involves several stages, e.g., generation, variation, and super-resolution, which makes the tab-based interface of SD-WebUI somewhat awkward to use. Therefore, ComfyUI (https://github.com/comfyanonymous/ComfyUI) was designed around a graph/nodes interface that connects different stages via nodes and edges, making workflows clearer. Nevertheless, these software tools still do not solve the problem of the complicated configurations required for a charming image. This urged the development of Fooocus (https://github.com/lllyasviel/Fooocus), a tool with many built-in optimizations and quality improvements. Fooocus frees users from complex parameter tuning, but it still requires them to write a proper and precise text prompt for the desired images. This can be challenging in some cases, such as when the required scenes are artistic conceptions rather than specific objects, or when users have no idea how to describe what they want to generate.

Generally, it might be difficult for users to come up with the right prompts and configurations at once, but it is much easier to tell what they want or do not want via natural language if the first version is unsatisfactory, e.g., "Don't be a sticker" and "Where is the dog?", as shown in Fig. 1. Moreover, it would be more straightforward to hold a multi-turn conversation with T2I models to iterate on the images over and over again, mimicking the communication process between human designers and their customers. These analyses reveal a promising direction for building the next generation of T2I systems with a new human-machine interface based on natural language: a system that is able to infer users' intentions and automatically generate the proper text prompts by leveraging the reasoning abilities of large language models (LLMs). This is not only because natural language is the easiest interface that everyone can master, but also because it frees users from brainstorming sophisticated textual descriptions and requires only simple instructions instead (see Fig. 3 for more illustrations).

Inspired by the recently released demo of DALL·E 3 [35], a powerful T2I model built directly into ChatGPT that understands human language, we revisit existing techniques aimed at aligning human intent in image generation and introduce a new task called interactive text to image (iT2I). This task is characterized by several aspects: 1) Multi-Turn: users are allowed to chat with the system (typically powered by LLMs) to progressively specify requirements, shortcomings, and suggestions for the expected/generated images; 2) Consistency: the ability to keep identity for consistent multi-turn image editing, series character creation, etc.; 3) Composability: the ability to be composed with/built into existing chat assistants for interleaved image generation and (visual) question answering for a seamless user experience.
Figure 3. Illustrations of different human-machine interfaces for T2I systems. (1) Description with the dialect of Stable Diffusion. Prompt: "Beatrix Potter style watercolor. By Henry Cavill, Chibi style, cartoonish, they are in a rural school, landscape of pastel colors. <lora:xl_more_art-full_v1:0.5>"; Negative Prompt: "<lora:badhands:1> ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, extra limbs, body out of frame, blurry, bad anatomy, blurred, watermark, grainy, signature, cut off, draft, closed eyes, text, logo". (2) Description with human/natural language: "A cartoonish rabbit walking on a rural road to school, in the style of watercolor." (3) Instruction with human/natural language: #1 "I want to draw a student rabbit." #2 "He is walking on a rural road." #3 "The style should be cartoon and watercolor."

All these properties make iT2I systems powerful tools for a wide range of applications, from content generation and design to interactive storytelling and more.

As an initial solution to this problem, we propose a simple yet effective approach that enhances language models for iT2I using prompting techniques and pretrained text-to-image models. Specifically, we prompt the LLM, instructing it to express an image as an intermediate textual description enclosed by special tags. After detecting the special tags, the description is parsed and transformed through a prompt refinement module. Then, a pretrained T2I model is employed to generate the image. We evaluate our approach across various common use cases and different language models, such as ChatGPT [7, 36], LLAMA [48], Baichuan [56], and InternLM [46]. Our results demonstrate that our approach can easily enable iT2I capabilities in any existing language model and text-to-image model without the need for additional training. Furthermore, it has minimal impact on the language models' inherent abilities in question answering and code generation.

We hope this work can draw broader attention and provide inspiration for boosting user experience in human-machine interactions alongside the image quality of next-generation T2I models.

2. Related Works

Text-to-Image Generation. Text-to-image (T2I) generation is a widely explored research area at the intersection of computer vision and natural language processing. Notable approaches include generative models, such as Variational Autoencoders (VAE) [22, 47], Generative Adversarial Networks (GAN) [17, 21], and autoregressive models [12], which enable image synthesis guided by textual descriptions. Recent multimodal models like CLIP [38] and DALL·E [39] have further improved the alignment between text and generated images, while the birth and development of diffusion models [4, 13, 40, 41, 43, 55] have pushed the boundaries of text-image interactions.

Image Generation Interface. There are a variety of approaches for image generation and editing, each with its own merits and drawbacks. The most straightforward ones are text-based approaches, where people write text prompts for either image generation [40, 41] or image editing [6, 61]. Image-based approaches are also popular: people either provide a reference image and ask the T2I model to generate image variations [40, 59], provide edge/depth maps to control the image layout [27, 34, 64], perform image translation with a style image [1, 45], or ask for images of a given subject [25, 57]. To facilitate precise control, point-based approaches [31, 49] are widely adopted by utilizing state-of-the-art localization methods [23, 30]. Recently, drag-based approaches [11, 28, 29, 33, 37, 44, 62] have also been proposed for a more interactive experience. As for UX design, there are Rich-T2I [15] and DialogPaint [53], which share a similar spirit with ours. In the literature on integrating T2I into LLMs, there are NExT-GPT [54], GILL [24], DreamLLM [9], and SEED [16]. Although these methods also provide the capability for interleaved text-image generation, most of them are not specifically designed for iT2I and are limited in image quality and multi-turn correspondence.

Prompting LLMs. The in-context learning capability [7] is one of the strongest advantages of LLMs. It enables users to freely customize an LLM for a particular task or enhance its capabilities by simple prompting. For example, chain-of-thought [52] is the first prompting technique that enhances LLMs by asking them to generate a series of intermediate reasoning steps. Since then, a number of improved prompting techniques have leveraged the heuristics of majority voting [51], backtracking [58], and graph of thoughts [5]. In this work, we also provide an approach to augment LLMs with the iT2I ability via prompting, as it can be rapidly applied to any existing LLM without any training.
[Figure 4: a multi-turn conversation about a "superman cat" named Bob illustrating six interaction types: 1. Generation, 2. Referring Generation, 3. Selecting, 4. Editing, 5. Refinement, 6. Question Answering.]

Figure 4. Illustration of 6 types of interactions in the interactive text-to-image workflow.

3. Interactive Text to Image

Interactive Text to Image (iT2I) aims to provide a user-friendly approach to generating images that meet user requirements in an interactive manner. Users can instantiate a multi-turn dialogue between humans and AI agents, where they can communicate requirements, shortcomings, and suggestions for the generated or expected images in natural language.

3.1. Problem Definition

Precisely, the iT2I problem can be defined as the task of generating images from textual descriptions such that the generated images closely align with the provided text, ensuring that the generated visual content accurately represents the textual information. There are some notable properties of iT2I systems:

Multi-Turn refers to the ability of the system to engage in a dynamic and iterative dialogue with the user. Unlike traditional text-to-image systems that may generate a single image from a static textual input, multi-turn iT2I systems can accept multiple rounds of textual input, enabling users to refine and specify their visual requirements through an ongoing conversation. This property enhances the user experience and allows for more fine-grained control over the generated images.

Consistency means that these systems can automatically determine whether they should take into account not only the current textual input but also the previous visual context. It involves persisting the visual identity of images across different rounds of generation. This capability enables iT2I systems to perform consistent multi-turn image editing/refinement, produce personalized and contextually relevant objects/characters, etc.

Composability relates to the ability to combine or integrate image generation with other tasks. This means that the image generation ability should be modular and compatible with the inherent abilities of LLMs, allowing users to seamlessly compose them into interleaved conversations that query both textual and visual content.

3.2. Types of Instruction

As shown in Fig. 4, various instructions can be found in an iT2I system, such as generation, editing, selecting, and refinement. Different instructions can have varying levels of complexity when it comes to interpretation. Some instructions can be effectively addressed by leveraging the capabilities of an LLM, such as selecting, which primarily involves textual decision-making. However, certain instructions may necessitate a deeper synergy between the LLM and the T2I models.

Generation refers to the process of generating entirely new images based on a given textual description. In this context, the iT2I system creates images or illustrations from scratch, attempting to capture the essence and details of the provided textual input. It essentially transforms queries into neural representations or prompts for T2I models.

Referring generation is another variant of generation, where the system generates images that refer to, or are inspired by, existing objects, scenes, or concepts mentioned in the textual input and appearing in the context.
Selecting is a relatively straightforward instruction that involves choosing or picking from a set of pre-existing or freshly generated images based on the textual input.

Editing performs the task of modifying or refining existing images in response to textual instructions. This may involve altering specific attributes of an image, enhancing or diminishing certain features, or adapting the image to match the requirements outlined in the instruction.

Refinement means further enhancing or optimizing an existing image to better align with the textual description. While editing involves making specific modifications, refinement often involves fine-tuning the visual output to achieve a higher level of detail, realism, or accuracy in accordance with the provided textual guidance.

Question Answering is the inherent ability of LLMs. An iT2I system should preserve this ability as much as possible, as it is crucial for providing a coherent experience that interleaves images and text for users.

3.3. Discussion

In the literature on image editing and multi-modal LLMs, a number of works are closely related to iT2I. Most of these related works provide interactive interfaces. For example, InstructPix2Pix [6] and its follow-up works [63, 65] can be repeatedly applied to a single image to achieve multi-turn image editing. However, these interactive multi-turn abilities only apply to image editing instructions. There are also multi-modal LLMs [9, 16, 24, 54] that can generate responses with interleaved text and images, but most of them focus more on (visual) question answering with multi-modal responses rather than interactive image generation. The key vision of iT2I is to build a chat-based system that can respond to all image generation/editing instructions in a multi-turn, consistent, and composable manner. This is the major difference between iT2I and all previous works/tasks.

[Figure 5: an interactive chat ("Can you generate a cute dog?", "Good! Can you make it run near the river?") is answered by the LLM with "Sure, here is a dog for you: <image> a cute dog </image>" and "Sure, here it is: <edit> a cute dog running near the river </edit>" (or, alternatively, with [IMG1] [IMG2] [IMG3] ... [IMGL] embeddings); a router extracts the prompt/embedding and passes it through an adapter to the T2I model.]

Figure 5. Pipeline Overview. Mini-DALLE3 consists of two stages, with 1) a router that analyzes the response from the prompted/finetuned LLM and dispatches the demand for image generation if needed, and 2) an adapter that transforms the image embedding or descriptions for subsequent T2I models.

4. Mini-DALLE3

In this section, we depict a blueprint of an iT2I system, which we refer to as Mini-DALLE3. The overall architecture of Mini-DALLE3 is illustrated in Fig. 5, and it comprises several key components: an LLM, a router, an adapter, and T2I models. The LLM can be an existing text-only LLM, such as ChatGPT [36] or LLaMA [48], or a multi-modal LLM [50]. It is responsible for analyzing user intentions and producing the proper outputs as text or neural representations. The router automatically dispatches the parsed image representations (if any exist in the LLM output) to the image generation module. The adapter transforms the image representations to better fit the back-end T2I models. Depending on the type of image representation, the adapter can be a neural network, if the image representations are neural embeddings, or a prompt refinement module built from handcrafted rules or an LLM. Next, we illustrate a simple yet effective instantiation of the Mini-DALLE3 architecture by prompting large language models.
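To make the routing step concrete, the sketch below shows one possible way a router could detect the special <image>/<edit> tags in an LLM response and dispatch them to a text-based T2I backend. This is a minimal illustration under stated assumptions, not the authors' released implementation; the backend interface (a `t2i_backend` object with hypothetical `generate` and `edit` methods) is a placeholder for whatever T2I model is plugged in.

    import re

    # Matches the special tags the prompted LLM is asked to emit, e.g.
    # "<image> a cute dog </image>" or "<edit> a cute dog running near the river </edit>".
    TAG_PATTERN = re.compile(r"<(image|edit)>(.*?)</\1>", re.DOTALL)

    def route(llm_response: str, t2i_backend, previous_image=None):
        """Split an LLM response into text segments and image-generation calls.

        Returns a list of ("text", str) and ("image", image) chunks so that a chat
        front-end can render interleaved text and images.
        """
        chunks, cursor = [], 0
        for match in TAG_PATTERN.finditer(llm_response):
            # Keep the plain text that precedes the tag.
            if match.start() > cursor:
                chunks.append(("text", llm_response[cursor:match.start()]))
            kind, description = match.group(1), match.group(2).strip()
            if kind == "image" or previous_image is None:
                image = t2i_backend.generate(description)              # fresh generation
            else:
                image = t2i_backend.edit(description, previous_image)  # consistent edit
            previous_image = image
            chunks.append(("image", image))
            cursor = match.end()
        if cursor < len(llm_response):
            chunks.append(("text", llm_response[cursor:]))
        return chunks

In this reading, the router is purely textual, so it can be placed in front of any chat LLM without retraining; only the adapter/backend needs to know about the concrete T2I model.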
4.1. Multi-Turn Interaction by Prompting LLM

Multi-turn interaction lies at the heart of interactive text-to-image. It requires integrating the textual/visual context and understanding instructional, rather than descriptive, messages. To address this, we propose to leverage the strong context-understanding ability of LLMs by prompting them to pretend to generate images via textual descriptions. This intermediate textual description not only provides stronger flexibility to augment the system's capabilities with plug-and-play modules, such as prompt variation/refinement, but also enables us to utilize numerous pretrained LLMs and T2I models without heavy finetuning.

Image Generation as Function Call. Specifically, we utilize the few-shot prompt shown in Fig. 6 to transform the problem of multi-turn image generation into a problem of multi-turn textual description generation. Our prompt entails several key steps. Initially, we define the LM's role and explicitly convey to it that it possesses the ability to generate images. Subsequently, we request the LM to produce images by generating descriptive text enclosed within <image> tags. If the generated image exhibits a high degree of correlation with previous ones, the LM is instructed to generate "edit" rather than "image". Finally, we provide a small number of few-shot examples to further guide the LM's responses. Leveraging the robust in-context learning capabilities inherent in advanced LLMs, we observe that this approach yields favorable outcomes. The LM successfully generates images accompanied by coherent textual responses, as illustrated in Fig. 1. Importantly, these capabilities can be harnessed without the need for specialized training and can be swiftly integrated into existing LLMs.
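As a rough sketch of how "image generation as a function call" can be wired up with an off-the-shelf chat API, the snippet below prepends an abbreviated version of the Fig. 6 system prompt, runs one chat turn, and hands the tagged reply to the `route` helper sketched above. It assumes the OpenAI Python client (any of the other LLMs evaluated in the paper could be substituted) and abbreviates the system prompt; it is an illustration, not the exact released code.

    from openai import OpenAI

    # Abbreviated from the few-shot prompt in Fig. 6; the full prompt also contains
    # the two worked examples that demonstrate <image>/<edit> usage.
    SYSTEM_PROMPT = (
        "You are Mini-DALLE3, a smart AI assistant that can generate an image "
        "if the user requests one. Include the generated image by its description "
        "in the format \"<image> HERE IS THE DESCRIPTION </image>\". If the "
        "generated image is strongly correlated with a previous image, use the "
        "<edit> </edit> tag instead of <image> </image>."
    )

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    history = [{"role": "system", "content": SYSTEM_PROMPT}]

    def chat_turn(user_message: str, t2i_backend, previous_image=None):
        """One iT2I turn: query the LLM, then hand the tagged response to the router."""
        history.append({"role": "user", "content": user_message})
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo", messages=history
        ).choices[0].message.content
        history.append({"role": "assistant", "content": reply})
        # `route` (sketched earlier) parses <image>/<edit> tags and calls the T2I model.
        return route(reply, t2i_backend, previous_image)

Because the whole conversation history is sent back to the LLM on every turn, the model can resolve references such as "make it run near the river" against earlier descriptions without any extra machinery.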
    You are Mini-DALLE3, a smart AI assistant that can generate an image if the user requests one.
    You should include the generated image by its description with the format of "<image> HERE IS THE DESCRIPTION </image>".
    If the generated image is strongly correlated with a previous image, you should use the <edit> </edit> tag instead of <image> </image>.

    Here is an example

    User: My 5 year-old keeps talking about a "super-duper sunflower hedgehog" -- What does it look like?
    AI: It might look like this: <image> a super-duper sunflower hedgehog </image>
    User: My daughter says its name is Larry. Can I see more like this?
    AI: Sure, here it is <edit> a super-duper sunflower hedgehog </edit>
    User: She's going to love these! Can you show me Larry's house?
    AI: Of course, <edit> a super-duper sunflower hedgehog, standing in front of a house </edit>
    User: Larry is cute, what makes him so super-duper?
    AI: Larry is super-duper for a myriad of reasons! Here's why: 1. Sunflower Petals as Quills: Unlike other hedgehogs, Larry boasts sunflower petals as quills. This not only makes him stand out, but it also gives him a bright and cheerful demeanor. 2. Kind Hearted: Larry is known throughout the meadow for kindness.
    User: Awwww... can you show me Larry being "kind hearted"?
    AI: <edit> a super-duper sunflower hedgehog, kind hearted </edit>
    User: Can you design some stickers?
    AI: Sure, <edit> a sticker of a super-duper sunflower hedgehog </edit>

    Here is another example

    User: can you generate a dog?
    AI: Sure, <image> a cute dog </image>
    User: can you generate a cat?
    AI: Sure, <image> a cute cat </image>

Figure 6. The few-shot prompt for iT2I generation.

Task                      | GPT3.5-Turbo (Original) | GPT3.5-Turbo (Mini-DALLE3)
--------------------------|-------------------------|---------------------------
Abstract Algebra          | 42.42                   | 43.43
High School Physics       | 40.00                   | 38.67
Marketing                 | 88.41                   | 86.70
Philosophy                | 77.41                   | 70.65
College Computer Science  | 48.48                   | 42.42
Average                   | 59.34                   | 56.37

Table 1. Evaluation results of different models on the subtasks of MMLU, using the script from chain-of-thought-hub [14].
Prompt Refinement & Variations. Although we can generate textual descriptions that integrate contextual information by prompting LLMs, these descriptions might not be sufficient to generate high-quality images. Therefore, we propose to leverage another round of prompt refinement to transform the vanilla descriptions so that they better fit subsequent T2I models. It is worth noting that the refinement can also apply to embeddings if the intermediate representation is an embedding. In this instantiation, we perform the text transformation by prompting the LLM again with another few-shot prompt. Furthermore, we can perform prompt variation by repeatedly applying different prompt refinements, which is useful for responding to requests that ask for a list of images.
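A minimal sketch of such a refinement pass is given below, reusing the `client` from the earlier sketch. The refinement instruction and the style hints are illustrative placeholders rather than the paper's exact refinement prompt, which is not reproduced in this version of the document.

    REFINE_PROMPT = (
        "Rewrite the following image description into a detailed prompt for a "
        "text-to-image diffusion model. Keep the subject and context, and add "
        "concrete details about style, lighting, and composition."
    )

    def refine_description(description: str, style_hint: str = "") -> str:
        """Second LLM pass that turns a plain description into a richer T2I prompt."""
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": REFINE_PROMPT},
                {"role": "user", "content": f"{description}\n\nStyle hint: {style_hint}"},
            ],
        )
        return reply.choices[0].message.content

    def prompt_variations(description, hints=("watercolor", "flat vector", "photorealistic")):
        """Prompt variation: repeat the refinement with different hints to get several prompts."""
        return [refine_description(description, hint) for hint in hints]

Running `prompt_variations("a cute dog running near the river")` would then yield several differently styled prompts for the same scene, which is one way to serve a request for a list of candidate images.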
4.2. Hierarchical Content Consistency Control

Content consistency is another important aspect of an iT2I system. Although similar topics (subject-driven T2I, example-driven T2I, personalization, concept learning) are widely explored in the context of conventional T2I [26, 42, 57], only a few works explore multi-turn scenarios, and even fewer explore the integration of these abilities into a single unified system. Our decomposition makes it possible to utilize existing T2I models that were not designed for multi-turn scenarios. For example, the edited description for Prompt-to-Prompt [32] can be automatically generated by the LLM in an interactive manner.

Specifically, we leverage off-the-shelf T2I models that take previous images as additional input to ensure consistent multi-turn generation. To better ensure image quality, we introduce a hierarchical control strategy that utilizes different models for different levels of content change. For small content changes that can be described in a few words, such as changing styles, word weighting, and simple object manipulation, we adopt Prompt-to-Prompt [32] and MasaCtrl [8]. We utilize IP-Adapter [60] for large content changes, as it is more flexible with respect to the input textual prompts.
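One way this hierarchical control could be organized is sketched below. The edit-magnitude heuristic and the backend wrappers are illustrative assumptions (the paper does not specify its exact switching criterion); `backends` stands for thin wrappers around Prompt-to-Prompt/MasaCtrl-style editors and an IP-Adapter-style image-conditioned generator.

    def classify_edit_magnitude(old_desc: str, new_desc: str) -> str:
        """Crude heuristic: a few changed words counts as a "small" edit, otherwise "large"."""
        old_words, new_words = set(old_desc.lower().split()), set(new_desc.lower().split())
        changed = len(old_words ^ new_words)  # symmetric difference of word sets
        return "small" if changed <= 4 else "large"

    def consistent_edit(old_desc, new_desc, previous_image, backends):
        """Hierarchical content-consistency control.

        `backends` is a dict of callables, e.g.
        {"prompt2prompt": ..., "masactrl": ..., "ip_adapter": ...},
        each taking (description, previous_image) and returning a new image.
        """
        if classify_edit_magnitude(old_desc, new_desc) == "small":
            # Style tweaks, word re-weighting, simple object manipulation:
            # attention-based editors preserve layout and identity well.
            return backends["prompt2prompt"](new_desc, previous_image)
        # Large content changes: condition on the previous image via an
        # image-prompt adapter, which is more flexible w.r.t. the new text.
        return backends["ip_adapter"](new_desc, previous_image)

The design choice here is that identity is preserved by the choice of backend rather than by the LLM, so the same prompting scheme works regardless of which consistency mechanism is available.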
4.3. Composability

As we have not modified the original LLM, our system natively supports composing question answering and image generation in an interleaved manner.

5. Evaluation

Will prompting harm the inherent abilities of the LLM? We provide a preliminary evaluation of whether the iT2I prompt harms the inherent abilities of the LLM. As previously shown in Fig. 1, our prompting technique does not cause severe degradation of the LLM's abilities; we can still ask the LLM for either question answering or code generation as before. To further investigate the impact of the iT2I prompt, we perform an ablation study on five subtasks of MMLU [19], comparing the model with and without the iT2I prompt. The results are provided in Tab. 1; it can be observed that the iT2I prompt brings only minor degradations.
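For reference, an ablation of this kind can be scripted roughly as below, assuming the Hugging Face `datasets` copy of MMLU (the "cais/mmlu" identifier) and the `client`/`SYSTEM_PROMPT` from the earlier sketches. The paper itself uses the chain-of-thought-hub script [14], which differs in prompting details, so this is only a simplified illustration of the with/without-prompt comparison.

    from datasets import load_dataset

    def mmlu_accuracy(subject: str, use_it2i_prompt: bool) -> float:
        """Score multiple-choice accuracy on one MMLU subject, with or without the iT2I system prompt."""
        data = load_dataset("cais/mmlu", subject, split="test")
        system = [{"role": "system", "content": SYSTEM_PROMPT}] if use_it2i_prompt else []
        correct = 0
        for row in data:
            options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", row["choices"]))
            question = (f"{row['question']}\n{options}\n"
                        "Answer with a single letter (A, B, C, or D).")
            reply = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=system + [{"role": "user", "content": question}],
            ).choices[0].message.content.strip()
            if reply[:1].upper() == "ABCD"[row["answer"]]:
                correct += 1
        return 100.0 * correct / len(data)

    # Example: compare one subject with and without the iT2I prompt.
    # print(mmlu_accuracy("abstract_algebra", False), mmlu_accuracy("abstract_algebra", True))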
[Figure 7: eight chat transcripts in which GPT-3.5, GPT-4, LLAMA 2-13B Chat, Claude, ChatGLM2-6B Chat, Baichuan2-13B Chat, Qwen-14B Chat, and InternLM-20B Chat are each asked (in English or Chinese) what a scene from the novel "Hard-Boiled Wonderland and the End of the World" might look like. The commercial models reply with coherent interleaved text and <image> descriptions, while the open-source models range from usable generations to refusals or descriptions that do not match the generated image.]

Figure 7. Qualitative comparison of interactive text-to-image generation by prompting different LLMs.
[Figure 8: four example sessions covering Interactive Image Generation (a Spring Festival illustration followed by a firecracker scene), Storytelling (illustrating the lyrics of the song "Sleepyhead" part by part and summarizing the story with interleaved images), Concept Prototyping (brainstorming and iterating on a retro spaceship design), and Interactive Logo Design (a flat, orange logo for an app called Mini-DALLE3). The first three use Stable Diffusion XL as the T2I model; the last uses DALL·E 3.]

Figure 8. Example use cases of interactive text-to-image generation.
Interactive Image Generation: (a) "A vibrant illustration of a traditional Spring Festival scene. The street is filled with colorful decorations, lanterns, and people celebrating. Lion dancers and dragon parades add to the festive atmosphere, while families gather to enjoy delicious food together. The scene is filled with joy and excitement, capturing the spirit of the Spring Festival." (b) "In this illustration, a group of people is gathered in an open area, setting off firecrackers to celebrate the Spring Festival. The sky is filled with colorful bursts of light and smoke as the firecrackers explode, creating a festive and energetic atmosphere. The people are wearing traditional clothing and have joyful expressions on their faces as they enjoy the tradition of setting off firecrackers to ward off evil spirits and bring good luck for the coming year."

Storytelling: (a) "A beautiful girl with flowing hair, dressed in regal attire, stands tall in a grand castle. Her royal lineage is evident in her graceful demeanor. The ancient stories speak of her unmatched beauty, making her the fairest of them all." (b) "Inside a majestic castle made of stone, the girl sleeps peacefully in her grand bed. Despite any noise that could wake the dead, she remains undisturbed in her slumber, earning her the nickname 'Sleepyhead.'"

Concept Prototyping: (a) "A Retro Spaceship with a bulbous shape, fins, and a silver and red color scheme. It has a streamlined design and a nostalgic feel, reminiscent of classic sci-fi movies. The spaceship is adorned with retro-style decals and has large windows to provide a view of the cosmos. It exudes a sense of adventure and exploration as it cruises through the vastness of space." (b) "A Retro Spaceship with a bulbous shape and a silver and red color scheme. It has a streamlined design reminiscent of classic sci-fi movies. The spaceship features a rounded cockpit with large windows, allowing the pilot to have a panoramic view of space. The exterior is adorned with retro-style decals and has a polished metallic finish. It exudes a sense of nostalgia and adventure as it travels through the cosmos."

Interactive Logo Design: (a) "A minimalist logo with the text 'Mini DALLE 3' written in a clean and modern font. The letters are arranged in a straight line, with each letter having equal spacing. The color palette consists of vibrant shades of blue and green, representing creativity and innovation." (b) "A minimalist logo with the text 'Mini DALLE 3' written in a clean and modern font. The letters are arranged in the same line, with each letter having equal spacing. The color palette consists of vibrant shades of green and orange, representing creativity and innovation. The logo is placed on a white background, creating a clean and sleek look."

Table 2. Generated captions of the examples in Figure 8.

Comparison of different LLMs. We evaluate our approach with different LLMs, including the commercial services OpenAI GPT-3.5 [7], GPT-4 [36], and Claude (https://claude.ai), and the open-source LLAMA2-13B-Chat [48], Baichuan2-13B-Chat [56], ChatGLM2-6B-Chat [10], Qwen-14B-Chat [3], and InternLM-20B-Chat [46]. As shown in Fig. 7, all commercial LLMs successfully generate the images with appropriate corresponding (interleaved) text responses. This indicates that our prompting approach can be a simple yet effective way to rapidly augment existing LLMs with the iT2I ability. Nevertheless, the results are less satisfactory for the open-source LLMs. Overall, Baichuan2 [56] generates the best results, while Qwen and InternLM tend to refuse to generate images even when prompted to do so. ChatGLM2 can generate an image, but the correspondence is incorrect.

iT2I Examples. Here, we show a number of iT2I examples, covering different use scenarios from single-turn/multi-turn image generation to interleaved text-image storytelling. The results are shown in Fig. 8 and Tab. 2.

6. Conclusion

In conclusion, this paper introduces the concept of interactive text-to-image (iT2I) and presents an approach to augmenting existing large language models for this task. Our evaluation shows that this approach enables convenient iT2I capabilities without significant degradation of the models' inherent capabilities. This work has the potential to enhance user experiences in human-machine interactions and elevate the image quality of next-generation T2I models, offering promising directions for future research and development.
References

[1] Namhyuk Ahn, Junsoo Lee, Chunggi Lee, Kunhee Kim, Daesik Kim, Seung-Hun Nam, and Kibeom Hong. DreamStyler: Paint by style inversion with text-to-image diffusion models. arXiv preprint arXiv:2309.06933, 2023.
[2] AUTOMATIC1111. Stable Diffusion Web UI, 2022.
[3] Jinze Bai, Shuai Bai, Yunfei Chu, et al. Qwen technical report, 2023.
[4] Yogesh Balaji, Seungjun Nah, Xun Huang, et al. eDiffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
[5] Maciej Besta, Nils Blach, Ales Kubicek, et al. Graph of Thoughts: Solving elaborate problems with large language models, 2023.
[6] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In CVPR, pages 18392-18402, 2023.
[7] Tom Brown, Benjamin Mann, Nick Ryder, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.
[8] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing, 2023.
[9] Runpei Dong, Chunrui Han, Yuang Peng, et al. DreamLLM: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023.
[10] Zhengxiao Du, Yujie Qian, Xiao Liu, et al. GLM: General language model pretraining with autoregressive blank infilling. In ACL (Volume 1: Long Papers), pages 320-335, 2022.
[11] Yuki Endo. User-controllable latent transformer for StyleGAN image layout editing. Computer Graphics Forum, 41(7):395-406, 2022.
[12] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, pages 12873-12883, 2021.
[13] Zhida Feng, Zhenyu Zhang, Xintong Yu, et al. ERNIE-ViLG 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts. In CVPR, pages 10135-10145, 2023.
[14] Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, and Tushar Khot. Chain-of-thought hub: A continuous effort to measure large language models' reasoning performance. arXiv preprint arXiv:2305.17306, 2023.
[15] Songwei Ge, Taesung Park, Jun-Yan Zhu, and Jia-Bin Huang. Expressive text-to-image generation with rich text. In ICCV, 2023.
[16] Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, and Ying Shan. Planting a seed of vision in large language model. arXiv preprint arXiv:2307.08041, 2023.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, et al. Generative adversarial networks. Communications of the ACM, 63(11):139-144, 2020.
[18] Shuyang Gu, Dong Chen, Jianmin Bao, et al. Vector quantized diffusion model for text-to-image synthesis. In CVPR, pages 10696-10706, 2022.
[19] Dan Hendrycks, Collin Burns, Steven Basart, et al. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
[20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.
[21] Minguk Kang, Jun-Yan Zhu, Richard Zhang, et al. Scaling up GANs for text-to-image synthesis. In CVPR, pages 10124-10134, 2023.
[22] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[23] Alexander Kirillov, Eric Mintun, Nikhila Ravi, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
[24] Jing Yu Koh, Daniel Fried, and Ruslan Salakhutdinov. Generating images with multimodal language models. arXiv preprint arXiv:2305.17216, 2023.
[25] Dongxu Li, Junnan Li, and Steven C. H. Hoi. BLIP-Diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. arXiv preprint arXiv:2305.14720, 2023.
[26] Tianle Li, Max Ku, Cong Wei, and Wenhu Chen. DreamEdit: Subject-driven image editing. arXiv preprint arXiv:2306.12624, 2023.
[27] Yuheng Li, Haotian Liu, Qingyang Wu, et al. GLIGEN: Open-set grounded text-to-image generation. In CVPR, pages 22511-22521, 2023.
[28] Zhengqi Li, Richard Tucker, Noah Snavely, and Aleksander Holynski. Generative image dynamics. arXiv preprint arXiv:2309.07906, 2023.
[29] Pengyang Ling, Lin Chen, Pan Zhang, Huaian Chen, and Yi Jin. FreeDrag: Point tracking is not you need for interactive point-based image editing. arXiv preprint arXiv:2307.04684, 2023.
[30] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
[31] Zhaoyang Liu, Yinan He, Wenhai Wang, et al. InternChat: Solving vision-centric tasks by interacting with chatbots beyond language. arXiv preprint arXiv:2305.05662, 2023.
[32] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. arXiv preprint arXiv:2211.09794, 2022.
[33] Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, and Jian Zhang. DragonDiffusion: Enabling drag-style manipulation on diffusion models. arXiv preprint arXiv:2307.02421, 2023.
[34] Chong Mou, Xintao Wang, Liangbin Xie, et al. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
[35] OpenAI. DALL·E 3 system card, 2023.
[36] OpenAI. GPT-4 technical report, 2023.
[37] Xingang Pan, Ayush Tewari, Thomas Leimkühler, et al. Drag your GAN: Interactive point-based manipulation on the generative image manifold. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1-11, 2023.
[38] Alec Radford, Jong Wook Kim, Chris Hallacy, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748-8763. PMLR, 2021.
[39] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, et al. Zero-shot text-to-image generation. In ICML, pages 8821-8831. PMLR, 2021.
[40] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[41] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684-10695, 2022.
[42] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation, 2022.
[43] Chitwan Saharia, William Chan, Saurabh Saxena, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
[44] Yujun Shi, Chuhui Xue, Jiachun Pan, Wenqing Zhang, Vincent Y. F. Tan, and Song Bai. DragDiffusion: Harnessing diffusion models for interactive point-based image editing. arXiv preprint arXiv:2306.14435, 2023.
[45] Kihyuk Sohn, Nataniel Ruiz, Kimin Lee, et al. StyleDrop: Text-to-image generation in any style. arXiv preprint arXiv:2306.00983, 2023.
[46] InternLM Team. InternLM: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023.
[47] Haileleol Tibebu, Aadin Malik, and Varuna De Silva. Text to image synthesis using stacked conditional variational autoencoders and conditional generative adversarial networks. In Science and Information Conference, pages 560-580. Springer, 2022.
[48] Hugo Touvron, Louis Martin, Kevin Stone, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[49] Qian Wang, Biao Zhang, Michael Birsak, and Peter Wonka. InstructEdit: Improving automatic masks for diffusion-based image editing with user instructions. arXiv preprint arXiv:2305.18047, 2023.
[50] Wenhai Wang, Zhe Chen, Xiaokang Chen, et al. VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175, 2023.
[51] Xuezhi Wang, Jason Wei, Dale Schuurmans, et al. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
[52] Jason Wei, Xuezhi Wang, Dale Schuurmans, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824-24837, 2022.
[53] Jingxuan Wei, Shiyu Wu, Xin Jiang, and Yequan Wang. DialogPaint: A dialog-based image editing model. arXiv preprint arXiv:2303.10073, 2023.
[54] Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. NExT-GPT: Any-to-any multimodal LLM, 2023.
[55] Zeyue Xue, Guanglu Song, Qiushan Guo, et al. RAPHAEL: Text-to-image generation via large mixture of diffusion paths. arXiv preprint arXiv:2305.18295, 2023.
[56] Aiyuan Yang, Bin Xiao, Bingning Wang, et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023.
[57] Binxin Yang, Shuyang Gu, Bo Zhang, et al. Paint by example: Exemplar-based image editing with diffusion models. In CVPR, pages 18381-18391, 2023.
[58] Shunyu Yao, Dian Yu, Jeffrey Zhao, et al. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
[59] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
[60] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models, 2023.
[61] Ahmet Burak Yildirim, Vedat Baday, Erkut Erdem, Aykut Erdem, and Aysegul Dundar. Inst-Inpaint: Instructing to remove objects with diffusion models. arXiv preprint arXiv:2304.03246, 2023.
[62] Shengming Yin, Chenfei Wu, Jian Liang, et al. DragNUWA: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089, 2023.
[63] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. MagicBrush: A manually annotated dataset for instruction-guided image editing, 2023.
[64] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
[65] Shu Zhang, Xinyi Yang, Yihao Feng, et al. HIVE: Harnessing human feedback for instructional visual editing. arXiv preprint arXiv:2303.09618, 2023.
