
Mini DALL·E 3: Interactive Text to Image by Prompting Large Language Models

Zeqiang Lai 1, Xizhou Zhu 2,3, Jifeng Dai 2,3, Yu Qiao 2, Wenhai Wang 4
1 Beijing Institute of Technology  2 Shanghai AI Laboratory  3 Tsinghua University  4 CUHK
https://minidalle3.github.io/

arXiv:2310.07653v2 [cs.AI] 12 Oct 2023

[Figure 1: two multi-turn conversations with Mini DALL·E 3, in which the user asks "Hello! What are you?", "Can you generate a cute dog?", "Could you make some sticker?", "Can you make it run near the river?", "What would this dog's house look like?", "Don't be a sticker", "Don't be a 3D model!", and "Where is the dog?", and the agent replies with interleaved text and images.]

Figure 1. Examples of two interactive text-to-image conversations produced by Mini DALL·E 3. In these cases, people can ask the agent to generate images via natural language and request an edit if the results are unsatisfactory. The generation and editing can be completed in a multi-turn dialog with recognition of the conversation context.

Abstract

The revolution of artificial intelligence content generation has been rapidly accelerated by the booming text-to-image (T2I) diffusion models. Within just two years of development, state-of-the-art models can generate images of unprecedented quality, diversity, and creativity. However, a prevalent limitation persists in communicating effectively with these popular T2I models, such as Stable Diffusion, using natural language descriptions, which typically makes an engaging image hard to obtain without expertise in prompt engineering with complex word compositions, magic tags, and annotations. Inspired by the recently released DALL·E 3, a T2I model built directly into ChatGPT that speaks human language, we revisit existing T2I systems that endeavor to align with human intent and introduce a new task, interactive text to image (iT2I), where people can interact with an LLM for interleaved high-quality image generation/editing/refinement and question answering with stronger image-text correspondence, using natural language. To address the iT2I problem, we present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models. We evaluate our approach for iT2I in a variety of commonly used scenarios under different LLMs, e.g., ChatGPT, LLAMA, Baichuan, and InternLM. We demonstrate that our approach can be a convenient and low-cost way to introduce the iT2I ability to any existing LLM and any text-to-image model without any training, while bringing little degradation of the LLMs' inherent capabilities in, e.g., question answering and code generation. We hope this work can draw broader attention and provide inspiration for boosting user experience in human-machine interactions, alongside the image quality, of next-generation T2I systems.

Preliminary version. Work in Progress.

Figure 2. The evolution of image generation systems: 1) unconditioned (StyleGAN); 2) class conditioned (CVAE, cGAN); 3) text conditioned (Stable Diffusion, GigaGAN, DALL·E 2, ...); 4) tab/node/click-based UX (Stable Diffusion WebUI, ComfyUI, InvokeAI, Midjourney); 5-I) point/drag-based (DragGAN, DragDiffusion); 5-II) interactive T2I (DALL·E 3, Mini DALL·E 3).

1. Introduction

The evolution of artificial intelligence content generation has been significantly accelerated by the proliferation of text-to-image (T2I) diffusion models [18, 20, 41, 43]. Within just two years of rapid development since 2021, state-of-the-art T2I models [4, 13, 39-41, 43, 55] can generate images of unprecedented quality, diversity, and creativity. For the first time, "talk to paint" is no longer a daydream, and complex surrealistic art can be generated via textual descriptions, with stronger expressive ability than previous unconditional and class-conditional image generation systems, as shown in Fig. 2.

However, most of the existing T2I models, such as Stable Diffusion [41], are still limited in understanding natural language. In other words, people have to learn to write complex text prompts to obtain the best results; such prompts fit the underlying models but are not necessarily user-friendly or straightforward for humans, as illustrated by Fig. 3. As a result, an engaging image is typically hard to obtain without expertise in prompt engineering, with proper word compositions and sometimes odd phrase organizations. Besides, there are dozens of textual and numerical configurations in a diffusion-based T2I pipeline, such as the CFG scale, word weighting, negative prompts, and style keywords, which are also complicated for non-professional users.

To make it easier for users to utilize T2I models, Stable Diffusion (SD) WebUI [2] was first created to provide a user-friendly web UI for accessing the latest techniques without any coding. However, a typical workflow for generating a satisfactory image usually involves several stages, e.g., generation, variation, and super-resolution, which makes the tab-based interface of SD-WebUI somewhat awkward to use. Therefore, ComfyUI (https://github.com/comfyanonymous/ComfyUI) was designed around a graph/nodes interface that connects different stages via nodes and edges, making workflows clearer. Nevertheless, these software tools still do not solve the problem of the complicated configurations required for a charming image. This urged the development of Fooocus (https://github.com/lllyasviel/Fooocus), a tool with many built-in optimizations and quality improvements. Fooocus frees users from complex parameter tuning, but it still requires them to write a proper and precise text prompt for the desired images. This can be challenging in some cases, such as when the required scenes are artistic conceptions rather than specific objects, or when users have no idea how to describe what they want to generate.

Generally, it might be difficult for users to come up with the right prompts and configurations at once, but it is much easier to tell what they want or do not want via natural language if the first version is unsatisfactory, e.g., "Don't be a sticker" and "Where is the dog?", as shown in Fig. 1. Moreover, it would be more straightforward to hold a multi-turn conversation with T2I models to iterate on the images over and over again, mimicking the communication process between human designers and their customers. These analyses reveal a promising direction for building the next generation of T2I systems with a new human-machine interface based on natural language: a system that is able to infer users' intentions and automatically generate the proper text prompts by leveraging the reasoning abilities of large language models (LLMs). This is not only because natural language is the easiest interface that everyone can master, but also because it frees users from brainstorming sophisticated textual descriptions and requires only simple instructions instead (see Fig. 3 for more illustrations).

Inspired by the recently released demo of DALL·E 3 [35], a powerful T2I model built directly into ChatGPT that understands human language, we revisit existing techniques aimed at aligning human intent in image generation and introduce a new task called interactive text to image (iT2I). This task is characterized by several aspects: 1) Multi-Turn: users are allowed to chat with the system (typically powered by LLMs) to progressively specify requirements, shortcomings, and suggestions for the expected/generated images; 2) Consistency: the ability to keep identity for consistent multi-turn image editing, series character creation, etc.; 3) Composability: the ability to be composed with/built into existing chat assistants for interleaved image generation and (visual) question answering for a seamless user experience.
Figure 3. Illustrations of different human-machine interfaces for T2I systems. (1) Description with the dialect of Stable Diffusion. Prompt: "Beatrix Potter style watercolor. By Henry Cavill, Chibi style, cartoonish, they are in a rural school, landscape of pastel colors. <lora:xl_more_art-full_v1:0.5>"; Negative Prompt: "<lora:badhands:1> ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, extra limbs, body out of frame, blurry, bad anatomy, blurred, watermark, grainy, signature, cut off, draft, closed eyes, text, logo". (2) Description with human/natural language: "A cartoonish rabbit walking on a rural road to school, in the style of watercolor." (3) Instruction with human/natural language: #1 "I want to draw a student rabbit." #2 "He is walking on a rural road." #3 "The style should be cartoon and watercolor."

All these properties make iT2I systems powerful tools for a wide range of applications, from content generation and design to interactive storytelling and more.

As an initial solution to this problem, we propose a simple yet effective approach that enhances language models for iT2I using prompting techniques and pretrained text-to-image models. Specifically, we prompt the LLM, instructing it to express an image as an intermediate textual description enclosed by special tags. After detecting the special tags, the description is parsed and transformed through a prompt refinement module. Then, a pretrained T2I model is employed to generate the image. We evaluate our approach across various common use cases and different language models, such as ChatGPT [7, 36], LLAMA [48], Baichuan [56], and InternLM [46]. Our results demonstrate that our approach can easily enable iT2I capabilities in any existing language model and text-to-image model without the need for additional training. Furthermore, it has minimal impact on the language models' inherent abilities in question answering and code generation.

We hope this work can draw broader attention and provide inspiration for boosting user experience in human-machine interactions alongside the image quality of next-generation T2I models.

2. Related Works

Text-to-Image Generation. Text-to-image (T2I) generation is a widely explored research area at the intersection of computer vision and natural language processing. Notable approaches include generative models, such as Variational Autoencoders (VAE) [22, 47], Generative Adversarial Networks (GAN) [17, 21], and autoregressive models [12], which enable image synthesis guided by textual descriptions. Recent multimodal models like CLIP [38] and DALL·E [39] have further improved the alignment between text and generated images, while the birth and development of diffusion models [4, 13, 40, 41, 43, 55] have pushed the boundaries of text-image interactions.

Image Generation Interface. There are a variety of approaches for image generation and editing, each with its own merits and drawbacks. The most straightforward ones are text-based approaches, where people write text prompts for either image generation [40, 41] or image editing [6, 61]. Image-based approaches are also popular: people either provide a reference image and ask the T2I model to generate image variations [40, 59], provide edge/depth maps to control the image layout [27, 34, 64], perform image translation with a style image [1, 45], or ask for images of a given subject [25, 57]. To facilitate precise control, point-based approaches [31, 49] are widely adopted by utilizing state-of-the-art localization methods [23, 30]. Recently, drag-based approaches [11, 28, 29, 33, 37, 44, 62] have also been proposed for a more interactive experience. As for UX design, there are Rich-T2I [15] and DialogPaint [53], which share a similar spirit with ours. In the literature on integrating T2I into LLMs, there are NExT-GPT [54], GILL [24], DreamLLM [9], and SEED [16]. Although these methods also provide the capability for interleaved text-image generation, most of them are not specifically designed for iT2I and are limited in image quality and multi-turn correspondence.

Prompting LLMs. The in-context learning capability [7] is one of the strongest advantages of LLMs. It enables users to freely customize an LLM for a particular task or enhance its capabilities by simple prompting. For example, chain-of-thought [52] is the first prompting technique that enhances LLMs by asking them to generate a series of intermediate reasoning steps. Since then, a number of improved prompting techniques have leveraged the heuristics of majority voting [51], backtracking [58], and graph of thoughts [5]. In this work, we also provide an approach to augment LLMs with the iT2I ability via prompting, as it can be rapidly applied to any existing LLM without any training.
[Figure 4: a multi-turn conversation about a "superman cat" named Bob illustrating six interaction types: 1. Generation, 2. Referring Generation, 3. Selecting, 4. Editing, 5. Refinement, 6. Question Answering.]

Figure 4. Illustration of 6 types of interactions in the interactive text-to-image workflow.

3. Interactive Text to Image

Interactive Text to Image (iT2I) aims to provide a user-friendly approach to generating images that meet user requirements in an interactive manner. Users can instantiate a multi-turn dialogue between humans and AI agents, where they can communicate requirements, shortcomings, and suggestions for the generated or expected images in natural language.

3.1. Problem Definition

Precisely, the iT2I problem can be defined as the task of generating images from textual descriptions such that the generated images closely align with the provided text, ensuring that the generated visual content accurately represents the textual information. There are some notable properties of iT2I systems:

Multi-Turn refers to the ability of the system to engage in a dynamic and iterative dialogue with the user. Unlike traditional text-to-image systems that may generate a single image from a static textual input, multi-turn iT2I systems can accept multiple rounds of textual input, enabling users to refine and specify their visual requirements through an ongoing conversation. This property enhances the user experience and allows for more fine-grained control over the generated images.

Consistency means that these systems can automatically determine whether they should take into account not only the current textual input but also the previous visual context. It involves persisting the visual identity of images across different rounds of generation. This capability enables iT2I systems to perform consistent multi-turn image editing/refinement, produce personalized and contextually relevant objects/characters, etc.

Composability relates to the ability to combine or integrate image generation with other tasks. This means that the image generation ability should be modular and compatible with the inherent abilities of LLMs, allowing users to seamlessly compose them into interleaved conversations that query both textual and visual content.

3.2. Types of Instruction

As shown in Fig. 4, various instructions can be found in an iT2I system, such as generation, editing, selecting, and refinement. Different instructions can have varying levels of complexity when it comes to interpretation. Some instructions can be effectively addressed by leveraging the capabilities of an LLM, such as selecting, which primarily involves textual decision-making. However, certain instructions may necessitate a deeper synergy between the LLM and the T2I models.

Generation refers to the process of generating entirely new images based on a given textual description. In this context, the iT2I system creates images or illustrations from scratch, attempting to capture the essence and details of the provided textual input. It essentially transforms queries into neural representations or prompts for T2I models.

Referring generation is another variant of generation, where the system generates images that refer to, or are inspired by, existing objects, scenes, or concepts mentioned in the textual input and appearing in the context.
Selecting is a relatively straightforward instruction that involves choosing or picking from a set of pre-existing or freshly generated images based on the textual input.

Editing performs the task of modifying or refining existing images in response to textual instructions. This may involve altering specific attributes of an image, enhancing or diminishing certain features, or adapting the image to match the requirements outlined in the instruction.

Refinement means further enhancing or optimizing an existing image to better align with the textual description. While editing involves making specific modifications, refinement often involves fine-tuning the visual output to achieve a higher level of detail, realism, or accuracy in accordance with the provided textual guidance.

Question Answering is the inherent ability of LLMs. An iT2I system should preserve this ability as much as possible, as it is crucial for providing a coherent experience that interleaves images and text for users.

3.3. Discussion

In the literature on image editing and multi-modal LLMs, a number of works are closely related to iT2I. Most of these related works provide interactive interfaces. For example, InstructPix2Pix [6] and its follow-up works [63, 65] can be repeatedly applied to a single image to achieve multi-turn image editing. However, these interactive multi-turn abilities only apply to image editing instructions. There are also multi-modal LLMs [9, 16, 24, 54] that can generate responses with interleaved text and images, but most of them focus more on (visual) question answering with multi-modal responses rather than interactive image generation. The key vision of iT2I is to build a chat-based system that can respond to all image generation/editing instructions in a multi-turn, consistent, and composable manner. This is the major difference between iT2I and all previous works/tasks.

[Figure 5: an interactive chat ("Can you generate a cute dog?", "Good! Can you make it run near the river?") is answered by the LLM with "Sure, here is a dog for you: <image> a cute dog </image>" and "Sure, here it is: <edit> a cute dog running near the river </edit>" (or, alternatively, with [IMG1] [IMG2] [IMG3] ... [IMGL] embeddings); a router extracts the prompt/embedding and passes it through an adapter to the T2I model.]

Figure 5. Pipeline Overview. Mini-DALLE3 consists of two stages, with 1) a router that analyzes the response from the prompted/finetuned LLM and dispatches the demand for image generation if needed, and 2) an adapter that transforms the image embedding or descriptions for subsequent T2I models.

4. Mini-DALLE3

In this section, we depict a blueprint of an iT2I system, which we refer to as Mini-DALLE3. The overall architecture of Mini-DALLE3 is illustrated in Fig. 5, and it comprises several key components: an LLM, a router, an adapter, and T2I models. The LLM can be an existing text-only LLM, such as ChatGPT [36] or LLaMA [48], or a multi-modal LLM [50]. It is responsible for analyzing user intentions and producing the proper outputs as text or neural representations. The router automatically dispatches the parsed image representations (if any exist in the LLM output) to the image generation module. The adapter transforms the image representations to better fit the back-end T2I models. Depending on the type of image representation, the adapter can be a neural network, if the image representations are neural embeddings, or a prompt refinement module built from handcrafted rules or an LLM. Next, we illustrate a simple yet effective instantiation of the Mini-DALLE3 architecture by prompting large language models.
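To make the routing step concrete, the sketch below shows one possible way a router could detect the special <image>/<edit> tags in an LLM response and dispatch them to a text-based T2I backend. This is a minimal illustration under stated assumptions, not the authors' released implementation; the backend interface (a `t2i_backend` object with hypothetical `generate` and `edit` methods) is a placeholder for whatever T2I model is plugged in.

    import re

    # Matches the special tags the prompted LLM is asked to emit, e.g.
    # "<image> a cute dog </image>" or "<edit> a cute dog running near the river </edit>".
    TAG_PATTERN = re.compile(r"<(image|edit)>(.*?)</\1>", re.DOTALL)

    def route(llm_response: str, t2i_backend, previous_image=None):
        """Split an LLM response into text segments and image-generation calls.

        Returns a list of ("text", str) and ("image", image) chunks so that a chat
        front-end can render interleaved text and images.
        """
        chunks, cursor = [], 0
        for match in TAG_PATTERN.finditer(llm_response):
            # Keep the plain text that precedes the tag.
            if match.start() > cursor:
                chunks.append(("text", llm_response[cursor:match.start()]))
            kind, description = match.group(1), match.group(2).strip()
            if kind == "image" or previous_image is None:
                image = t2i_backend.generate(description)              # fresh generation
            else:
                image = t2i_backend.edit(description, previous_image)  # consistent edit
            previous_image = image
            chunks.append(("image", image))
            cursor = match.end()
        if cursor < len(llm_response):
            chunks.append(("text", llm_response[cursor:]))
        return chunks

In this reading, the router is purely textual, so it can be placed in front of any chat LLM without retraining; only the adapter/backend needs to know about the concrete T2I model.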
4.1. Multi-Turn Interaction by Prompting LLM

Multi-turn interaction lies at the heart of interactive text-to-image. It requires integrating the textual/visual context and understanding instructional, rather than descriptive, messages. To address this, we propose to leverage the strong context-understanding ability of LLMs by prompting them to pretend to generate images via textual descriptions. This intermediate textual description not only provides stronger flexibility to augment the system's capabilities with plug-and-play modules, such as prompt variation/refinement, but also enables us to utilize numerous pretrained LLMs and T2I models without heavy finetuning.

Image Generation as Function Call. Specifically, we utilize the few-shot prompt shown in Fig. 6 to transform the problem of multi-turn image generation into a problem of multi-turn textual description generation. Our prompt entails several key steps. Initially, we define the LM's role and explicitly convey to it that it possesses the ability to generate images. Subsequently, we request the LM to produce images by generating descriptive text enclosed within <image> tags. If the generated image exhibits a high degree of correlation with previous ones, the LM is instructed to generate "edit" rather than "image". Finally, we provide a small number of few-shot examples to further guide the LM's responses. Leveraging the robust in-context learning capabilities inherent in advanced LLMs, we observe that this approach yields favorable outcomes. The LM successfully generates images accompanied by coherent textual responses, as illustrated in Fig. 1. Importantly, these capabilities can be harnessed without the need for specialized training and can be swiftly integrated into existing LLMs.
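As a rough sketch of how "image generation as a function call" can be wired up with an off-the-shelf chat API, the snippet below prepends an abbreviated version of the Fig. 6 system prompt, runs one chat turn, and hands the tagged reply to the `route` helper sketched above. It assumes the OpenAI Python client (any of the other LLMs evaluated in the paper could be substituted) and abbreviates the system prompt; it is an illustration, not the exact released code.

    from openai import OpenAI

    # Abbreviated from the few-shot prompt in Fig. 6; the full prompt also contains
    # the two worked examples that demonstrate <image>/<edit> usage.
    SYSTEM_PROMPT = (
        "You are Mini-DALLE3, a smart AI assistant that can generate an image "
        "if the user requests one. Include the generated image by its description "
        "in the format \"<image> HERE IS THE DESCRIPTION </image>\". If the "
        "generated image is strongly correlated with a previous image, use the "
        "<edit> </edit> tag instead of <image> </image>."
    )

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    history = [{"role": "system", "content": SYSTEM_PROMPT}]

    def chat_turn(user_message: str, t2i_backend, previous_image=None):
        """One iT2I turn: query the LLM, then hand the tagged response to the router."""
        history.append({"role": "user", "content": user_message})
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo", messages=history
        ).choices[0].message.content
        history.append({"role": "assistant", "content": reply})
        # `route` (sketched earlier) parses <image>/<edit> tags and calls the T2I model.
        return route(reply, t2i_backend, previous_image)

Because the whole conversation history is sent back to the LLM on every turn, the model can resolve references such as "make it run near the river" against earlier descriptions without any extra machinery.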
    You are Mini-DALLE3, a smart AI assistant that can generate an image if the user requests one.
    You should include the generated image by its description with the format of "<image> HERE IS THE DESCRIPTION </image>".
    If the generated image is strongly correlated with a previous image, you should use the <edit> </edit> tag instead of <image> </image>.

    Here is an example

    User: My 5 year-old keeps talking about a "super-duper sunflower hedgehog" -- What does it look like?
    AI: It might look like this: <image> a super-duper sunflower hedgehog </image>
    User: My daughter says its name is Larry. Can I see more like this?
    AI: Sure, here it is <edit> a super-duper sunflower hedgehog </edit>
    User: She's going to love these! Can you show me Larry's house?
    AI: Of course, <edit> a super-duper sunflower hedgehog, standing in front of a house </edit>
    User: Larry is cute, what makes him so super-duper?
    AI: Larry is super-duper for a myriad of reasons! Here's why: 1. Sunflower Petals as Quills: Unlike other hedgehogs, Larry boasts sunflower petals as quills. This not only makes him stand out, but it also gives him a bright and cheerful demeanor. 2. Kind Hearted: Larry is known throughout the meadow for kindness.
    User: Awwww... can you show me Larry being "kind hearted"?
    AI: <edit> a super-duper sunflower hedgehog, kind hearted </edit>
    User: Can you design some stickers?
    AI: Sure, <edit> a sticker of a super-duper sunflower hedgehog </edit>

    Here is another example

    User: can you generate a dog?
    AI: Sure, <image> a cute dog </image>
    User: can you generate a cat?
    AI: Sure, <image> a cute cat </image>

Figure 6. The few-shot prompt for iT2I generation.

Task                      | GPT3.5-Turbo (Original) | GPT3.5-Turbo (Mini-DALLE3)
--------------------------|-------------------------|---------------------------
Abstract Algebra          | 42.42                   | 43.43
High School Physics       | 40.00                   | 38.67
Marketing                 | 88.41                   | 86.70
Philosophy                | 77.41                   | 70.65
College Computer Science  | 48.48                   | 42.42
Average                   | 59.34                   | 56.37

Table 1. Evaluation results of different models on the subtasks of MMLU, using the script from chain-of-thought-hub [14].
Prompt Refinement & Variations. Although we can generate textual descriptions that integrate contextual information by prompting LLMs, these descriptions might not be sufficient to generate high-quality images. Therefore, we propose to leverage another round of prompt refinement to transform the vanilla descriptions so that they better fit subsequent T2I models. It is worth noting that the refinement can also apply to embeddings if the intermediate representation is an embedding. In this instantiation, we perform the text transformation by prompting the LLM again with another few-shot prompt. Furthermore, we can perform prompt variation by repeatedly applying different prompt refinements, which is useful for responding to requests that ask for a list of images.
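A minimal sketch of such a refinement pass is given below, reusing the `client` from the earlier sketch. The refinement instruction and the style hints are illustrative placeholders rather than the paper's exact refinement prompt, which is not reproduced in this version of the document.

    REFINE_PROMPT = (
        "Rewrite the following image description into a detailed prompt for a "
        "text-to-image diffusion model. Keep the subject and context, and add "
        "concrete details about style, lighting, and composition."
    )

    def refine_description(description: str, style_hint: str = "") -> str:
        """Second LLM pass that turns a plain description into a richer T2I prompt."""
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": REFINE_PROMPT},
                {"role": "user", "content": f"{description}\n\nStyle hint: {style_hint}"},
            ],
        )
        return reply.choices[0].message.content

    def prompt_variations(description, hints=("watercolor", "flat vector", "photorealistic")):
        """Prompt variation: repeat the refinement with different hints to get several prompts."""
        return [refine_description(description, hint) for hint in hints]

Running `prompt_variations("a cute dog running near the river")` would then yield several differently styled prompts for the same scene, which is one way to serve a request for a list of candidate images.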
4.2. Hierarchical Content Consistency Control

Content consistency is another important aspect of an iT2I system. Although similar topics (subject-driven T2I, example-driven T2I, personalization, concept learning) are widely explored in the context of conventional T2I [26, 42, 57], only a few works explore multi-turn scenarios, and even fewer explore the integration of these abilities into a single unified system. Our decomposition makes it possible to utilize existing T2I models that were not designed for multi-turn scenarios. For example, the edited description for Prompt-to-Prompt [32] can be automatically generated by the LLM in an interactive manner.

Specifically, we leverage off-the-shelf T2I models that take previous images as additional input to ensure consistent multi-turn generation. To better ensure image quality, we introduce a hierarchical control strategy that utilizes different models for different levels of content change. For small content changes that can be described in a few words, such as changing styles, word weighting, and simple object manipulation, we adopt Prompt-to-Prompt [32] and MasaCtrl [8]. We utilize IP-Adapter [60] for large content changes, as it is more flexible with respect to the input textual prompts.
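One way this hierarchical control could be organized is sketched below. The edit-magnitude heuristic and the backend wrappers are illustrative assumptions (the paper does not specify its exact switching criterion); `backends` stands for thin wrappers around Prompt-to-Prompt/MasaCtrl-style editors and an IP-Adapter-style image-conditioned generator.

    def classify_edit_magnitude(old_desc: str, new_desc: str) -> str:
        """Crude heuristic: a few changed words counts as a "small" edit, otherwise "large"."""
        old_words, new_words = set(old_desc.lower().split()), set(new_desc.lower().split())
        changed = len(old_words ^ new_words)  # symmetric difference of word sets
        return "small" if changed <= 4 else "large"

    def consistent_edit(old_desc, new_desc, previous_image, backends):
        """Hierarchical content-consistency control.

        `backends` is a dict of callables, e.g.
        {"prompt2prompt": ..., "masactrl": ..., "ip_adapter": ...},
        each taking (description, previous_image) and returning a new image.
        """
        if classify_edit_magnitude(old_desc, new_desc) == "small":
            # Style tweaks, word re-weighting, simple object manipulation:
            # attention-based editors preserve layout and identity well.
            return backends["prompt2prompt"](new_desc, previous_image)
        # Large content changes: condition on the previous image via an
        # image-prompt adapter, which is more flexible w.r.t. the new text.
        return backends["ip_adapter"](new_desc, previous_image)

The design choice here is that identity is preserved by the choice of backend rather than by the LLM, so the same prompting scheme works regardless of which consistency mechanism is available.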
4.3. Composability

As we have not modified the original LLM, our system natively supports composing question answering and image generation in an interleaved manner.

5. Evaluation

Will prompting harm the inherent abilities of the LLM? We provide a preliminary evaluation of whether the iT2I prompt harms the inherent abilities of the LLM. As previously shown in Fig. 1, our prompting technique does not cause severe degradation of the LLM's abilities; we can still ask the LLM for either question answering or code generation as before. To further investigate the impact of the iT2I prompt, we perform an ablation study on five subtasks of MMLU [19], comparing the model with and without the iT2I prompt. The results are provided in Tab. 1; it can be observed that the iT2I prompt brings only minor degradations.
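For reference, an ablation of this kind can be scripted roughly as below, assuming the Hugging Face `datasets` copy of MMLU (the "cais/mmlu" identifier) and the `client`/`SYSTEM_PROMPT` from the earlier sketches. The paper itself uses the chain-of-thought-hub script [14], which differs in prompting details, so this is only a simplified illustration of the with/without-prompt comparison.

    from datasets import load_dataset

    def mmlu_accuracy(subject: str, use_it2i_prompt: bool) -> float:
        """Score multiple-choice accuracy on one MMLU subject, with or without the iT2I system prompt."""
        data = load_dataset("cais/mmlu", subject, split="test")
        system = [{"role": "system", "content": SYSTEM_PROMPT}] if use_it2i_prompt else []
        correct = 0
        for row in data:
            options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", row["choices"]))
            question = (f"{row['question']}\n{options}\n"
                        "Answer with a single letter (A, B, C, or D).")
            reply = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=system + [{"role": "user", "content": question}],
            ).choices[0].message.content.strip()
            if reply[:1].upper() == "ABCD"[row["answer"]]:
                correct += 1
        return 100.0 * correct / len(data)

    # Example: compare one subject with and without the iT2I prompt.
    # print(mmlu_accuracy("abstract_algebra", False), mmlu_accuracy("abstract_algebra", True))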
[Figure 7: eight chat transcripts in which GPT-3.5, GPT-4, LLAMA 2-13B Chat, Claude, ChatGLM2-6B Chat, Baichuan2-13B Chat, Qwen-14B Chat, and InternLM-20B Chat are each asked (in English or Chinese) what a scene from the novel "Hard-Boiled Wonderland and the End of the World" might look like. The commercial models reply with coherent interleaved text and <image> descriptions, while the open-source models range from usable generations to refusals or descriptions that do not match the generated image.]

Figure 7. Qualitative comparison of interactive text-to-image generation by prompting different LLMs.
[Figure 8: four example sessions covering Interactive Image Generation (a Spring Festival illustration followed by a firecracker scene), Storytelling (illustrating the lyrics of the song "Sleepyhead" part by part and summarizing the story with interleaved images), Concept Prototyping (brainstorming and iterating on a retro spaceship design), and Interactive Logo Design (a flat, orange logo for an app called Mini-DALLE3). The first three use Stable Diffusion XL as the T2I model; the last uses DALL·E 3.]

Figure 8. Example use cases of interactive text-to-image generation.
Interactive Image Generation: (a) "A vibrant illustration of a traditional Spring Festival scene. The street is filled with colorful decorations, lanterns, and people celebrating. Lion dancers and dragon parades add to the festive atmosphere, while families gather to enjoy delicious food together. The scene is filled with joy and excitement, capturing the spirit of the Spring Festival." (b) "In this illustration, a group of people is gathered in an open area, setting off firecrackers to celebrate the Spring Festival. The sky is filled with colorful bursts of light and smoke as the firecrackers explode, creating a festive and energetic atmosphere. The people are wearing traditional clothing and have joyful expressions on their faces as they enjoy the tradition of setting off firecrackers to ward off evil spirits and bring good luck for the coming year."

Storytelling: (a) "A beautiful girl with flowing hair, dressed in regal attire, stands tall in a grand castle. Her royal lineage is evident in her graceful demeanor. The ancient stories speak of her unmatched beauty, making her the fairest of them all." (b) "Inside a majestic castle made of stone, the girl sleeps peacefully in her grand bed. Despite any noise that could wake the dead, she remains undisturbed in her slumber, earning her the nickname 'Sleepyhead.'"

Concept Prototyping: (a) "A Retro Spaceship with a bulbous shape, fins, and a silver and red color scheme. It has a streamlined design and a nostalgic feel, reminiscent of classic sci-fi movies. The spaceship is adorned with retro-style decals and has large windows to provide a view of the cosmos. It exudes a sense of adventure and exploration as it cruises through the vastness of space." (b) "A Retro Spaceship with a bulbous shape and a silver and red color scheme. It has a streamlined design reminiscent of classic sci-fi movies. The spaceship features a rounded cockpit with large windows, allowing the pilot to have a panoramic view of space. The exterior is adorned with retro-style decals and has a polished metallic finish. It exudes a sense of nostalgia and adventure as it travels through the cosmos."

Interactive Logo Design: (a) "A minimalist logo with the text 'Mini DALLE 3' written in a clean and modern font. The letters are arranged in a straight line, with each letter having equal spacing. The color palette consists of vibrant shades of blue and green, representing creativity and innovation." (b) "A minimalist logo with the text 'Mini DALLE 3' written in a clean and modern font. The letters are arranged in the same line, with each letter having equal spacing. The color palette consists of vibrant shades of green and orange, representing creativity and innovation. The logo is placed on a white background, creating a clean and sleek look."

Table 2. Generated captions of the examples in Figure 8.

Comparison of different LLMs. We evaluate our approach with different LLMs, including the commercial services OpenAI GPT-3.5 [7], GPT-4 [36], and Claude (https://claude.ai), and the open-source LLAMA2-13B-Chat [48], Baichuan2-13B-Chat [56], ChatGLM2-6B-Chat [10], Qwen-14B-Chat [3], and InternLM-20B-Chat [46]. As shown in Fig. 7, all commercial LLMs successfully generate the images with appropriate corresponding (interleaved) text responses. This indicates that our prompting approach can be a simple yet effective way to rapidly augment existing LLMs with the iT2I ability. Nevertheless, the results are less satisfactory for the open-source LLMs. Overall, Baichuan2 [56] generates the best results, while Qwen and InternLM tend to refuse to generate images even when prompted to do so. ChatGLM2 can generate an image, but the correspondence is incorrect.

iT2I Examples. Here, we show a number of iT2I examples, covering different use scenarios from single-turn/multi-turn image generation to interleaved text-image storytelling. The results are shown in Fig. 8 and Tab. 2.

6. Conclusion

In conclusion, this paper introduces the concept of interactive text-to-image (iT2I) and presents an approach to augmenting existing large language models for this task. Our evaluation shows that this approach enables convenient iT2I capabilities without significant degradation of the models' inherent capabilities. This work has the potential to enhance user experiences in human-machine interactions and elevate the image quality of next-generation T2I models, offering promising directions for future research and development.
References

[1] Namhyuk Ahn, Junsoo Lee, Chunggi Lee, Kunhee Kim, Daesik Kim, Seung-Hun Nam, and Kibeom Hong. DreamStyler: Paint by style inversion with text-to-image diffusion models. arXiv preprint arXiv:2309.06933, 2023.
[2] AUTOMATIC1111. Stable Diffusion Web UI, 2022.
[3] Jinze Bai, Shuai Bai, Yunfei Chu, et al. Qwen technical report, 2023.
[4] Yogesh Balaji, Seungjun Nah, Xun Huang, et al. eDiffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
[5] Maciej Besta, Nils Blach, Ales Kubicek, et al. Graph of Thoughts: Solving elaborate problems with large language models, 2023.
[6] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In CVPR, pages 18392-18402, 2023.
[7] Tom Brown, Benjamin Mann, Nick Ryder, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.
[8] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. MasaCtrl: Tuning-free mutual self-attention control for consistent image synthesis and editing, 2023.
[9] Runpei Dong, Chunrui Han, Yuang Peng, et al. DreamLLM: Synergistic multimodal comprehension and creation. arXiv preprint arXiv:2309.11499, 2023.
[10] Zhengxiao Du, Yujie Qian, Xiao Liu, et al. GLM: General language model pretraining with autoregressive blank infilling. In ACL (Volume 1: Long Papers), pages 320-335, 2022.
[11] Yuki Endo. User-controllable latent transformer for StyleGAN image layout editing. Computer Graphics Forum, 41(7):395-406, 2022.
[12] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, pages 12873-12883, 2021.
[13] Zhida Feng, Zhenyu Zhang, Xintong Yu, et al. ERNIE-ViLG 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts. In CVPR, pages 10135-10145, 2023.
[14] Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, and Tushar Khot. Chain-of-thought hub: A continuous effort to measure large language models' reasoning performance. arXiv preprint arXiv:2305.17306, 2023.
[15] Songwei Ge, Taesung Park, Jun-Yan Zhu, and Jia-Bin Huang. Expressive text-to-image generation with rich text. In ICCV, 2023.
[16] Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, and Ying Shan. Planting a seed of vision in large language model. arXiv preprint arXiv:2307.08041, 2023.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, et al. Generative adversarial networks. Communications of the ACM, 63(11):139-144, 2020.
[18] Shuyang Gu, Dong Chen, Jianmin Bao, et al. Vector quantized diffusion model for text-to-image synthesis. In CVPR, pages 10696-10706, 2022.
[19] Dan Hendrycks, Collin Burns, Steven Basart, et al. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
[20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840-6851, 2020.
[21] Minguk Kang, Jun-Yan Zhu, Richard Zhang, et al. Scaling up GANs for text-to-image synthesis. In CVPR, pages 10124-10134, 2023.
[22] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[23] Alexander Kirillov, Eric Mintun, Nikhila Ravi, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
[24] Jing Yu Koh, Daniel Fried, and Ruslan Salakhutdinov. Generating images with multimodal language models. arXiv preprint arXiv:2305.17216, 2023.
[25] Dongxu Li, Junnan Li, and Steven C. H. Hoi. BLIP-Diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. arXiv preprint arXiv:2305.14720, 2023.
[26] Tianle Li, Max Ku, Cong Wei, and Wenhu Chen. DreamEdit: Subject-driven image editing. arXiv preprint arXiv:2306.12624, 2023.
[27] Yuheng Li, Haotian Liu, Qingyang Wu, et al. GLIGEN: Open-set grounded text-to-image generation. In CVPR, pages 22511-22521, 2023.
[28] Zhengqi Li, Richard Tucker, Noah Snavely, and Aleksander Holynski. Generative image dynamics. arXiv preprint arXiv:2309.07906, 2023.
[29] Pengyang Ling, Lin Chen, Pan Zhang, Huaian Chen, and Yi Jin. FreeDrag: Point tracking is not you need for interactive point-based image editing. arXiv preprint arXiv:2307.04684, 2023.
[30] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
[31] Zhaoyang Liu, Yinan He, Wenhai Wang, et al. InternChat: Solving vision-centric tasks by interacting with chatbots beyond language. arXiv preprint arXiv:2305.05662, 2023.
[32] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. arXiv preprint arXiv:2211.09794, 2022.
[33] Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, and Jian Zhang. DragonDiffusion: Enabling drag-style manipulation on diffusion models. arXiv preprint arXiv:2307.02421, 2023.
[34] Chong Mou, Xintao Wang, Liangbin Xie, et al. T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
[35] OpenAI. DALL·E 3 system card, 2023.
[36] OpenAI. GPT-4 technical report, 2023.
[37] Xingang Pan, Ayush Tewari, Thomas Leimkühler, et al. Drag your GAN: Interactive point-based manipulation on the generative image manifold. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1-11, 2023.
[38] Alec Radford, Jong Wook Kim, Chris Hallacy, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748-8763. PMLR, 2021.
[39] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, et al. Zero-shot text-to-image generation. In ICML, pages 8821-8831. PMLR, 2021.
[40] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[41] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684-10695, 2022.
[42] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation, 2022.
[43] Chitwan Saharia, William Chan, Saurabh Saxena, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
[44] Yujun Shi, Chuhui Xue, Jiachun Pan, Wenqing Zhang, Vincent Y. F. Tan, and Song Bai. DragDiffusion: Harnessing diffusion models for interactive point-based image editing. arXiv preprint arXiv:2306.14435, 2023.
[45] Kihyuk Sohn, Nataniel Ruiz, Kimin Lee, et al. StyleDrop: Text-to-image generation in any style. arXiv preprint arXiv:2306.00983, 2023.
[46] InternLM Team. InternLM: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023.
[47] Haileleol Tibebu, Aadin Malik, and Varuna De Silva. Text to image synthesis using stacked conditional variational autoencoders and conditional generative adversarial networks. In Science and Information Conference, pages 560-580. Springer, 2022.
[48] Hugo Touvron, Louis Martin, Kevin Stone, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[49] Qian Wang, Biao Zhang, Michael Birsak, and Peter Wonka. InstructEdit: Improving automatic masks for diffusion-based image editing with user instructions. arXiv preprint arXiv:2305.18047, 2023.
[50] Wenhai Wang, Zhe Chen, Xiaokang Chen, et al. VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175, 2023.
[51] Xuezhi Wang, Jason Wei, Dale Schuurmans, et al. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
[52] Jason Wei, Xuezhi Wang, Dale Schuurmans, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824-24837, 2022.
[53] Jingxuan Wei, Shiyu Wu, Xin Jiang, and Yequan Wang. DialogPaint: A dialog-based image editing model. arXiv preprint arXiv:2303.10073, 2023.
[54] Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. NExT-GPT: Any-to-any multimodal LLM, 2023.
[55] Zeyue Xue, Guanglu Song, Qiushan Guo, et al. RAPHAEL: Text-to-image generation via large mixture of diffusion paths. arXiv preprint arXiv:2305.18295, 2023.
[56] Aiyuan Yang, Bin Xiao, Bingning Wang, et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023.
[57] Binxin Yang, Shuyang Gu, Bo Zhang, et al. Paint by example: Exemplar-based image editing with diffusion models. In CVPR, pages 18381-18391, 2023.
[58] Shunyu Yao, Dian Yu, Jeffrey Zhao, et al. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
[59] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
[60] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models, 2023.
[61] Ahmet Burak Yildirim, Vedat Baday, Erkut Erdem, Aykut Erdem, and Aysegul Dundar. Inst-Inpaint: Instructing to remove objects with diffusion models. arXiv preprint arXiv:2304.03246, 2023.
[62] Shengming Yin, Chenfei Wu, Jian Liang, et al. DragNUWA: Fine-grained control in video generation by integrating text, image, and trajectory. arXiv preprint arXiv:2308.08089, 2023.
[63] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. MagicBrush: A manually annotated dataset for instruction-guided image editing, 2023.
[64] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
[65] Shu Zhang, Xinyi Yang, Yihao Feng, et al. HIVE: Harnessing human feedback for instructional visual editing. arXiv preprint arXiv:2303.09618, 2023.
