Text-to-Image Synthesis With Generative Models
ABSTRACT Text-to-image synthesis, the process of turning words into images, opens up a world of creative possibilities and meets the growing need for engaging visual experiences in a world that is becoming more image-based. As machine learning capabilities expanded, the area progressed from simple tools and systems to robust deep learning models that can automatically generate realistic images from textual inputs. Modern, large-scale text-to-image generation models have made significant progress in this direction, producing diverse, high-quality images from text description prompts. Although several methods exist, Generative Adversarial Networks (GANs) have long held a position of prominence. However, diffusion models have recently emerged, with results that go well beyond those achieved by GANs. This study offers a concise overview of text-to-image generative models by examining the existing body of literature and provides a deeper understanding of the topic. This is accomplished by summarizing the development of text-to-image synthesis, the earlier tools and systems employed in this field, and the key types of generative models, as well as by exploring the relevant research conducted on GANs and diffusion models. Additionally, the study provides an overview of common datasets used for training text-to-image models, compares the evaluation metrics used for assessing the models, and addresses the challenges encountered in the field. Finally, concluding remarks summarize the findings and implications of the study and the open issues for further research.
INDEX TERMS Deep learning, diffusion model, generative models, generative adversarial network, text-
to-image synthesis.
years. Users may be able to describe visual elements through visually rich text descriptions if automatic image generation from natural language is used. Visual content, like pictures, is a better way to share and understand information because it is more accurate and easier to understand than written text [4].
Text-to-image synthesis refers to the use of computational methods to convert human-written textual descriptions (sentences or keywords) into visually equivalent representations of those descriptions (images) [3]. The best alignment of visual content matching the text used to be determined through word-to-image correlation analysis combined with supervised methods in synthesis. New unsupervised methods, especially deep generative models, have emerged as a result of recent developments in deep learning. These models are able to generate reasonable visual images by employing appropriately trained neural networks [3]. Figure 1 shows the general architecture of text-to-image generation: a text prompt is fed into an image generative model, which uses the text description to generate an image.

FIGURE 1. General architecture of text-to-image generation

…through a given word. To produce compilations of images obtained from the Flickr platform, Word2Image [8] implemented a variety of methodologies, including semantic clustering, correlation analysis, and visual clustering. Moreover, WordsEye [9] is a text-to-scene system that automatically generates static 3D scenes that are representative of the supplied content. A language analyzer and a visualizer are the two primary parts of the system. Also, CONFUCIUS [10], a multi-modal system that works as a text-to-animation converter, can convert any sentence containing an action verb into an animation that is perfectly synced with speech. A visually assisted instant messaging technique, called Chat with Illustration (CWI) [11], automatically provides users with visual messages connected with text messages. Many systems for other languages also exist. To handle the Russian language, the Utkus [12] text-to-image synthesis system uses a natural language analysis module, a stage processing module, and a rendering module. Likewise, Vishit [13] is a method for visualizing processed Hindi texts; language processing, knowledge base construction, and scene generation are its three main computational foundations. Moreover, for the Arabic language, [14] put forth a comprehensive mobile-based system that generates illustrations for Arabic narratives automatically. The suggested method is specifically designed for use on mobile devices, with the aim of instructing Arab children in an engaging and non-traditional manner. Also, using a technique called conceptual graph matching, Illustrate It! [15] is a multimedia mobile learning solution for the Arabic language.
The generator is in charge of making new fake images: it takes a noise vector as input and outputs an image. The discriminator's job, on the other hand, is to tell the difference between real and fake images after being trained with real data. In other words, it serves as a classification network that is capable of classifying images by returning 0 for fake and 1 for real. The generator's goal is therefore to create convincing fakes in order to trick the discriminator, while the discriminator's goal is to recognize the difference [1]. Training improves both the discriminator's ability to distinguish between real and fake images and the generator's ability to produce realistic-looking images. When the discriminator can no longer tell genuine images from fraudulent ones, equilibrium has been reached.

Diffusion models, by contrast, operate in two phases: forward diffusion and backward diffusion. In the forward diffusion phase, Gaussian noise is progressively added to the input data at each step [21]. In the second phase, called "reverse," the model is tasked with reversing the diffusion process so that the original input data can be recovered.
The architectures of the generative model types are shown in Figure 7.
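To make the adversarial objective concrete, the following is a minimal PyTorch-style sketch of one GAN training step; the tiny fully connected generator and discriminator and the hyperparameters are illustrative placeholders, not components of any model surveyed here.

    import torch
    import torch.nn as nn

    # Placeholder networks: G maps a noise vector to a flattened image,
    # D maps an image to a real/fake probability in [0, 1].
    G = nn.Sequential(nn.Linear(100, 784), nn.Tanh())
    D = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())

    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCELoss()

    def train_step(real_images):                      # real_images: (batch, 784)
        batch = real_images.size(0)
        ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

        # 1) Discriminator update: label real images 1 and generated images 0.
        noise = torch.randn(batch, 100)
        fake_images = G(noise).detach()               # do not update G in this step
        loss_d = bce(D(real_images), ones) + bce(D(fake_images), zeros)
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()

        # 2) Generator update: try to make D output 1 ("real") for generated images.
        noise = torch.randn(batch, 100)
        loss_g = bce(D(G(noise)), ones)
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
        return loss_d.item(), loss_g.item()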
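Likewise, the forward (noising) phase of a diffusion model can be sketched as follows; the linear beta schedule, the number of steps, and the tensor shapes are assumptions made for illustration only.

    import torch

    T = 1000                                           # number of diffusion steps (assumed)
    betas = torch.linspace(1e-4, 0.02, T)              # noise schedule (assumed linear)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    def q_sample(x0, t):
        """Forward diffusion: add Gaussian noise to clean data x0 at step t."""
        noise = torch.randn_like(x0)
        a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)    # broadcast over (B, C, H, W)
        x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
        return x_t, noise                              # the reverse model learns to predict this noise

    # During training, a denoising network eps_theta(x_t, t, text) is optimized to
    # recover `noise`; at sampling time the learned reverse process gradually removes
    # noise, conditioned on the text embedding, to produce an image.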
II. DATASETS
Datasets play a crucial role in the development and evaluation of text-to-image generative models, and the use of diverse datasets is vital for achieving accurate and realistic visual outputs. This section describes the datasets most frequently used by text-to-image synthesis models:
MS COCO [31], known as Microsoft Common Objects in Context, is a comprehensive collection of images that is widely employed for object detection and segmentation. The dataset comprises more than 330,000 images, with each image accompanied by annotations for 80 object categories and 5 captions that describe the depicted scene. The COCO dataset is extensively used in computer vision research and has been employed for training and evaluating numerous cutting-edge models for object identification and segmentation.
CUB-200-2011 (Caltech-UCSD Birds-200-2011) [32] is a popular dataset for fine-grained visual categorization. It comprises 11,788 bird images from 200 subcategories, divided into 5,994 training and 5,794 testing images. Each image has subcategory, part location, binary attribute, and bounding box labels. Natural language descriptions supplement these annotations to improve the CUB-200-2011 dataset: each image received ten single-sentence descriptions.
Oxford 102 Flower [33] comprises a collection of 102 distinct categories of flowers, which can be effectively employed for image classification. The selected flowers are indigenous to the United Kingdom. The number of photos in each class ranges from 40 to 258, and the images demonstrate significant variations in size, pose, and lighting conditions. Some categories exhibit significant variation within their boundaries, while numerous categories show notable similarities to one another.
Figure 3 shows sample images along with their captions from the MS COCO, Oxford 102 Flower, and CUB-200-2011 datasets.

FIGURE 3. Sample images and their captions of common text-to-image datasets. Figure reproduced from Frolov et al. [1]

Multi-Modal CelebA-HQ [34] is a large-scale face image collection containing 30,000 high-resolution facial images hand-picked from the CelebA dataset by following CelebA-HQ [35]. Transparent images, sketches, descriptive text, and high-quality segmentation masks accompany each image. Algorithms for face generation and editing, text-guided picture manipulation, sketch-to-image production, and more can all benefit from being trained and tested on the data available in Multi-Modal CelebA-HQ.
CelebA-Dialog [36] is another large visual-language face dataset with detailed labeling; it divides a single feature into a range of degrees that all belong to the same semantic meaning. The dataset has over 200,000 images encompassing 10,000 distinct identities, and each image is accompanied by five detailed attributes, providing fine-grained information.
DeepFashion [37] serves as a valuable resource for training and evaluating numerous image synthesis models. It encompasses a comprehensive collection of annotations, including textual descriptions and fine-grained labels, across multiple modalities. The dataset comprises eight hundred thousand fashion images that exhibit a wide range of diversity, encompassing various accessories and poses.
ImageNet: To test algorithms designed to store, retrieve, or analyze multimedia data, researchers created a massive database called ImageNet [38], which contains high-quality, manually annotated images. There are more than 14 million images in the ImageNet database, all of which have been annotated using the WordNet classification system. Since 2010, the dataset has been applied as a standard for object recognition and image classification in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
OpenImages [39] consists of around 9 million images that have been annotated with various types of data, including object bounding boxes, image-level labels, object segmentation masks, localized narratives, and visual relationships. The training dataset of version 7 has 1.9 million images and 16 million bounding boxes representing 600 different item classes, rendering it the most extensive dataset currently available with annotations for object location.
CC12M (Conceptual 12M) [40] is one of the datasets utilized by OpenAI's DALL-E 2 for training, and it consists of 12 million text-image pairs. The dataset, built from the original CC3M dataset of 3 million text-image pairs, was used for a wide range of pre-training and end-to-end training of image models.
LAION-5B: One of the largest publicly available image-text datasets is the Large-scale AI Open Network (LAION) [41]. More than five billion text-image pairs make up LAION-5B, an AI training dataset that is 14 times larger than its predecessor, LAION-400M.
Table 1 provides a comprehensive comparison of the commonly used datasets in computer vision and multimodal research. Each dataset is evaluated based on key attributes including domain, common task, number of images, captions per image, training and testing split, and the number of object categories.
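To illustrate how such image–caption pairs are typically consumed in practice, the following minimal sketch loads MS COCO captions through torchvision; the local file paths are placeholders and assume the images and annotation file have already been downloaded.

    import torchvision.datasets as dsets
    import torchvision.transforms as T

    # Paths below are placeholders for a local copy of MS COCO (images + caption annotations).
    coco = dsets.CocoCaptions(
        root="data/coco/train2017",
        annFile="data/coco/annotations/captions_train2017.json",
        transform=T.Compose([T.Resize(256), T.CenterCrop(256), T.ToTensor()]),
    )

    image, captions = coco[0]          # one image and its reference captions
    print(image.shape, len(captions))  # e.g. torch.Size([3, 256, 256]) 5
    print(captions[0])                 # a natural-language description of the scene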
…stabilize conditional-GAN training. Using the provided text description as input, the Stage-I GAN generates low-resolution images of the initial shape and colors of the object. High-resolution (e.g., 256x256) images with photorealistic features are then generated by the Stage-II GAN using the results from Stage-I and the descriptive text.
An improvement to this model led to StackGAN++ [46]. The second version of StackGAN uses generators and discriminators organized in a tree-like structure to produce images at multiple scales that fit the same scene. StackGAN++ exhibits more reliable training behavior by approximating multiple distributions.
For even more accurate text-to-image production, the Attentional Generative Adversarial Network (AttnGAN) [47] permits attention-driven, multi-stage refinement. By focusing on important natural language terms, AttnGAN's attentional generative network allows it to synthesize fine-grained image features.
To rebuild textual descriptions from the generated images, MirrorGAN [48] presents a text-to-image-to-text architecture with three models. To guarantee global semantic coherence between textual descriptions and the corresponding produced images, it additionally suggests word- and sentence-average embedding.
Figure 4 shows the architectures of StackGAN, StackGAN++, AttnGAN, and MirrorGAN.
…using the sequential conditional GAN framework. To improve the image resolution and consistency of the generated sequences, it employs two discriminators, one at the story level and one at the image level, as well as a deep context encoder that dynamically tracks the story flow.
Furthermore, a multi-conditional GAN (MC-GAN) [50] coordinates both the object and the context. The main portion of MC-GAN is a synthesis block that separates object and background information during training. This block helps MC-GAN construct a realistic object image with the appropriate background by altering the proportion of background and foreground information.
The Dynamic Memory Generative Adversarial Network (DM-GAN) [51] employs a dynamic memory module to refine ambiguous image content in cases where the initial images are generated inadequately. The method can accurately generate images from the text description because a memory writing gate is created to pick the relevant text details based on the content of the initial image. In addition, a response gate is used to adaptively combine the data retrieved from the memories with the attributes of the images.
ManiGAN [52] semantically edits an image to match a provided text describing desirable attributes such as color, texture, and background, while keeping irrelevant content intact. ManiGAN has two major parts: the first links visual regions with meaningful phrases for effective manipulation, and the second corrects mismatched attributes and completes missing image content.
Without relying on any sort of entanglement between many generators, Deep Fusion Generative Adversarial Networks (DF-GAN) [53] can produce high-resolution images directly with a single generator and discriminator. Moreover, DF-GAN's deep text-image fusion block (DFBlock) allows for a more thorough and efficient fusion of text and image information.

TediGAN [34] combines text-guided image production and manipulation into one framework for high accessibility, diversity, accuracy, and stability in facial image generation and manipulation. It can synthesize high-quality images using multi-modal GAN inversion and a huge multi-modal dataset.
Although there have been many studies on text-to-image generation in English, very few have been applied to other languages. In [54], the use of AttnGAN was proposed for generating fine-grained images based on descriptions in Bangla text. It is capable of integrating the most exact details in various subregions of the image, with a specific emphasis on the pertinent terms mentioned in the natural language description.
Furthermore, [55] uses language translation models to extend established English text-to-image generation approaches to Hindi text-to-image synthesis. Input Hindi sentences are translated to English by a transformer-based neural machine translation module, whose output is supplied to a GAN-based image generation module.
On the other hand, the CJE-TIG [56] cross-lingual text-to-image pre-training technique removes barriers to using GAN-based text-to-image synthesis models for any given input language. This method alters text-to-image training patterns that are linguistically specific. It uses a bilingual joint encoder in place of a text encoder, applies a discriminator to optimize the encoder, and uses novel generative models to generate content.
The difficulties of visualizing the text of a story with several characters and exemplary semantic relationships were considered in [57]. Two cutting-edge GAN-based image generation models served as inspiration for the researchers' two-stage model architecture for creating images. Stage I of the image generation process makes use of a scene graph image generation framework; Stage II refines the output image using a StackGAN based on the object layout
…creating images from complex text descriptions, emphasizing the potential of GANs in the realm of text-to-image synthesis.

B. TEXT TO IMAGE GENERATION USING DIFFUSION MODELS
Unlike GAN-based approaches, which primarily work with small-scale data, autoregressive methods use large-scale data for text-to-image generation; examples include DALL-E [66] from OpenAI and Parti [67] from Google. Nevertheless, these approaches have significant computation costs and suffer from sequential error accumulation due to their autoregressive nature [66], [67], [68], [69]. Conversely, diffusion models have become highly popular for all sorts of generative applications.
For the purpose of creating images from text, the study in [70] introduced the vector quantized diffusion (VQ-Diffusion) model. Vector quantized variational autoencoders (VQ-VAEs) form the basis of this technique, with the latent space being modeled using a conditional variant of the Denoising Diffusion Probabilistic Model (DDPM). Using a natural language description with an ROI mask, the Blended Diffusion approach was proposed in [71] for making local (region-based) adjustments to real images. The authors accomplished this by employing a pretrained language-image model (CLIP) to guide the modification toward a given text prompt and combining it with a DDPM to generate natural-looking results.
CLIP-Forge [72] was proposed as a solution to the widespread absence of paired text and shape data. Utilizing a two-step training approach, CLIP-Forge requires only a pretrained image-text network such as CLIP, as well as an unlabeled shape dataset. One of the advantages of this approach is that it can produce various shapes for a given text without resorting to costly inference-time optimization.
In [73], the authors investigate CLIP guidance and classifier-free guidance as two separate guiding methodologies for the problem of text-conditional image synthesis. Their proposed model, GLIDE, stands for Guided Language-to-Image Diffusion for Generation and Editing.

TABLE 2. Diffusion model-based related studies

Ref.   Year   Model              Dataset
[70]   2021   VQ-Diffusion       CUB-200, Oxford-102 & MS-COCO
[73]   2021   GLIDE              250M image-text pairs
[79]   2022   Stable Diffusion   LAION dataset
[80]   2022   DALL-E-2           650M images
[74]   2021   M6-UFC             M2C-Fashion & Multi-Modal CelebA-HQ
[75]   2022   CLIP-GEN           MS-COCO & ImageNet
[76]   2022   Imagen             860M text-image pairs
[77]   2022   ERNIE-ViLG 2.0     170M image-text pairs
[78]   2022   eDiff-I            1B text-image pairs
[81]   2022   DiVAE              ImageNet

Imagen, a method for text-to-image synthesis presented in [76], uses a single encoder for the text sequence and a set of diffusion models to generate high-resolution images. The text embeddings provided by the encoder are a prerequisite for these models. In addition, the authors presented a new caption set (DrawBench) for testing text-to-image conversion. They also created Efficient U-Net, an efficient network architecture, and used it in their text-to-image generation experiments to test its efficacy. Figure 6 presents a simple visualisation of the Imagen architecture.

FIGURE 6. Overview of Imagen, reproduced from Saharia et al. [76]
…input into vector quantized scale-dependent components. The previously mentioned stage of learning multi-scale representations can also take advantage of input conditions such as language, scene graphs, and image layout. Frido can thus be utilized for both traditional and cross-modal image synthesis.
A new method called DreamBooth was suggested in [83] as a way to tailor the results of text-to-image generation from diffusion models to the needs of users. The authors fine-tuned a pretrained text-to-image model so that it is able to associate a distinctive identifier with a subject given only a small number of images of that subject as input. Once the subject is incorporated into the model's output domain, the identifier can be used to generate completely new photorealistic pictures of the subject in a variety of settings.
Furthermore, Imagic [84] shows how a single real image can be subjected to sophisticated text-guided semantic edits. While maintaining the image's original qualities, Imagic can alter the position and composition of one or more objects within it. It works on raw images without the need for image masks or any other preprocessing.
Likewise, UniTune [85] is capable of editing images with a high degree of semantic and visual fidelity to the original, given an arbitrary image and a textual edit description as input. It can be considered an art-direction tool that only requires text as input rather than more complex requirements such as masks or drawings.
DiVAE, a VQ-VAE architecture model that employs a diffusion decoder as the reconstructing component in image synthesis, was proposed by Shi et al. in [81]. They investigated how to incorporate image embeddings into the diffusion model for high performance and discovered that a minor adjustment to the U-Net used in diffusion could accomplish this.
Building upon the success of its predecessor [66], DALL-E 2 [80] was launched as a follow-up version with the intention of producing more realistic images at greater resolutions by combining concepts, features, and styles. The model consists of two parts: a prior that creates a CLIP image embedding from a caption, and a decoder that creates an image based on the embedding. It was demonstrated that increasing image variety through the intentional generation of representations leads to only a slight decrease in photorealism and caption similarity. Figure 8 shows samples of images generated by DALL-E 2 given a detailed text prompt.
DALL-E 3, which was recently released, represents a significant advancement over its predecessors. Leveraging advanced diffusion models, DALL-E 3 not only excels in maintaining fidelity to textual prompts but also underscores its ability to capture intricate details, marking a substantial advancement in the realm of generative models.

FIGURE 8. Samples generated by DALL-E 2 given the prompt: "a bowl of soup that is a portal to another dimension as digital art". Source: [80]

Stable Diffusion is another popular text-to-image tool, introduced in 2022 and based on previous work [79]. Stable Diffusion employs a type of diffusion model known as a latent diffusion model (LDM). A VAE, a U-Net, and an optional text encoder comprise Stable Diffusion. Compared to pixel-based diffusion models, LDMs dramatically reduce the processing requirements while achieving a new state of the art for image inpainting and highly competitive performance on a variety of applications such as unconditional image generation and super-resolution. Figure 9 shows an overview of the architecture of Stable Diffusion.
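As a concrete illustration of the prompt-to-image workflow described above, the following sketch runs a pretrained latent diffusion pipeline via the Hugging Face diffusers library; the checkpoint name, step count, and guidance scale are illustrative assumptions rather than settings used by the works surveyed here.

    import torch
    from diffusers import StableDiffusionPipeline

    # Load a pretrained latent diffusion pipeline (VAE + U-Net + CLIP text encoder).
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",          # example checkpoint identifier
        torch_dtype=torch.float16,
    )
    pipe = pipe.to("cuda")

    prompt = "a bowl of soup that is a portal to another dimension as digital art"
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
    image.save("soup_portal.png")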
…corresponding images. It stands out due to its deep semantic understanding derived from large language models, enabling it to create images that are both visually appealing and closely aligned with complex textual descriptions. The model's training is enhanced by the ParaImage dataset, which includes extensive image-text pairs. This approach marks a significant advancement in AI-driven media, particularly in generating intricate images from elaborate text descriptions.
UPainting, an approach presented in [88], targets automatic painting generation using deep learning. The model captures the essence of famous painters and styles, enabling the creation of new artworks that reflect the characteristics of these styles. It is a blend of art and technology, offering a new way of creating art with AI assistance.
CLIPAG [89] explores a unique approach to text-to-image generation without relying on traditional generative models. It leverages Perceptually Aligned Gradients (PAG) in robust vision-language models, specifically an enhanced version of CLIP, to generate images directly aligned with text descriptions. This method marks a shift in text-to-image synthesis, utilizing a more streamlined and efficient process compared to conventional methods.
GLIGEN was proposed in [90] as a new method for text-to-image generation, focusing on generating linguistically coherent and visually compelling images. It emphasizes the integration of natural language understanding and image synthesis, demonstrating impressive capabilities in creating images that accurately reflect complex textual inputs.
SnapFusion [91] introduces an efficient text-to-image diffusion model optimized for mobile devices, achieving image generation in under two seconds. It addresses the computational intensity and speed limitations of existing diffusion models through an innovative network architecture and improved step distillation. The proposed UNet efficiently synthesizes high-quality images, outperforming the baseline Stable Diffusion model in terms of FID and CLIP scores.
The research in [92] introduces a method to add conditional control to image generation models, allowing for more precise and tailored image creation. The approach improves the ability to generate images that meet specific criteria or conditions, enhancing the versatility and applicability of image generation technologies.
The paper [93] explores advancements in text-to-image diffusion models, focusing on enhancing their capabilities to produce more realistic and varied images. The study delves into new methods and techniques to improve these models, significantly advancing the field of T2I synthesis.
The study in [94] focuses on adapting the English Stable Diffusion model for Chinese text-to-image synthesis. It introduces a novel method for transferring the model's capabilities to the Chinese language, resulting in high-quality image generation from Chinese text prompts while significantly reducing the need for extensive training data.
AltDiffusion [95] presents a multilingual text-to-image diffusion model supporting eighteen languages, addressing the limitations of existing models that cater primarily to English. The paper details the development and effectiveness of this model in generating culturally relevant and accurate images across various languages, showcasing its potential for global use in T2I tasks.
Random image samples on the MS-COCO dataset, generated by DALL-E, GLIDE, and DALL-E 2, are presented in Figure 10.

IV. EVALUATION METRICS
The majority of current metrics evaluate a model's quality by considering two main factors: the quality of the images it produces and the alignment between text and images. Fréchet Inception Distance (FID) [96] and Inception Score (IS) [97] are commonly used metrics for appraising the image quality of a model; these metrics were initially developed for traditional GAN tasks focused on assessing image quality. To evaluate text-image alignment, the R-precision [47] metric is widely employed. For more in-depth details, we refer to [98]. Moreover, the CLIP Score [99] is used to evaluate common sense and mentioned objects, while human evaluation offers a comprehensive insight into multiple aspects of image generation. A detailed description of each metric follows.
The Fréchet Inception Distance (FID) [96]: Using the feature space of a pre-trained Inception v3 network, FID determines the Fréchet distance between the natural and artificial distributions:

F(r, g) = ||μ_r − μ_g||² + trace(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2))    (1)

where r and g denote the real and generated image feature distributions, respectively, and μ_r, μ_g, Σ_r, and Σ_g are the corresponding means and covariances. A lower FID score is better: it indicates a higher level of realism, accuracy, and variety in the generated distribution. Table 3 presents a comparison of FID scores obtained by GANs and diffusion models on the MS-COCO dataset, and shows that diffusion models achieve remarkable results.

TABLE 3. FID scores of GANs and diffusion models on the MS-COCO dataset

Ref.   Model              FID ↓
[45]   StackGAN           74.05
[46]   StackGAN++         81.59
[47]   AttnGAN            35.49
[48]   MirrorGAN          -
[51]   DM-GAN             32.64
[53]   DF-GAN             21.42
[70]   VQ-Diffusion       13.86
[73]   GLIDE              12.24
[79]   Stable Diffusion   12.63
[80]   DALL-E-2           10.39
[76]   Imagen             7.27
[77]   ERNIE-ViLG 2.0     6.75
[78]   eDiff-I            6.95

The Inception Score (IS) [97], which ignores the underlying real distribution, measures the produced distribution's faithfulness and diversity:

IS = exp(E_x[D_KL(p(y|x) || p(y))])    (2)

IS computes the Kullback-Leibler (KL) divergence between the conditional distribution p(y|x) and the marginal distribution p(y), where the label y of a generated image x is predicted using a pre-trained Inception v3 network. Unlike FID, a higher IS is preferable: it implies high-quality images that are accurately categorized by class.
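For reference, Equations (1) and (2) can be computed as in the sketch below, which assumes that Inception-v3 pool features and class probabilities have already been extracted for the real and generated image sets.

    import numpy as np
    from scipy import linalg

    def fid(real_feats, gen_feats):
        """Eq. (1): Frechet distance between Gaussians fitted to Inception features."""
        mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
        sigma_r = np.cov(real_feats, rowvar=False)
        sigma_g = np.cov(gen_feats, rowvar=False)
        covmean = linalg.sqrtm(sigma_r @ sigma_g).real   # matrix square root (drop tiny imaginary parts)
        return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2.0 * covmean))

    def inception_score(probs, eps=1e-12):
        """Eq. (2): exp of the mean KL divergence between p(y|x) and the marginal p(y)."""
        p_y = probs.mean(axis=0, keepdims=True)          # marginal class distribution
        kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
        return float(np.exp(kl.mean()))

    # real_feats, gen_feats: (N, 2048) Inception-v3 pool features;
    # probs: (N, 1000) softmax outputs for the generated images.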
FIGURE 10. Random image samples on MS-COCO, generated by DALL-E, GLIDE, and DALL-E 2. Source: [80]
The R-precision (RP) [47] metric is widely employed for assessing the consistency between text and images. RP operates by using a generated image to query its caption. Specifically, given an authentic text description and 99 randomly selected mismatched captions, an image is produced from the authentic caption. This resulting image is then used to retrieve the original description from the pool of 100 candidate captions, and the retrieval is deemed successful if the similarity score between the image and the authentic caption is the highest. The matching score is determined using the cosine similarity between the encoding vectors of the image and the caption. A higher RP score indicates better quality, with RP being the proportion of successful retrievals.
CLIP score: The CLIP model [99], developed by OpenAI, is able to evaluate the semantic similarity between a given text caption and an accompanying image. Based on this rationale, the CLIP score can serve as a quantitative measure and is formally defined as:

E[s(f(image) · g(caption))]    (3)

where the expectation is computed over the set of generated images in a batch, f and g are the CLIP image and text encoders, and s represents the logarithmic scale of the CLIP logit [73]. A higher CLIP score suggests a stronger semantic relationship between the image and the text, while a lower score indicates a weaker connection.
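The following schematic shows how R-precision and the CLIP score are typically computed; the embeddings are assumed to come from a pretrained image-text model such as CLIP, and the function and parameter names are placeholders of this sketch rather than a specific released API.

    import numpy as np

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    def r_precision(image_embs, true_caption_embs, distractor_embs):
        """Fraction of generated images whose true caption ranks first among
        1 authentic + 99 mismatched candidate captions (cosine similarity)."""
        hits = 0
        for img, true_cap, distractors in zip(image_embs, true_caption_embs, distractor_embs):
            scores = [cosine(img, true_cap)] + [cosine(img, d) for d in distractors]
            hits += int(np.argmax(scores) == 0)
        return hits / len(image_embs)

    def clip_score(image_embs, caption_embs, logit_scale=100.0):
        """Eq. (3): mean scaled similarity between paired image and caption embeddings.
        logit_scale approximates CLIP's learned temperature and is an assumption here."""
        sims = [logit_scale * cosine(i, c) for i, c in zip(image_embs, caption_embs)]
        return float(np.mean(sims))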
Human evaluations: Some studies used human evaluation as a qualitative measure to assess the quality of the results. The reporting of metrics based on human evaluation is motivated by the fact that many possible applications of the models are centered upon tricking the human observer [100]. Typically, a collection of images is provided to an individual, who is tasked with evaluating their quality in terms of photorealism and alignment with the associated captions.
Frolov et al. [1] proposed a set of criteria for comparing evaluation metrics. These criteria are explained below.
• Image Quality and Diversity: The degree to which the generated image looks realistic or similar to the reference image, and the ability of the model to produce varied images based on the same text prompt.
• Text Relevance: How well the generated image corresponds to the given text prompt.
• Mentioned Objects and Object Fidelity: Whether the model correctly identifies and includes the objects mentioned in the text, and how accurately the objects in the generated image match their real-world counterparts.
• Numerical and Positional Alignment: The accuracy of any quantitative details and the positional arrangement of objects in the generated image in relation to the provided text.
• Common Sense: The presence of logical and expected elements in the generated image.
• Paraphrase Robustness: Whether the metric remains unaffected by minor modifications of the input description, such as word substitutions or rephrasings.
• Explainable: The ability to provide a clear explanation of why an image is not aligned with the input.
• Automatic: Whether the metric can be calculated automatically without human intervention.
Based on these key criteria, Table 4 provides a comparative analysis of the commonly used text-to-image evaluation metrics. It is important to note that the table offers a simplified overview. In practice, choosing the right metric depends on the specific goals and context of the text-to-image generation task. Additionally, the effectiveness of these metrics may vary depending on the specific model and dataset used.

V. CHALLENGES AND LIMITATIONS
Although there has been significant progress in creating visual representations of textual descriptions, some challenges and limitations remain, which are discussed below.
Open source: Although DALL-E is one of the competitive models, it has unfortunately not been released for public use. There is a copy of DALL-E 2 available in PyTorch [101], but no pre-trained model. The Stable Diffusion model, however, is among the open-source models that are currently accessible. Stable Diffusion benefits from extensive community support due to its open-source nature; consequently, additional advancements in this particular area are anticipated in the near future.
Language Support: The majority of studies in the field of text-to-image generation have been conducted on English text descriptions due to the abundance of dataset resources and the simple structure of the language. Some languages, however, require more effort, which needs to be addressed. For instance, Arabic, in contrast to English, has more complicated morphological features and fewer semantic and linguistic resources [5]. This is a main challenge that needs to be dealt with in text-to-image generation.
Computational Complexity: The computational complexity of diffusion models poses a notable difficulty. Training a diffusion model involves multiple iterative processes, which can impose a significant computational burden. Therefore, the model's scalability may be constrained by the increased complexity observed when working with larger datasets and higher-resolution images. Moreover, for further research in the field of text-to-image generative models, and despite the availability of big datasets like LAION-5B to the general public, the utilization of such datasets remains challenging for individuals due to the substantial hardware requirements involved.
Ethical Considerations: It is important to consider the potential ethical issues that arise with the use of text-to-image generative models. One of the significant concerns is the potential for misuse of these models. With the ability to generate realistic images based on text descriptions, there is a risk that these models could be used to create deceptive or misleading content. This could have serious consequences in various areas, such as fake news, fraud, or even harassment. Another issue is the potential bias that can be embedded in the generated images. If the training data used to develop these models is not diverse and representative, there is a possibility that the generated images may reflect prejudices or stereotypes present in the data.

VI. FUTURE DIRECTIONS
The domain of text-to-image generation is experiencing significant advancements on a regular basis. The recent emergence of novel generative diffusion models, including DALL-E, Midjourney, Stable Diffusion, and others, has sparked significant interest and discussion in the scientific community. The field shows a high degree of fertility and renewability, as seen in the recent publication of numerous relevant studies and an ongoing flow of new papers within a relatively short timeframe.
By making generative models open-source, researchers and developers can collaborate more effectively, which will in turn boost innovation in the field. Researchers may utilize these publicly available models to investigate novel uses, enhance current AI models, and move the field forward rapidly.
To overcome the language barrier, some studies proposed multilingual [95] and cross-lingual [56] models to support multiple languages within the same model. The goal of these multilingual models is to break down linguistic barriers by providing a common groundwork for the comprehension and processing of several languages at once.
TABLE 4. Overview of commonly used evaluation metrics for text-to-image synthesis, adapted from Frolov et al. [1]

Metric             Criteria covered
FID                Image Quality, Image Diversity, Automatic
IS                 Image Quality, Automatic
R-precision (RP)   Text Relevance, Automatic
Clip Score         Mentioned Objects, Common Sense, Automatic
Human Evaluation   Image Quality, Image Diversity, Text Relevance, Object Fidelity, Mentioned Objects, Numerical Alignment, Positional Alignment, Common Sense, Paraphrase Robustness, Explainable
This method has the ability to dramatically improve linguistic diversity in communication and open up access to information for everyone.
Moreover, to make these technologies more widely accessible and sustainable, it will be essential to improve resource efficiency and minimize computational complexity by creating models that produce high-quality images using fewer computational resources.
Nevertheless, greater research into ethical and bias considerations is required. Ensuring fairness, removing bias, and adhering to ethical rules are still critical considerations for any AI system. Possible directions for future study in this area include developing models with increased awareness of and sensitivity to these factors.
The utilization of text-to-image generation exhibits a wide range of applications across several domains, including but not limited to education, product design, and marketing. This technology enables the creation of visual materials, such as illustrations and infographics, that seamlessly integrate text and images. There are some early assumptions about which businesses might be impacted by the growing area of image generation, which will obviously have an impact on any sector that relies on visual art, such as graphic design, filmmaking, or photography [100].

VII. CONCLUSION
The field of text-to-image synthesis has made significant progress in recent years. The development of GANs and diffusion models has paved the way for more advanced and realistic image generation from textual descriptions. These models have demonstrated an outstanding ability to generate high-quality images across a wide range of domains and datasets. This study offers a comprehensive review of the existing literature on text-to-image generative models, summarizing the historical development, popular datasets, key methods, commonly used evaluation metrics, and challenges faced in this field. Despite these challenges, the potential of text-to-image generation in expanding creative horizons and enhancing AI systems is undeniable. The ability to generate realistic and diverse images from textual inputs opens up new possibilities in various fields, including art, design, advertising, and others. Therefore, researchers and practitioners should continue to explore and refine text-to-image generative models.

ACKNOWLEDGMENT
The authors would like to thank the Deanship of Scientific Research, Qassim University, for funding the publication of this project.

REFERENCES
[1] S. Frolov, T. Hinz, F. Raue, J. Hees, and A. Dengel, ''Adversarial text-to-image synthesis: A review,'' Neural Networks, vol. 144, pp. 187–209, 12 2021.
[2] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, ''Generative adversarial networks,'' 6 2014. [Online]. Available: http://arxiv.org/abs/1406.2661
[3] J. Agnese, J. Herrera, H. Tao, and X. Zhu, ''A survey and taxonomy of adversarial neural networks for text-to-image synthesis,'' Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 10, 7 2020.
[4] L. Jin, F. Tan, and S. Jiang, ''Generative adversarial network technologies and applications in computer vision,'' Computational Intelligence and Neuroscience, vol. 2020, 2020.
[5] J. Zakraoui, M. Saleh, and J. Alja'am, ''Text-to-picture tools, systems, and approaches: a survey,'' Multimedia Tools and Applications, vol. 78, pp. 22833–22859, 2019. [Online]. Available: https://doi.org/10.1007/s11042-019-7541-4
[6] D. Joshi, J. Z. Wang, and J. Li, ''The story picturing engine—a system for automatic text illustration,'' ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 2, pp. 68–89, 2 2006. [Online]. Available: https://dl.acm.org/doi/10.1145/1126004.1126008
[7] X. Zhu, A. Goldberg, M. Eldawy, C. Dyer, and B. Strock, ''A text-to-picture synthesis system for augmenting communication,'' 10 2007, p. 1590.
[8] H. Li, J. Tang, G. Li, and T. S. Chua, ''Word2image: towards visual interpreting of words,'' ACM Multimedia, pp. 813–816, 2008.
[9] B. Coyne and R. Sproat, ''Wordseye: an automatic text-to-scene conversion system,'' International Conference on Computer Graphics and Interactive Techniques, pp. 487–496, 2001.
[10] M. E. Ma, ''Confucius: An intelligent multimedia storytelling interpretation and presentation system,'' School of Computing and Intelligent Systems, University of Ulster, 2002.
[11] Y. Jiang, J. Liu, and H. Lu, ''Chat with illustration,'' Multimedia Systems, vol. 22, pp. 5–16, 2016. [Online]. Available: https://doi.org/10.1007/s00530-014-0371-3
[12] D. Ustalov, ''A text-to-picture system for russian language,'' 10 2012, pp. 35–44.
[13] P. Jain, H. Darbari, and V. C. Bhavsar, ''Vishit: A visualizer for hindi text,'' 2014, pp. 886–890.
[14] A. Karkar, J. M. A. Ja'am, S. Foufou, and A. Sleptchenko, ''An e-learning mobile system to generate illustrations for arabic text,'' 2016, pp. 184–191.
[15] A. G. Karkar, J. M. Alja'am, and A. Mahmood, ''Illustrate it! an arabic multimedia text-to-picture m-learning system,'' IEEE Access, vol. 5, pp. 12777–12787, 2017.
[16] I. Goodfellow, ''Nips 2016 tutorial: Generative adversarial networks,'' 12 2016. [Online]. Available: http://arxiv.org/abs/1701.00160
[17] D. P. Kingma and M. Welling, ''An introduction to variational autoencoders,'' Foundations and Trends in Machine Learning, vol. 12, pp. 307–392, 6 2019. [Online]. Available: https://arxiv.org/abs/1906.02691v3
[18] L. Weng, ''Flow-based deep generative models,'' lilianweng.github.io, 2018. [Online]. Available: https://lilianweng.github.io/posts/2018-10-13-flow-models/
[19] P. Dhariwal and A. Nichol, ''Diffusion models beat gans on image synthesis,'' 5 2021. [Online]. Available: http://arxiv.org/abs/2105.05233
[20] F.-A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah, ''Diffusion models in vision: A survey,'' 9 2022. [Online]. Available: http://arxiv.org/abs/2209.04747
[21] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, Y. Shao, W. Zhang, B. Cui, and M.-H. Yang, ''Diffusion models: A comprehensive survey of methods and applications,'' 9 2022. [Online]. Available: https://arxiv.org/abs/2209.00796v9
[22] L. Weng, ''What are diffusion models?'' lilianweng.github.io, 2021. [Online]. Available: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
[23] S. Tyagi and D. Yadav, ''A comprehensive review on image synthesis with adversarial networks: Theory, literature, and applications,'' Archives of Computational Methods in Engineering, vol. 29, pp. 2685–2705, 8 2022.
[24] R. Zhou, C. Jiang, and Q. Xu, ''A survey on generative adversarial network-based text-to-image synthesis,'' Neurocomputing, vol. 451, pp. 316–336, 9 2021.
[25] Y. X. Tan, C. P. Lee, M. Neo, K. M. Lim, J. Y. Lim, and A. Alqahtani, ''Recent advances in text-to-image synthesis: Approaches, datasets and future research prospects,'' IEEE Access, vol. 11, pp. 88099–88115, 2023.
[26] H. Cao, C. Tan, Z. Gao, G. Chen, P.-A. Heng, and S. Z. Li, ''A survey on generative diffusion model,'' 9 2022. [Online]. Available: http://arxiv.org/abs/2209.02646
[27] C. Zhang, C. Zhang, S. Zheng, M. Zhang, M. Qamar, S.-H. Bae, and I. S. Kweon, ''A survey on audio diffusion models: Text to speech synthesis and enhancement in generative ai,'' 3 2023. [Online]. Available: https://arxiv.org/abs/2303.13336v2
[28] R. Yang, P. Srivastava, and S. Mandt, ''Diffusion probabilistic modeling for video generation,'' Entropy, vol. 25, 3 2022. [Online]. Available: https://arxiv.org/abs/2203.09481v5
[29] A. Ulhaq, N. Akhtar, and G. Pogrebna, ''Efficient diffusion models for vision: A survey,'' 10 2022. [Online]. Available: https://arxiv.org/abs/2210.09292v2
[30] C. Zhang, C. Zhang, M. Zhang, and I. S. Kweon, ''Text-to-image diffusion models in generative ai: A survey,'' 3 2023. [Online]. Available: https://arxiv.org/abs/2303.07909v2
[31] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, ''Microsoft coco: Common objects in context,'' Lecture Notes in Computer Science, vol. 8693 LNCS, pp. 740–755, 5 2014. [Online]. Available: https://arxiv.org/abs/1405.0312v3
[32] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, ''Learning what and where to draw,'' 10 2016. [Online]. Available: http://arxiv.org/abs/1610.02454
[33] M.-E. Nilsback and A. Zisserman, ''Automated flower classification over a large number of classes,'' 10 2008, pp. 722–729.
[34] W. Xia, Y. Yang, J.-H. Xue, and B. Wu, ''Tedigan: Text-guided diverse face image generation and manipulation,'' Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2256–2265, 12 2020. [Online]. Available: https://github.com/weihaox/TediGAN. https://arxiv.org/abs/2012.03308v3
[35] T. Karras, T. Aila, S. Laine, and J. Lehtinen, ''Progressive growing of gans for improved quality, stability, and variation,'' 6th International Conference on Learning Representations, ICLR 2018 - Conference Track Proceedings, 10 2017. [Online]. Available: https://arxiv.org/abs/1710.10196v3
[36] Y. Jiang, Z. Huang, X. Pan, C. C. Loy, and Z. Liu, ''Talk-to-edit: Fine-grained facial editing via dialog,'' 2021.
[37] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, ''Deepfashion: Powering robust clothes recognition and retrieval with rich annotations,'' 6 2016.
[38] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ''Imagenet: A large-scale hierarchical image database,'' 2009. [Online]. Available: http://www.image-net.org.
[39] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, T. Duerig, and V. Ferrari, ''The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale,'' International Journal of Computer Vision, vol. 128, pp. 1956–1981, 7 2020.
[40] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, ''Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts,'' Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3557–3567, 2 2021. [Online]. Available: https://arxiv.org/abs/2102.08981v2
[41] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev, ''Laion-5b: An open large-scale dataset for training next generation image-text models,'' Advances in Neural Information Processing Systems, vol. 35, 10 2022. [Online]. Available: https://arxiv.org/abs/2210.08402v1
[42] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, ''Generative adversarial text to image synthesis,'' 5 2016. [Online]. Available: http://arxiv.org/abs/1605.05396
[43] A. Radford, L. Metz, and S. Chintala, ''Unsupervised representation learning with deep convolutional generative adversarial networks,'' 2015.
[44] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, ''Learning what and where to draw,'' 10 2016. [Online]. Available: http://arxiv.org/abs/1610.02454
[45] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas, ''Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks,'' vol. 2017-Octob, 12 2016, pp. 5908–5916. [Online]. Available: https://github.com/hanzhanggit/StackGAN.
[46] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, ''Stackgan++: Realistic image synthesis with stacked generative adversarial networks,'' IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, pp. 1947–1962, 10 2017. [Online]. Available: https://github.com/hanzhanggit/StackGAN-v2.
[47] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, ''Attngan: Fine-grained text to image generation with attentional generative adversarial networks,'' Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1316–1324, 11 2017.
[48] T. Qiao, J. Zhang, D. Xu, and D. Tao, ''Mirrorgan: Learning text-to-image generation by redescription,'' Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2019-June, pp. 1505–1514, 3 2019. [Online]. Available: https://arxiv.org/abs/1903.05854v1
[49] Y. Li, Z. Gan, Y. Shen, J. Liu, Y. Cheng, Y. Wu, L. Carin, D. Carlson, and J. Gao, ''Storygan: A sequential conditional gan for story visualization,'' 12 2018. [Online]. Available: http://arxiv.org/abs/1812.02784
[50] H. Park, Y. Yoo, and N. Kwak, ''Mc-gan: Multi-conditional generative adversarial network for image synthesis,'' British Machine Vision Conference 2018, BMVC 2018, 5 2018. [Online]. Available: https://arxiv.org/abs/1805.01123v5
[51] M. Zhu, P. Pan, W. Chen, and Y. Yang, ''Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis,'' 2019, pp. 5795–5803.
[52] B. Li, X. Qi, T. Lukasiewicz, and P. H. Torr, ''Manigan: Text-guided image manipulation,'' Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 7877–7886, 12 2019. [Online]. Available: https://arxiv.org/abs/1912.06203v2
[53] M. Tao, H. Tang, F. Wu, X.-Y. Jing, B.-K. Bao, and C. Xu, ''Df-gan: A simple and effective baseline for text-to-image synthesis,'' pp. 16494–16504, 8 2020. [Online]. Available: https://arxiv.org/abs/2008.05865v4
[54] M. A. H. Palash, M. A. A. Nasim, A. Dhali, and F. Afrin, ''Fine-grained image generation from bangla text description using attentional generative adversarial network,'' 2021 IEEE International Conference on Robotics, Automation, Artificial-Intelligence and Internet-of-Things (RAAICON), pp. 79–84, 12 2021. [Online]. Available: https://ieeexplore.ieee.org/document/9929536/
[55] A. S. Parihar, A. Kaushik, A. V. Choudhary, and A. K. Singh, ''Htgan: An architecture for hindi text based image synthesis,'' 2021 5th International Conference on Computer, Communication, and Signal Processing, ICCCSP 2021, pp. 273–279, 5 2021.
[56] H. Zhang, S. Yang, and H. Zhu, ‘‘Cje-tig: Zero-shot cross-lingual text-to-image generation by corpora-based joint encoding,’’ Knowledge-Based Systems, vol. 239, p. 108006, 3 2022.
[57] J. Zakraoui, M. Saleh, S. Al-Maadeed, and J. M. Jaam, ‘‘Improving text-to-image generation with object layout guidance,’’ Multimedia Tools and Applications, vol. 80, pp. 27423–27443, 2021. [Online]. Available: https://doi.org/10.1007/s11042-021-11038-0
[58] J. Zakraoui, S. A. Maadeed, M. S. A. El-Seoud, J. M. Alja'am, and M. Salah, ‘‘A generative approach to enrich arabic story text with visual aids,’’ Association for Computing Machinery, 2021, pp. 47–52. [Online]. Available: https://doi.org/10.1145/3512716.3512725
[59] S. Maher and M. Loey, ‘‘Photo realistic generation from arabic text description based on generative adversarial networks,’’ Transactions on Asian and Low-Resource Language Information Processing, 5 2021. [Online]. Available: https://dl.acm.org/doi/10.1145/3490504
[60] M. Bahani, A. El Ouaazizi, and K. Maalmi, ‘‘Arabert and df-gan fusion for arabic text-to-image generation,’’ Array, vol. 16, p. 100260, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2590005622000935
[61] M. Bahani, S. M. Ben, K. Maalmi, and A. E. Ouaazizi, ‘‘Increase the effectiveness of the arabic text-to-image generation task,’’ 10 2022. [Online]. Available: https://www.researchsquare.com/article/rs-2169841/v1
[62] M. Bahani, A. E. Ouaazizi, and K. Maalmi, ‘‘The effectiveness of t5, gpt-2, and bert on text-to-image generation task,’’ Pattern Recognition Letters, vol. 173, pp. 57–63, 9 2023.
[63] M. Kang, J.-Y. Zhu, R. Zhang, J. Park, E. Shechtman, S. Paris, and T. Park, ‘‘Scaling up gans for text-to-image synthesis,’’ pp. 10124–10134, 3 2023. [Online]. Available: https://arxiv.org/abs/2303.05511v2
[64] C. Liu, J. Hu, and H. Lin, ‘‘Swf-gan: A text-to-image model based on sentence–word fusion perception,’’ Computers & Graphics, vol. 115, pp. 500–510, 10 2023.
[65] M. Tao, B. K. Bao, H. Tang, and C. Xu, ‘‘Galip: Generative adversarial clips for text-to-image synthesis,’’ Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2023-June, pp. 14214–14223, 1 2023. [Online]. Available: https://arxiv.org/abs/2301.12959v1
[66] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, ‘‘Zero-shot text-to-image generation,’’ 2 2021. [Online]. Available: https://arxiv.org/abs/2102.12092v2
[67] J. Yu, Y. Xu, J. Y. Koh, T. Luong, G. Baid, Z. Wang, V. Vasudevan, A. Ku, Y. Yang, B. K. Ayan, B. Hutchinson, W. Han, Z. Parekh, X. Li, H. Zhang, J. Baldridge, and Y. Wu, ‘‘Scaling autoregressive models for content-rich text-to-image generation,’’ 6 2022. [Online]. Available: https://arxiv.org/abs/2206.10789v1
[68] M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, and J. Tang, ‘‘Cogview: Mastering text-to-image generation via transformers,’’ Advances in Neural Information Processing Systems, vol. 24, pp. 19822–19835, 5 2021. [Online]. Available: https://arxiv.org/abs/2105.13290v3
[69] M. Ding, W. Zheng, W. Hong, and J. Tang, ‘‘Cogview2: Faster and better text-to-image generation via hierarchical transformers,’’ 4 2022. [Online]. Available: https://arxiv.org/abs/2204.14217v2
[70] S. Gu, D. Chen, J. Bao, F. Wen, B. Zhang, D. Chen, L. Yuan, and B. Guo, ‘‘Vector quantized diffusion model for text-to-image synthesis,’’ Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2022-June, pp. 10686–10696, 11 2021. [Online]. Available: https://arxiv.org/abs/2111.14822v3
[71] O. Avrahami, D. Lischinski, and O. Fried, ‘‘Blended diffusion for text-driven editing of natural images,’’ pp. 18187–18197, 11 2021. [Online]. Available: https://arxiv.org/abs/2111.14818v2
[72] A. Sanghi, H. Chu, J. G. Lambourne, Y. Wang, C.-Y. Cheng, M. Fumero, and K. R. Malekshan, ‘‘Clip-forge: Towards zero-shot text-to-shape generation,’’ pp. 18582–18592, 10 2021. [Online]. Available: https://arxiv.org/abs/2110.02624v2
[73] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. Mcgrew, I. Sutskever, and M. Chen, ‘‘Glide: Towards photorealistic image generation and editing with text-guided diffusion models,’’ 12 2021. [Online]. Available: https://arxiv.org/abs/2112.10741v3
[74] Z. Zhang, J. Ma, C. Zhou, R. Men, Z. Li, M. Ding, J. Tang, J. Zhou, and H. Yang, ‘‘M6-ufc: Unifying multi-modal controls for conditional image synthesis via non-autoregressive generative transformers,’’ 5 2021. [Online]. Available: https://arxiv.org/abs/2105.14211v4
[75] Z. Wang, W. Liu, Q. He, X. Wu, and Z. Yi, ‘‘Clip-gen: Language-free training of a text-to-image generator with clip,’’ 3 2022. [Online]. Available: https://arxiv.org/abs/2203.00386v1
[76] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, ‘‘Photorealistic text-to-image diffusion models with deep language understanding,’’ 5 2022. [Online]. Available: https://arxiv.org/abs/2205.11487v1
[77] Z. Feng, Z. Zhang, X. Yu, Y. Fang, L. Li, X. Chen, Y. Lu, J. Liu, W. Yin, S. Feng, Y. Sun, H. Tian, H. Wu, and H. Wang, ‘‘Ernie-vilg 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts,’’ 10 2022. [Online]. Available: https://arxiv.org/abs/2210.15257v1
[78] Y. Balaji, S. Nah, X. Huang, A. Vahdat, J. Song, K. Kreis, M. Aittala, T. Aila, S. Laine, B. Catanzaro, T. Karras, and M.-Y. Liu, ‘‘ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers,’’ 11 2022. [Online]. Available: https://arxiv.org/abs/2211.01324v3
[79] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, ‘‘High-resolution image synthesis with latent diffusion models,’’ pp. 10674–10685, 12 2021. [Online]. Available: https://arxiv.org/abs/2112.10752v2
[80] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, ‘‘Hierarchical text-conditional image generation with clip latents,’’ 4 2022. [Online]. Available: http://arxiv.org/abs/2204.06125
[81] J. Shi, C. Wu, J. Liang, X. Liu, and N. Duan, ‘‘Divae: Photorealistic images synthesis with denoising diffusion decoder,’’ 6 2022. [Online]. Available: https://arxiv.org/abs/2206.00386v1
[82] W.-C. Fan, Y.-C. Chen, D. Chen, Y. Cheng, L. Yuan, and Y.-C. F. Wang, ‘‘Frido: Feature pyramid diffusion for complex scene image synthesis,’’ 8 2022. [Online]. Available: https://arxiv.org/abs/2208.13753v1
[83] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, ‘‘Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation,’’ 8 2022. [Online]. Available: https://arxiv.org/abs/2208.12242v1
[84] B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, and M. Irani, ‘‘Imagic: Text-based real image editing with diffusion models,’’ 10 2022. [Online]. Available: https://arxiv.org/abs/2210.09276v1
[85] D. Valevski, M. Kalman, Y. Matias, and Y. Leviathan, ‘‘Unitune: Text-driven image editing by fine tuning an image generation model on a single image,’’ 10 2022. [Online]. Available: https://arxiv.org/abs/2210.09477v3
[86] J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, W. Manassra, P. Dhariwal, C. Chu, Y. Jiao, and A. Ramesh, ‘‘Improving image generation with better captions,’’ 9 2023. [Online]. Available: https://cdn.openai.com/papers/dall-e-3.pdf
[87] W. Wu, Z. Li, Y. He, M. Z. Shou, C. Shen, L. Cheng, Y. Li, T. Gao, D. Zhang, and Z. Wang, ‘‘Paragraph-to-image generation with information-enriched diffusion model,’’ arXiv.org, 2023.
[88] W. Li, X. Xu, X. Xiao, J. Liu, H. Yang, G. Li, Z. Wang, Z. Feng, Q. She, Y. Lyu, and H. Wu, ‘‘Upainting: Unified text-to-image diffusion generation with cross-modal guidance,’’ 10 2022. [Online]. Available: https://arxiv.org/abs/2210.16031v3
[89] R. Ganz and M. Elad, ‘‘Clipag: Towards generator-free text-to-image generation,’’ 6 2023. [Online]. Available: https://arxiv.org/abs/2306.16805v2
[90] Y. Li, H. Liu, Q. Wu, F. Mu, J. Yang, J. Gao, C. Li, and Y. J. Lee, ‘‘Gligen: Open-set grounded text-to-image generation,’’ Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2023-June, pp. 22511–22521, 1 2023. [Online]. Available: https://arxiv.org/abs/2301.07093v2
[91] Y. Li, H. Wang, Q. Jin, J. Hu, P. Chemerys, Y. Fu, Y. Wang, S. Tulyakov, and J. Ren, ‘‘Snapfusion: Text-to-image diffusion model on mobile devices within two seconds,’’ 6 2023. [Online]. Available: https://arxiv.org/abs/2306.00980v2
[92] L. Zhang, A. Rao, and M. Agrawala, ‘‘Adding conditional control to text-to-image diffusion models,’’ 2 2023. [Online]. Available: https://arxiv.org/abs/2302.05543v2
[93] W. Zhao, Y. Rao, Z. Liu, B. Liu, J. Zhou, and J. Lu, ‘‘Unleashing text-to-image diffusion models for visual perception,’’ 3 2023. [Online]. Available: https://arxiv.org/abs/2303.02153v1
[94] J. Hu, X. Han, X. Yi, Y. Chen, W. Li, Z. Liu, and M. Sun, ‘‘Efficient cross-lingual transfer for chinese stable diffusion with images as pivots,’’ 5 2023. [Online]. Available: https://arxiv.org/abs/2305.11540v1