TEXT TO IMAGE GENERATOR
Submitted By
Gayatri Palai and Bellana Rohit
2024-25
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING
CERTIFICATE
This is to certify that the project entitled “Text To Image Generator” has
been carried out by Gayatri Palai (Regd. No. 2201298331) and Bellana
Rohit (Regd. No. 2201298311) under my guidance and supervision and is
accepted in partial fulfillment of the requirements for the degree of Master's in
Computer Science and Engineering. The report, which is based on the
candidates' own work, has not been submitted elsewhere for a degree. To the
best of my knowledge, Mrs. Gayatri Palai and Mr. Bellana Rohit have good
moral character and decent behavior.
Dr. Sujit Kumar Panda          Asst. Prof. Smruti Smaraki Sarangi          Asst. Prof. Suchisnita Nayak
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING
DECLARATION
We, Gayatri Palai and Bellana Rohit, hereby declare that this written
submission represents our ideas in our own words, and where others' ideas or
words have been included, they have been adequately cited and referenced to
the original sources. We also declare that we have adhered to all principles of
academic honesty and integrity and have not misrepresented, fabricated, or
falsified any idea/data/fact/source in our submission. We understand that any
violation of the above will be cause for disciplinary action by the Institute and
can also evoke penal action from the sources which have thus not been properly
cited or from whom proper permission has not been taken when needed.
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING
BONAFIDE CERTIFICATE
This is to certify that the project work titled “Text To Image Generator”
is a bonafide record of the work done by Mrs. Gayatri Palai (2201298331) &
Mr. Bellana Rohit (2201298311) in partial fulfillment of the requirements for
the award of the degree of B.Tech in CSE from GIFT Autonomous,
Bhubaneswar, Odisha.
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING
ACKNOWLEDGEMENTS
We are very grateful and wish to record our indebtedness to Dr. Trilochan
Sahu, Principal, Gandhi Institute For Technology (GIFT), Bhubaneswar, for
his active guidance and interest in this project work.
We would also like to thank Dr. Sujit Ku. Panda, Head, Department of
Computer Science and Engineering, for his continued drive for better quality
in everything, which allowed us to carry out our project work.
We would also like to take the opportunity to thank Asst. Prof. Smruti Smaraki
Sarangi for her help and cooperation in this project work.
Lastly, words fall short in expressing our gratitude to our parents and all the
professors, lecturers, technical and official staff, and friends for their
cooperation, constructive criticism, and valuable suggestions during the
preparation of the project report.
ABSTRACT
Synthetic content generation using machines is a trending topic in the field of
deep learning, and it is an extremely difficult task even for state-of-the-art ML
algorithms. The upside of using deep learning for this is that it can generate
content that does not yet exist. In the recent past, Generative Adversarial
Networks (GANs) have shown great promise when it comes to generating
images, but they are difficult to train and to condition on any particular input,
which acts as a downside. Nevertheless, they have tremendous applications in
generating content in an unsupervised learning approach, such as generating
video, increasing the resolution of images, or generating images from text. In
this project we look at generating 64×64 images on the fly using text as input.
The images generated will be unique in the sense that they do not already
exist, and in doing so we will improve upon existing architecture models and
try to reduce the difficulties that come with training GAN models, aiming for
reduced training time and better convergence of the model.
The final deliverable will be a web app where a user can input a text
description and a matching image is generated.
INDEX

S.NO  CONTENTS
      ABSTRACT
1.    INTRODUCTION
2.    SYSTEM DEVELOPMENT
3.    PERFORMANCE ANALYSIS
4.    ADVANTAGES AND DISADVANTAGES
5.    FINAL WEB-APP PRODUCTION
6.    CONCLUSION
1. INTRODUCTION
1.1 INTRODUCTION
For a human mind it is very easy to think of new content. What if someone asks
you to “draw a flower with blue petals”? It is very easy for us to do that, but
machines process information very differently. Just understanding the structure
of the above sentence is a difficult task for them, let alone generating something
based on that description. Automatic synthetic content generation is a field that
was explored in the past and was discredited because, at that time, neither the
algorithms nor enough processing power existed to solve the problem.
However, the advent of deep learning started changing those earlier beliefs.
The tremendous power of neural networks to capture features even in the most
humongous of datasets makes them a very viable candidate for automatic
content generation. Another milestone was achieved when Ian Goodfellow
proposed generative adversarial networks in 2014. GANs are a kind of deep
learning architecture that can produce content from random noise. What is
even more unique about GANs is that the content they create represents the
dataset on which they are trained, yet it is totally unique in some way or the
other. Generating an image from a text-based description is the aspect of
generative adversarial networks that we will focus upon. Since GANs follow
an unsupervised learning approach, we have modified them to take an input as
a condition and generate based on that condition. This can form the base for a
large number of applications, such as synthetic audio generation like the voices
used in Siri or Assistant, or video content generation from just scripts; imagine
entire movies made from just the script. These are uses that many companies
are researching. Modifying GANs and applying conditions on them is not
limited to generating images: we can use them to create passwords that are
very hard to crack, and numerous similar applications.
Deep Learning and Content Generation
Deep learning is a field that relies completely on various flavours of neural
networks to extract insights from data and find patterns within it. While it has
been shown to be very successful in tasks like image classification (on some
datasets even beating human-level accuracy by a large margin) and time series
analysis (where so many factors are involved that it becomes difficult for a
human to take them all into account), a completely different aspect of it has
started to be explored.
The big question being:
"Can we use deep learning to generate content?"
As we know, neural networks can extract the features of a dataset that they
have been trained upon; the goal becomes using those features to create new
data points that do not belong to the dataset itself. A GAN approaches this
with two networks:
1) Generator:
The generator takes some input (in our case, conditioned on the text
embedding) and, through a series of upsampling or deconvolution operations,
produces an image. On its own it has no notion of what a "good" image is;
that feedback comes from the discriminator.
2) Discriminator:
The generator alone will just generate something random, so the discriminator
gives the generator guidance on what images it should create. The
discriminator is nothing more than a simple convolutional neural network that
takes in an image as input and determines whether the image came from the
original dataset or was generated by the generator. Simply put, given an image
it determines whether it is real or fake (synthetically generated by the
generator).
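As a minimal sketch (assumed layer sizes, not the project's exact network),
such a discriminator for 64×64 RGB images could look like this in PyTorch:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """A plain CNN that scores a 64x64 RGB image as real or fake."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            # 3 x 64 x 64 -> 64 x 32 x 32
            nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            # 64 x 32 x 32 -> 128 x 16 x 16
            nn.Conv2d(64, 128, 4, 2, 1),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2, inplace=True),
            # 128 x 16 x 16 -> 256 x 8 x 8
            nn.Conv2d(128, 256, 4, 2, 1),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.2, inplace=True),
        )
        # Collapse the 8x8 feature map to a single real/fake logit.
        self.classifier = nn.Conv2d(256, 1, kernel_size=8)

    def forward(self, x):
        score = self.classifier(self.features(x))  # shape: (N, 1, 1, 1)
        return torch.sigmoid(score.view(-1))       # probability "real"

# Quick shape check on a dummy batch.
d = Discriminator()
print(d(torch.randn(2, 3, 64, 64)).shape)  # torch.Size([2])
```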
1.2 PROBLEM STATEMENT
Generating images from text is a very difficult problem that can be approached
using generative adversarial networks, and a solution would be extremely
useful for content creators, who could type a description and have that type of
content generated automatically, saving them a lot of money and work.
Imagine thinking of a description and having to draw something that matches
it in a meaningful way; it is a difficult task even for humans. But deep learning
can understand the underlying structure of the content and might be able to
generate it automatically, thereby eliminating the need for domain expertise.
GANs, despite having all these upsides for content generation, are very
difficult to train, take a long time to converge, and are unstable during the
training process. In this project we also try to tackle these problems by
modifying the underlying structure of the GAN model.
1.3 OBJECTIVE
The main objective of this project is to develop a web app in which text can be
input and an image matching the description is output, and in doing so to
improve upon the generator architecture of generative adversarial networks.
By modifying the input to the generator and applying conditions on it, we can
create a model that generates images not from noise but from a controlled
input. In our case, the controlled input is the text, embedded after being passed
through another neural network.
1.4 METHODOLOGY
We first start by downloading the Oxford-102 dataset, which contains 102
different categories of flowers along with an annotation for each image in the
form of a text description.
After this, we download one more dataset, the CUB dataset, which contains
200 bird species with almost 11,700 images.
Next, we begin importing all the packages and sub-packages and splitting the
data into training, validation, and testing sets. The following packages and
libraries are used to process the dataset and build the architectures:
• NumPy
• PyTorch
• OpenCV
• Flask
We first start by downloading and pre-processing the dataset. During the pre-
processing phase we convert the text into embeddings and normalize the
images so they are ready to be passed to the respective models. We then build
our customised generator model and use a standard pre-trained model as the
discriminator. After the model creation, we create a training script and adopt
some best practices in the field of deep learning to train the model with
stability using our customised PyTorch trainer. The final task is to wrap the
trained model into a Flask web app so that testing becomes easy.
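As one illustrative piece of this pipeline, splitting the dataset with PyTorch
might look like the following sketch; the `TensorDataset` of random tensors is
a stand-in for the real Oxford-102/CUB loaders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in for the real Oxford-102 / CUB dataset objects (dummy data).
dataset = TensorDataset(torch.randn(1000, 3, 64, 64), torch.randn(1000, 128))

# 80% train / 10% validation / 10% test split.
n = len(dataset)
n_train, n_val = int(0.8 * n), int(0.1 * n)
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n - n_train - n_val],
    generator=torch.Generator().manual_seed(42),  # reproducible split
)

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64)
test_loader = DataLoader(test_set, batch_size=64)
```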
1.4 FUTURE SCOPE
As AI and machine learning technologies continue to develop, the capabilities
of AI image generators will undoubtedly improve and expand. Here are some
potential future developments and innovations that can take AI image
generation to new heights:
Improved algorithms:
2. SYSTEM DEVELOPMENT
2.1 TEXT DESCRIPTION TO EMBEDDING
The very first step involved in training our model is to convert the text to an
embedding. Neural networks work on vectors and numbers and essentially
cannot do anything if the input format is text. So, the very first thing we do is
pre-process the text, removing unnecessary spaces and improving semantics
using standard text pre-processing libraries like spaCy, and convert the text
description into a vector of numbers. This is then given as input to a
pre-trained Long Short-Term Memory (LSTM) network, and the output of the
last layer is taken, which is essentially the text embedding that we are looking
for.
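A toy sketch of this step follows; a whitespace tokenizer and an untrained
`nn.LSTM` stand in for the spaCy pipeline and the project's pre-trained
network, and the vocabulary and dimensions are placeholders:

```python
import torch
import torch.nn as nn

# Hypothetical toy vocabulary; the project uses spaCy preprocessing instead.
vocab = {"<unk>": 0, "a": 1, "flower": 2, "with": 3, "blue": 4, "petals": 5}

def encode(text):
    """Map a cleaned text description to a tensor of token ids."""
    ids = [vocab.get(w, vocab["<unk>"]) for w in text.lower().split()]
    return torch.tensor(ids).unsqueeze(0)          # shape: (1, seq_len)

embed = nn.Embedding(len(vocab), 32)               # token id -> 32-d vector
lstm = nn.LSTM(input_size=32, hidden_size=128, batch_first=True)

tokens = encode("a flower with blue petals")
out, (h_n, c_n) = lstm(embed(tokens))
text_embedding = h_n[-1]                           # last hidden state, (1, 128)
print(text_embedding.shape)                        # torch.Size([1, 128])
```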
Why Word Embeddings
Why exactly do we need to convert our sentence into an embedding and not
just a one-hot encoded vector? To understand that, let us take a very simple
example: in one case we represent the words as one-hot encoded vectors, and
in the other we use an embedding matrix.
One-hot vectors do not have any kind of relation among them that a model can
learn, and it becomes very difficult for the model to learn when it cannot even
understand the relation between words. Now let us represent them as
embeddings.
When represented like this, the embedding for each word has a meaning.
Representing these in Euclidean space, we will see that the two fruits are close
to each other, while the king and queen are very similar to each other in many
respects except one, which could be gender. It is not pre-decided what features
the model should learn; during training the model itself decides the values that
reduce the loss, and in the process it learns the embedding that makes the most
sense to it.
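The following toy sketch makes the contrast concrete; the vectors are
hand-picked for illustration, nothing here is learned:

```python
import torch
import torch.nn.functional as F

# One-hot vectors: every pair is orthogonal, so similarity is always 0.
apple_oh, mango_oh = torch.eye(4)[0], torch.eye(4)[1]
print(F.cosine_similarity(apple_oh, mango_oh, dim=0))  # tensor(0.)

# Hand-picked "embeddings" (features: fruitness, royalty, gender).
apple = torch.tensor([0.9, 0.0, 0.0])
mango = torch.tensor([0.8, 0.1, 0.0])
king  = torch.tensor([0.0, 0.9, 0.9])
queen = torch.tensor([0.0, 0.9, 0.1])

print(F.cosine_similarity(apple, mango, dim=0))  # high: both fruits
print(F.cosine_similarity(king, queen, dim=0))   # high: differ mainly in "gender"
print(F.cosine_similarity(apple, king, dim=0))   # low: unrelated
```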
A Long Short-Term Memory network, or LSTM, is a type of recurrent neural
network that is very good at processing long sentences because of its ability to
learn long-term dependencies within the text by modifying the weights of its
gate cells. RNNs typically suffer from the problem that they cannot remember
the proper dependencies when processing long text. To illustrate the problem,
consider a very simple exercise: you are given a series and have to predict the
next number.
Example 1) 2 -> 4 -> 6 -> 8
Example 2) 2 -> 4 -> 8
Three numbers are common to both series; we know the first is the multiples
of 2 while the second is the powers of 2. But when we pass the numbers to a
model, the last input it gets in both cases is 8, so how should the model
distinguish between the two series? It essentially needs the previous pattern
information combined with the current input to output the correct result. When
the sequence gets long, an RNN fails to factor in the previous information
properly, as it has no proper mechanism to deal with degrading gradients, and
in the end it is unable to do any kind of learning.
This is the problem that LSTMs were built to solve. An LSTM has additional
gates that help it properly retain information throughout the input. However,
not all information is important all the time: as we go deeper into the sequence,
the chance that the next output depends on a very old input becomes small,
and that is where the forget gate of the LSTM comes into action. During
training, backpropagation adjusts the gate weights so that, at each step of a
sequence, the LSTM can determine what kind of inputs are important for
predicting the next word or element.
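As a toy illustration (not code from the project), an LSTM's hidden state
carries the whole history of the sequence, so the two series ending in 8 remain
distinguishable even though the last input is identical:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=1, hidden_size=8, batch_first=True)

# Same final input (8), different histories.
series_a = torch.tensor([[[2.], [4.], [6.], [8.]]])  # multiples of 2
series_b = torch.tensor([[[2.], [4.], [8.]]])        # powers of 2

_, (h_a, _) = lstm(series_a)
_, (h_b, _) = lstm(series_b)

# The hidden states differ, so a downstream layer can tell the series apart.
print(torch.allclose(h_a, h_b))  # False
```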
2.2 PRE-PROCESSING THE IMAGES
We need to properly process the data before passing it to the model, as this
will determine the level of accuracy that we can reach. Instead of assuming a
mean of 0 and a standard deviation of 1, we can easily compute the mean and
standard deviation for each channel. For the current dataset the mean comes
out to be [0.484, 0.451, 0.406] and the standard deviation
[0.231, 0.244, 0.412].
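With torchvision, this per-channel normalization can be expressed as in the
sketch below (the report does not show the exact transform code):

```python
import torch
from torchvision import transforms

# Per-channel statistics reported for the dataset.
normalize = transforms.Normalize(
    mean=[0.484, 0.451, 0.406],
    std=[0.231, 0.244, 0.412],
)

img = torch.rand(3, 64, 64)   # dummy image tensor in [0, 1]
img_norm = normalize(img)     # (img - mean) / std, applied per channel
```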
Data Augmentation
Data augmentation helps us create more data to feed into the model and helps
it generalise well by letting it see the data in various orientations. We create
our own transformations using NumPy. Here are some of the augmentations
that we implement.
Combining random flips and random rotations, we arrive at the 8 dihedral
transformations, which can be applied to any number of channels and any kind
of dataset. As could be seen in the original code snippet, we first create a
function which takes as input x, a tensor (the matrix representation of our
image), and a mode. We do not want to apply these image augmentations in
validation mode when testing the entire pipeline; in training mode, we need to
apply these transforms randomly. We use Python's default random number
generator to determine which transformations are randomly applied to the
image.
To flip the image horizontally, we first convert our tensor into a NumPy array
and then use NumPy's fliplr function; flipud flips the array vertically. To rotate
the image, we generate a random number k between 0 and 3, which determines
how many 90-degree rotations of the array we perform. The 8 dihedral
transformations can be formed from these steps; a reconstruction of the
transform is sketched below.
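Since the referenced snippet did not survive into this copy, the following is a
hedged reconstruction of the transform as described: it takes a tensor x and a
mode, returns the image unchanged outside training, and otherwise randomly
composes flips and 90-degree rotations:

```python
import random
import numpy as np
import torch

def dihedral_transform(x: torch.Tensor, mode: str = "train") -> torch.Tensor:
    """Randomly apply a dihedral transform to a CHW image tensor.

    In validation/testing mode the image is returned unchanged.
    """
    if mode != "train":
        return x
    arr = x.numpy()
    if random.random() < 0.5:
        arr = np.flip(arr, axis=2)        # horizontal flip (fliplr on an HWC image)
    if random.random() < 0.5:
        arr = np.flip(arr, axis=1)        # vertical flip (flipud on an HWC image)
    k = random.randint(0, 3)              # how many 90-degree rotations
    arr = np.rot90(arr, k, axes=(1, 2))   # rotate in the H-W plane
    return torch.from_numpy(arr.copy())   # .copy() clears negative strides

img = torch.rand(3, 64, 64)
aug = dihedral_transform(img)             # augmented only in training mode
```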
2.3 CREATING CUSTOMISED GENERATOR MODEL
The way a standard generator model works is that it takes some input and,
through a series of upsampling or deconvolution operations, creates the image.
The only issue is that, while generating the final output, the only information
it takes into account is that from the previous layer, which is ideal for tasks
like classification and bounding-box regression. But when dealing with image
generation, we should also keep the original input, without much processing,
in play alongside the information in the last layer, as this not only helps the
gradients flow better but also helps the model converge faster.
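The original snippet did not survive into this copy, so what follows is a hedged
reconstruction of the generator described next; the layer sizes, the single
upsampling stage, and the 1×1 channel-matching convolution on the skip path
are assumptions, not the project's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    """Sketch of the customised generator: one upsampling stage plus a skip
    connection that re-injects the (upsampled) embedding at the output. The
    real model would stack several such stages to reach 64x64."""

    def __init__(self, embed_channels: int = 128):
        super().__init__()  # inherit nn.Module so layers register properly
        c = embed_channels // 2
        # Double height/width, reduce along the channel dimension.
        self.up = nn.ConvTranspose2d(embed_channels, c, kernel_size=4,
                                     stride=2, padding=1)
        self.drop = nn.Dropout2d(0.3)                    # regularization
        self.conv1 = nn.Conv2d(c, 2 * c, 3, padding=1)   # double channels
        self.conv2 = nn.Conv2d(2 * c, c, 3, padding=1)   # reduce back to c
        # Assumed 1x1 conv so the embedding's channels match before the add.
        self.match = nn.Conv2d(embed_channels, c, kernel_size=1)
        self.to_img = nn.Conv2d(c, 3, 3, padding=1)      # project to RGB

    def forward(self, z):
        x = self.drop(F.relu(self.up(z)))
        h = F.relu(self.conv2(F.relu(self.conv1(x))))
        # Upsample the original embedding and add it: output is f(x) + x.
        skip = self.match(F.interpolate(z, size=h.shape[-2:]))
        return torch.tanh(self.to_img(h + skip))         # image in [-1, 1]

g = Generator()
z = torch.randn(2, 128, 1, 1)   # text embedding reshaped as a 1x1 feature map
print(g(z).shape)               # torch.Size([2, 3, 2, 2]) after one stage
```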
In the code snippet above we create our customised generator model from
scratch using PyTorch. We start by declaring the class and then initializing the
architecture within it. To properly use PyTorch's built-in neural network
layers, we call super() to inherit the properties of the base class. We start with
a ConvTranspose2d, which takes in the input embedding, doubles it along the
height and width, and reduces it along the channel dimension. We add a
dropout layer to increase regularization, which not only counters overfitting
on the training set but also helps the model generalise well on the input
features. This is followed by two convolutional blocks, one doubling along the
channel dimension and the other taking that output and reducing it back to the
original channel dimensions, without any change to the other dimensions. This
was done because, in our practical implementations, this trick worked out
well. Now comes the major step of producing the final image. As stated
earlier, we also need to add in the original embedding directly. The issue is
that the embedding has different dimensions altogether; to resolve that, we use
a simple upsampling operation to bring the embedding to the proper
dimensions before adding it to the output of the last layer. In terms of
equations: let the input be x and the desired output be h(x). Instead of forcing
the network to learn h(x) directly, its layers learn the residual f(x) = h(x) − x,
and the final output is produced as f(x) + x, with x upsampled to match.
2.4 TRAINING THE MODEL
Step 1: Train the discriminator on the original dataset and some random noise
to give the discriminator an edge in identifying real images from random
noise. This step is very important at the beginning: if the discriminator does
not already know to some extent what the real dataset should look like, the
loss function applied to the generator will be lower than it should be, which
slows down the initial training. The training eventually stabilises even if we do
not train the discriminator first, but that takes a lot of time; by doing this, we
decrease the training time of the model.
Step 2: After the discriminator has been trained for a while, we make a
forward pass through the modified generator model, initially getting a random
image and a high loss, which is then backpropagated through the entire
network to update and fine-tune its internal parameters. The generated images
are stored in a temporary variable and passed to the discriminator in its next
phase. There is a chance that our GAN does not find the equilibrium between
the discriminator and the generator.
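A compressed sketch of this two-step scheme is given below; the tiny linear
models, the binary cross-entropy loss, and the optimizer settings are
placeholders standing in for the real models and training script:

```python
import torch
import torch.nn as nn

# Tiny stand-ins so the sketch runs; the real models are the ones built above.
G = nn.Sequential(nn.Linear(128, 64), nn.Tanh())    # embedding -> "image"
D = nn.Sequential(nn.Linear(64, 1), nn.Sigmoid())   # "image" -> p(real)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(32, 64)                          # a batch of real data
emb = torch.randn(32, 128)                          # text embeddings

# Step 1: give the discriminator a head start on real vs. noise.
for _ in range(5):
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(32, 1)) + \
             bce(D(torch.randn(32, 64)), torch.zeros(32, 1))
    loss_d.backward()
    opt_d.step()

# Step 2: alternate generator and discriminator updates.
for _ in range(5):
    fake = G(emb)
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(32, 1))  # G wants D to say "real"
    loss_g.backward()
    opt_g.step()

    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(32, 1)) + \
             bce(D(fake.detach()), torch.zeros(32, 1))
    loss_d.backward()
    opt_d.step()
```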
The accuracy graph of the discriminator (not reproduced here) shows the
discriminator at 100% accuracy, meaning it perfectly identifies whether an
image is real or fake and the generator never manages to fool it. This is the
most common failure mode, and it is called convergence failure.
3 PERFORMANCE ANALYSIS
EVALUATION:
It is not easy to evaluate the performance of a generative model with any
single metric; in the end, humans mostly have to decide whether the generated
content is good and whether it holds any particular meaning. However, we can
judge the discriminator model on its ability to distinguish real images from
fake ones. Compared to ordinary convolutional models, where high accuracy
means better results, the opposite holds for generative models: if the
discriminator has very high accuracy in distinguishing real images from fake
ones, that implies the generator has not done a good job of creating images
that represent the dataset well. In a perfect equilibrium the discriminator
should have an accuracy of 50%, i.e., it has to take a random guess to
determine whether a generated image is fake, implying the generator has
created images so good that they are indistinguishable from the originals. The
closer the accuracy is to 50%, the better the generator has done at creating
images.
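As a small sketch, this check can be computed directly from the
discriminator's outputs; the random tensors below stand in for real
predictions:

```python
import torch

def discriminator_accuracy(p_real: torch.Tensor, p_fake: torch.Tensor) -> float:
    """Fraction of images the discriminator classifies correctly (threshold 0.5)."""
    correct = (p_real > 0.5).sum() + (p_fake <= 0.5).sum()
    return correct.item() / (len(p_real) + len(p_fake))

# Dummy predictions standing in for D's outputs on a validation batch.
p_real = torch.rand(64)   # D(x) for real images
p_fake = torch.rand(64)   # D(G(z)) for generated images

acc = discriminator_accuracy(p_real, p_fake)
print(f"accuracy {acc:.2f}; near 0.50 indicates a well-matched generator")
```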
The loss graph (not reproduced here) shows the discriminator loss for real
images in blue, the discriminator loss for fake images in orange, and the
generator loss for the generated images in green. This is the expected loss
during training, and it stabilizes after around 100 to 300 epochs. The
discriminator loss for the real and fake images is approximately 0.5, and the
generator loss for the generated images is between 0.5 and 0.7.
Accuracy graph:
The accuracy graph (not reproduced here) shows the discriminator's accuracy
for real images in blue and for fake images in orange. This GAN model
stabilizes within 100 to 300 epochs, after which it gives an accuracy of
approximately 70% to 80% and remains stable.
4 ADVANTAGES AND DISADVANTAGES:
4.1 ADVANTAGES:
1. Enhanced Efficiency and Productivity:
3. Accessibility and Affordability:
4.2 DISADVANTAGES
1. Lack of Authenticity and Originality:
One of the main concerns with AI-generated photos is the potential lack of
authenticity and originality. While AI algorithms excel at analysing patterns
and generating visually pleasing images, they may lack the human touch and
personal expression found in traditional photography. AI-generated photos can
sometimes appear generic or formulaic, lacking the unique perspective and
creativity that photographers bring to their work. This raises questions about
the uniqueness and artistic value of AI-generated visuals.
2. Ethical Considerations and Bias:
5 FINAL WEB-APP PRODUCTION
We develop a web app using Flask that presents the user with a page and an
option to input text and choose a model from the various inference models that
we have trained. On clicking Generate Image, the request is processed in the
Python backend, a resultant image is created using the model, and we hit
another Flask endpoint where the generated image is displayed. The following
endpoints have been created in our Flask application:
1) Home Route:
At the home endpoint, we redirect to the page where the user can provide an
input, if the model has been loaded successfully without any error. If the
chosen model cannot be loaded properly, we redirect to a route describing the
error.
2) Generate Route:
After the user enters a text, it is pre-processed into a vector and passed to our
LSTM model, which generates the embedding. The embedded vector is then
passed to the loaded generator model, and the resulting image is saved to a
location using the timestamp as the file name.
3) Result Route:
After the image has been successfully generated, we redirect the application
to a page which displays the generated image.
4) Error Route:
If the model cannot be loaded or image generation fails, the application
redirects here and a page describing the error is displayed.
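A skeleton of these four routes might look as follows; the `embed` and
`save_image` helpers, the template names, and the save path are hypothetical
placeholders, not the project's actual code:

```python
import time
from flask import Flask, redirect, render_template, request, url_for

app = Flask(__name__)
model = None                      # set by a hypothetical load_model() at startup

def embed(text):                  # stub: the LSTM text-to-embedding step
    raise NotImplementedError

def save_image(image, path):      # stub: write the generated image to disk
    raise NotImplementedError

@app.route("/")
def home():
    # Show the input page only if the model loaded without error.
    if model is None:
        return redirect(url_for("error"))
    return render_template("input.html")

@app.route("/generate", methods=["POST"])
def generate():
    text = request.form["description"]
    image = model(embed(text))                # embedding -> generator
    path = f"static/{int(time.time())}.png"   # timestamp as the file name
    save_image(image, path)
    return redirect(url_for("result", img=path))

@app.route("/result")
def result():
    return render_template("result.html", img=request.args["img"])

@app.route("/error")
def error():
    return render_template("error.html")      # describes what went wrong
```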
6 CONCLUSION
In this project we have created a web app that can take in a text description of
a flower or bird and generate images based on it. While doing so, we have
modified the generator architecture in such a way that we have reduced the
training time of the GAN.
In this report, a brief review of image generation is also presented. The
existing image generation approaches have been categorized based on the data
used as input for generating new images, including images, hand sketches,
layouts, and text. In addition, we presented the existing works on conditioned
image generation, a type of image generation in which a reference is exploited
to generate the final image. An effective image generation method is tied to
the dataset used, which must be a large-scale one; for that reason, we
summarized popular benchmark datasets used for image generation
techniques. The evaluation metrics for judging the various methods were
presented, and based on these metrics, as well as the datasets used for training,
a tabulated comparison was performed. Finally, a summary of the current
image generation challenges was presented.