AI Image Generator
Bachelor of Engineering in
Computer Engineering
By
Name Roll no.
Yash Ahire 02
Prashant Bhalerao 04
Chetan Parse 42
Jishan Shaikh 47
Supervisor
Prof. (Deepa Athawale)
Technology Personified
Department of Computer Engineering
Innovative Engineers’ and Teachers’ Education Society’s
Bharat College of Engineering
Badlapur - 421504
(Affiliated to University of Mumbai)
(2023-24)
CERTIFICATE
This is to certify that the project titled
Yash Ahire
Prashant Bhalerao
Chetan Parse
Jishan Shaikh
and is submitted in partial fulfilment of the requirements for the degree of
Bachelor of Engineering in Computer
To the University of Mumbai
Supervisor External
Prof. (Deepa Athawale) Prof.( )
ABSTRACT
1. INTRODUCTION
2. SYSTEM DEVELOPMENT
3. PERFORMANCE ANALYSIS
4. ADVANTAGES AND DISADVANTAGES
6. CONCLUSION
7. REFERENCES
1.1 INTRODUCTION
For a human mind it is very easy to think of new content. If someone asks you to
“draw a flower with blue petals”, it is very easy for us to do that, but machines process
information very differently. Just understanding the structure of that sentence is a
difficult task for them, let alone generating something based on the description. Automatic
synthetic content generation is a field that was explored in the past and discredited,
because at that time neither the algorithms nor the processing power existed to
solve the problem. However, the advent of deep learning changed these earlier
beliefs. The tremendous power of neural networks to capture features even in
humongous datasets makes them a very viable candidate for automatic content
generation. Another milestone was reached when Ian Goodfellow proposed generative
adversarial networks in 2014. GANs are a deep learning architecture that can
produce content from random noise. What is even more unique about GANs is that the
content they create represents the dataset on which they are trained, yet is
totally unique in some way or the other. Generating an image from a text-based description
is the aspect of generative adversarial networks that we will focus upon. Since GANs
follow an unsupervised learning approach, we have modified them to take an input as a
condition and generate based on that condition. This can form the base for a large number
of applications, such as synthetic audio generation (like the voices used in Siri or
Assistant) and video content generation from just scripts; imagine entire movies made from
just the script. These are uses that many companies are actively researching. Modifying GANs
and applying conditions on them is not limited to generating images: we can use them to create
passwords that are very hard to crack, and numerous similar applications.
We first start by downloading the Oxford 102 dataset, which contains 102 different
categories of flowers along with an annotation for each image in the form of a text
description.
• NumPy
• PyTorch
• OpenCV
• Flask
We first start by downloading and pre-processing the dataset. During the pre-processing
phase we convert the text into embeddings and normalize the images so they are ready to be
passed to the respective models. We then build our customised generator model and use a
standard pre-trained model as the discriminator. After the model creation we write a
training script and adopt some best practices from the field of deep learning to train the
model stably using our customised PyTorch trainer. The final task is to wrap the trained
model in a Flask web app so that testing becomes easier.
Improved algorithms:
As researchers and developers identify new techniques and approaches to improve AI image
generation, the algorithms employed by these tools will become more advanced and
efficient. It is anticipated that future versions of AI image generators will generate more
realistic and high-quality images, with fewer artifacts and more precise fine details.
Existing AI image generators still struggle with generating diverse and coherent results
consistently. In other words, they sometimes lack the ability to represent a broader range of
styles and may generate images with inconsistencies or inaccuracies. In the future, AI image
generators will likely produce more diverse and consistent images while reducing these
common issues, leading to better alignment with users' expectations and requirements.
Future AI image generators are likely to seamlessly integrate with various existing design
and development tools, enabling creatives to work more efficiently and add AI-powered
image generation functionalities to their workflows. This will remove any significant effort
required to implement AI image generation in applications, such as utilizing the capabilities
of tools like App Master platform for backend, web, and mobile applications.
The very first step in training our model is to convert the text to an embedding.
Neural networks work on vectors and numbers and cannot essentially do anything if the
input format is text. So, the very first thing we do is pre-process the text, removing
unnecessary spaces and improving semantics using standard text pre-processing libraries
like spaCy. The text description is then converted into a vector of numbers, which is
given as input to a pre-trained Long Short-Term Memory (LSTM) network, and the output of
the last layer is taken out; this is essentially the embedding that we are looking for.
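A minimal sketch of this step in PyTorch, using a toy vocabulary (the vocabulary, dimensions, and untrained weights here are illustrative assumptions; the report's pipeline uses spaCy pre-processing and a pre-trained LSTM):

```python
import torch
import torch.nn as nn

# Toy vocabulary; a real pipeline builds this from the Oxford 102 captions.
vocab = {"<pad>": 0, "a": 1, "flower": 2, "with": 3, "blue": 4, "petals": 5}

def encode(text):
    """Lowercase, strip extra spaces, and map tokens to indices."""
    return torch.tensor([[vocab[w] for w in text.lower().split()]])

embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=32)
lstm = nn.LSTM(input_size=32, hidden_size=128, batch_first=True)

tokens = encode("a flower with blue petals")
out, (h_n, c_n) = lstm(embed(tokens))
sentence_embedding = h_n[-1]     # final hidden state: the text embedding
print(sentence_embedding.shape)  # torch.Size([1, 128])
```

The final hidden state summarises the whole sentence, which is what gets fed to the generator as the condition.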
Why exactly do we need to convert our sentence into an embedding rather than just a one-hot
encoded vector? To understand that, let us take a very simple example where we once
represent the words as one-hot encoded vectors and once using an embedding matrix.
1. One-hot vectors grow with the size of the vocabulary and are extremely sparse, with a
single 1 and zeros everywhere else.
2. Those vectors do not have any kind of relation among them that a model can learn, and
it becomes very difficult for the model to learn when it cannot even understand the
relations between words. Now let us represent them in an embedding.
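A tiny illustration of the difference (the embedding values below are hand-picked for the example, not learned):

```python
import numpy as np

words = ["rose", "tulip", "car"]

# One-hot: every pair of distinct words is orthogonal -> no notion of similarity.
one_hot = np.eye(len(words))
print(one_hot[0] @ one_hot[1])  # 0.0 for every distinct pair

# Toy embedding matrix: similar words get nearby vectors,
# so a model can exploit the geometry.
embedding = np.array([
    [0.9, 0.1],  # rose
    [0.8, 0.2],  # tulip
    [0.1, 0.9],  # car
])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(embedding[0], embedding[1]) > cosine(embedding[0], embedding[2]))  # True
```

With one-hot vectors, "rose" is exactly as far from "tulip" as from "car"; with an embedding, the model can see that flowers cluster together.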
Example 1) 2->4->6->8
Example 2) 2->4->8
Now in both series some numbers are common, and we know the first series consists of
multiples of 2 while the second consists of powers of 2. But when we pass the numbers to a
model, the last input it gets in both cases is 8, so how should the model distinguish
between the two series? It should essentially combine previous pattern information with the
current input to output the correct result. But when the sequence gets longer, an RNN fails
to factor in the previous information properly, as it has no proper mechanism to deal with
degrading gradients, and in the end it is unable to do any kind of learning. This is the
problem that LSTMs were built to solve. An LSTM has additional gates that help it properly
retain information throughout the input. However, not all information is important every
time. As we go deeper into the sequence, the chance that the next output depends on a very
old input becomes very small, and that is where the forget gate of the LSTM comes into
action. At every step of a sequence, the LSTM re-modifies the weights of its gates using
backpropagation. Put simply, this helps it determine what kinds of inputs are important at
the current step to predict the next word/element in the sequence. While the forget gate
determines how important every input seen earlier in the sequence is, the input gate helps
to decide and update what information to keep; using a combination of these, the LSTM is
able to retain information even in a long sentence and to overcome the problems that arise
with recurrent networks. The beauty of the LSTM is that even a very shallow LSTM model can
understand the structure of a sentence very well, due to the large number of parameters
that it has and its very unique configuration of the three gates.
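The gate mechanics described above can be written compactly. In standard LSTM notation (with \(\sigma\) the logistic sigmoid, \(\odot\) the element-wise product, and \([h_{t-1}, x_t]\) the concatenation of the previous hidden state with the current input):

```latex
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)          % forget gate
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)          % input gate
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)   % candidate cell state
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t % cell update
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)          % output gate
h_t = o_t \odot \tanh(c_t)                      % hidden state
```

The forget gate \(f_t\) scales down old cell content, the input gate \(i_t\) controls how much of the candidate enters the cell, and the output gate \(o_t\) decides what part of the cell state is exposed as the hidden state.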
We need to properly process the data before passing it to the model, as this will determine
the level of accuracy that we can reach. Instead of using a mean of 0 and a standard
deviation of 1, we can easily compute the mean and standard deviation for each channel. For
the current dataset the mean comes out to be [0.484, 0.451, 0.406] and the standard
deviation comes out to be [0.231, 0.244, 0.412].
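A sketch of the per-channel computation on a toy batch (random data stands in for the flower images; shapes are illustrative):

```python
import numpy as np

# Toy batch of images, shape (N, H, W, C); real code would load the flower dataset.
rng = np.random.default_rng(0)
images = rng.uniform(0, 1, size=(16, 64, 64, 3))

# Per-channel statistics over the whole batch (axes 0,1,2 keep the channel axis).
mean = images.mean(axis=(0, 1, 2))
std = images.std(axis=(0, 1, 2))

normalized = (images - mean) / std
print(normalized.mean(axis=(0, 1, 2)).round(6))  # ~[0, 0, 0]
print(normalized.std(axis=(0, 1, 2)).round(6))   # ~[1, 1, 1]
```

Subtracting the dataset's own mean and dividing by its own standard deviation gives each channel zero mean and unit variance, which is what stabilises training.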
Data Augmentation
Data augmentation will help us create more data to feed into the model and help it
generalise well by letting it see the data in various orientations. We create our own
transformations using NumPy. Combining random flips and random rotations, we have come up
with the 8 dihedral transformations, which can be applied to any number of channels and to
any kind of dataset. We start by creating a function which takes in an input x as a tensor
(the matrix representation of our image) and a mode. We do not want to apply these image
augmentations in validation mode; in training mode, we need to apply these transforms
randomly. We use Python's default random number generator to determine which transformation
is randomly applied to the image.
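A possible NumPy implementation of the scheme described above (function names are illustrative): the 8 dihedral transforms are the 4 rotations by 90°, each optionally followed by a horizontal flip, and the random choice is applied only in training mode:

```python
import random
import numpy as np

def dihedral(x, mode):
    """Apply one of the 8 dihedral transforms (4 rotations x optional flip).
    x: array of shape (H, W) or (H, W, C); mode: int in [0, 8)."""
    x = np.rot90(x, k=mode % 4)
    if mode >= 4:
        x = np.flip(x, axis=1)
    return x.copy()

def augment(x, train=True):
    """Randomly pick a transform in training mode; identity in validation."""
    return dihedral(x, random.randrange(8)) if train else x

# Sanity check: an asymmetric image yields 8 distinct results.
img = np.arange(9).reshape(3, 3)
variants = {dihedral(img, m).tobytes() for m in range(8)}
print(len(variants))  # 8
```

Because the transforms only rotate and flip, they work for any number of channels and never interpolate pixel values, so labels stay valid.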
The way a standard generator model works is that it takes in some input and, by a series of
up-sampling or deconvolution operations, creates the image. The only issue with that is
that while generating the final output it only takes into account the information from the
previous layer, which is ideal for tasks like classification and bounding-box regression.
But when dealing with image generation, we should also take into account the original input
constraints, without much processing, along with the information in the last layer, as this
will not only help the gradients flow better but also help the model converge faster.
We create our customised generator model from scratch using PyTorch: we start off by
declaring the class and then initialising its layers.
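A minimal sketch of such a generator in PyTorch, under assumed sizes (a 100-d noise vector, a 128-d text embedding, 16x16 output; the report's actual layer sizes may differ). The skip path upsamples the raw projected features and concatenates them with the deconvolved features, so the original input reaches the output layer with little processing:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=100, embed_dim=128):
        super().__init__()
        # Project noise + text embedding to a 4x4 feature map.
        self.project = nn.Linear(z_dim + embed_dim, 256 * 4 * 4)
        # Deconvolution path: 4x4 -> 8x8 -> 16x16.
        self.up1 = nn.Sequential(nn.ConvTranspose2d(256, 128, 4, 2, 1),
                                 nn.BatchNorm2d(128), nn.ReLU())
        self.up2 = nn.Sequential(nn.ConvTranspose2d(128, 64, 4, 2, 1),
                                 nn.BatchNorm2d(64), nn.ReLU())
        # Skip path: carry the raw projected features forward unprocessed.
        self.skip = nn.Upsample(scale_factor=4)
        self.to_img = nn.Conv2d(64 + 256, 3, 3, padding=1)

    def forward(self, z, text_emb):
        h = self.project(torch.cat([z, text_emb], dim=1)).view(-1, 256, 4, 4)
        u = self.up2(self.up1(h))  # deconvolved features, 16x16
        s = self.skip(h)           # skip connection, 16x16
        return torch.tanh(self.to_img(torch.cat([u, s], dim=1)))

g = Generator()
img = g(torch.randn(2, 100), torch.randn(2, 128))
print(img.shape)  # torch.Size([2, 3, 16, 16])
```

The tanh keeps outputs in [-1, 1], matching images normalised to that range.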
where F(x) = conv blocks + non-linearities. Instead of hoping that a stack of layers fits
the desired mapping directly, we can specify a residual mapping and let the model reduce
and optimise it, so as to bring the output closer to our desired mapping h(x).
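Concretely, this is the residual-learning formulation: the shortcut connection adds the input back, so the stacked layers only need to learn the (usually easier) residual rather than the full mapping:

```latex
y = F(x, \{W_i\}) + x, \qquad F(x) = h(x) - x
```

When the desired mapping is close to the identity, F(x) is close to zero, which is much easier to optimise, and the shortcut gives gradients a direct path to earlier layers.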
The training process of a generative adversarial network is a bit more complicated than
training a normal neural network, as it involves training the discriminator and the
generator in an alternating fashion.
Step 1: Train the discriminator on the original dataset and some random noise to give the
discriminator an edge in identifying real images from random noise. This step is very
important in the beginning: if the discriminator doesn't already know to some extent what
the real dataset looks like, the loss function applied to the generator will essentially
give a lower loss than it should, which slows down the initial training. Training
eventually stabilises even if we do not train the discriminator first, but that takes a lot
of time; by doing this we are decreasing the training time of the model.
Step 2: After the discriminator has been trained for a while, we make a forward pass
through the modified generator model, initially getting a random image and a high loss,
which is then backpropagated through the entire network in order to update and fine-tune
its internal parameters. The generated images are stored in a temporary variable and passed
on to the discriminator in its next phase. There is a chance that our GAN does not find the
equilibrium between the discriminator and generator: a graph showing the discriminator loss
(blue) and the generator loss (orange) both heading towards zero in the initial phase of
training indicates that the GAN is not stable.
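The two alternating steps can be sketched on toy 1-D data (the networks, sizes, and learning rates here are illustrative stand-ins, not the report's actual configuration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Tiny stand-ins: G maps 8-d noise to a scalar "sample"; D scores realness.
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(100):
    real = torch.randn(32, 1) + 3.0   # "real" data: samples from N(3, 1)
    noise = torch.randn(32, 8)

    # Step 1: train the discriminator on real data vs. generated samples.
    fake = G(noise).detach()          # detach: no generator update here
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Step 2: train the generator to fool the discriminator.
    g_loss = bce(D(G(noise)), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(d_loss.item() > 0 and g_loss.item() > 0)  # True: both losses stay finite
```

The `detach()` in step 1 is what makes the scheme alternating: discriminator updates never flow back into the generator, and vice versa.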
Here the accuracy of the discriminator is 100%, which means it is perfectly identifying
whether an image is real or fake; the generator has completely failed to fool it. This is
the most common failure mode and it is called convergence failure.
EVALUATION:
It is not easy to evaluate the performance of a generative model with any single metric;
mostly, humans at the end have to decide whether the generated content is good and whether
it holds any particular meaning. However, we can judge the discriminator model on its
ability to distinguish real images from fake ones. Compared to ordinary convolutional
models, where high accuracy means better results, this is not true in the case of
generative models. If the discriminator has very high accuracy in distinguishing real
images from fake ones, that implies the generator has not done a very good job of creating
images that represent the dataset well. At a perfect equilibrium the discriminator should
have an accuracy of 50%, that is, it has to take a random guess to determine whether an
image is fake or not, implying the generator has created images so good that they are
indistinguishable from the originals. The closer the accuracy is to 50%, the better the
generator has done at creating images.
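A quick illustration of why 50% is the target: if the discriminator's scores carry no information (pure guessing), accuracy lands near one half. The helper below is a hypothetical metric, not code from the report:

```python
import numpy as np

def discriminator_accuracy(p_real, p_fake, threshold=0.5):
    """Fraction of correct real/fake calls at the given threshold."""
    correct = np.sum(p_real >= threshold) + np.sum(p_fake < threshold)
    return correct / (len(p_real) + len(p_fake))

# Perfect equilibrium: scores on real and fake images are indistinguishable,
# so the discriminator can only guess.
rng = np.random.default_rng(0)
guesses_real = rng.uniform(0, 1, 1000)
guesses_fake = rng.uniform(0, 1, 1000)
acc = discriminator_accuracy(guesses_real, guesses_fake)
print(round(acc, 2))  # close to 0.50
```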
Accuracy graph:
This graph shows the discriminator's accuracy on real images (blue) and on fake images
(orange). This GAN model stabilises within 100 to 300 epochs, after which it gives an
accuracy of approximately 70% to 80% and remains stable.
4.1 ADVANTAGES:
The automation of the photo generation process enables photographers and content creators
to focus more on the artistic aspects of their work. With AI handling the technical aspects,
such as colour correction, lighting adjustments, and image enhancement, photographers can
dedicate their time and energy to capturing unique perspectives, exploring creative concepts,
and telling compelling visual stories. This increased efficiency allows for a streamlined
workflow, enabling photographers to produce a larger volume of high-quality images in a
shorter amount of time.
One of the key advantages of AI-generated photos is their ability to enhance image quality.
AI algorithms can intelligently analyse each aspect of the photo, identifying areas that
require improvement and applying enhancements accordingly. This includes adjusting
lighting conditions to enhance details and contrast, correcting colour grading to achieve
One of the main concerns with AI-generated photos is the potential lack of authenticity and
originality. While AI algorithms excel at analysing patterns and generating visually pleasing
images, they may lack the human touch and personal expression found in traditional
photography. AI-generated photos can sometimes appear generic or formulaic, lacking the
unique perspective and creativity that photographers bring to their work. This raises
questions about the uniqueness and artistic value of AI-generated visuals.
The use of AI algorithms in generating photos also raises ethical considerations. Privacy and
consent issues may arise when AI is used to manipulate or generate images of individuals
without their knowledge or consent. Moreover, AI algorithms can inherit biases present in
the data they are trained on, potentially perpetuating societal biases or stereotypes in the
generated photos. Responsible and transparent AI practices are essential to address these
ethical concerns and to ensure the fair and unbiased use of AI-generated photos.
The performance of AI algorithms heavily relies on the quality and diversity of the training
data they are provided. If the training data is biased or lacks diversity, the generated photos
may inherit these limitations. AI algorithms need access to a wide range of high-quality
training data to produce accurate and representative results. Ensuring that the training
datasets are comprehensive and inclusive is crucial for overcoming biases and achieving the
desired level of diversity in the generated photos.
1) Home Route:
At the home endpoint or route, if the model has been loaded successfully without any error,
we redirect to the page where the user can provide an input. If the chosen model cannot be
loaded properly, we redirect to a route describing the error.
2) Generate Route:
After the user successfully enters a text, it is pre-processed into a vector and passed to
our LSTM model, which generates the embedding. The embedded vector is then passed to the
loaded generator model, and the output image is saved using the timestamp as the file name.
3) Result Route:
After the Image has been successfully generated, we redirect the application to a page
which displays the generated image.
4) Error Route:
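The routes above can be sketched as a minimal Flask app (a sketch only: it collapses the generate and result routes, omits the error route, and `generate_image` is a hypothetical stub for the real LSTM + generator pipeline):

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

def generate_image(text):
    # Hypothetical stand-in: the real app would embed the text with the LSTM,
    # run the generator, and save the image under a timestamped name.
    return f"static/{abs(hash(text))}.png"

@app.route("/")
def home():
    # Home route: show the input form (the real app would instead redirect
    # to an error page if the model failed to load).
    return render_template_string(
        '<form action="/generate"><input name="text"><input type="submit"></form>')

@app.route("/generate")
def generate():
    # Generate + result collapsed into one route for brevity.
    path = generate_image(request.args.get("text", ""))
    return render_template_string('<img src="{{ p }}" alt="generated flower">', p=path)

client = app.test_client()
print(client.get("/").status_code)  # 200
```

Flask's `test_client` lets the routes be exercised without starting a server, which is handy for checking the wiring before deployment.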
In this paper, a brief image generation review is presented. The existing image generation
approaches have been categorized based on the data used as input for generating new images,
including images, hand sketches, layouts and text. In addition, we presented existing work
on conditioned image generation, a type of image generation in which a reference is
exploited to generate the final image. An effective image generation method depends on the
dataset used, which must be a large-scale one. For that reason, we summarize popular
benchmark datasets used for image generation techniques. The evaluation metrics for
evaluating various methods are presented. Based on these metrics, as well as the datasets
used for training, a tabulated comparison is performed. Then, a summary of the current
image generation challenges is presented.
[2] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. Neural photo editing with introspective
adversarial networks. In ICLR, 2017.
[3] T. Che, Y. Li, A. P. Jacob, Y. Bengio, and W. Li. Mode regularized generative
adversarial networks. In ICLR, 2017.
[5] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using
a Laplacian pyramid of adversarial networks. In NIPS, 2015.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
CVPR, 2016.
[10] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie. Stacked generative
adversarial networks. In CVPR, 2017.