Seminar Report GRP No. 56
SEMINAR REPORT
Academic year
2022-23
Acknowledgement
We would like to thank Vivek H V who gave us a golden opportunity to work on this project for
NAYA Studio. We would also like to thank him for his constant motivation, valuable counsel
and advice in every possible way in spite of his busy schedule throughout our project activity.
We would like to express our sincere gratitude towards our project guide Dr. Mrudul Dixit for her
constant support and valuable guidance during the completion of this B.Tech Project.
We would also like to thank Dr. Prachi Mukherji (H.O.D., E&TC) for her constant encouragement,
valuable guidance, suggestions and her precious time in every possible way in spite of her busy
schedule throughout our project activity.
We take this opportunity to express our sincere thanks to all the teaching as well as non-teaching
staff of the E&TC department for their constant help whenever required. Finally, we express our
sincere thanks to all those who helped us directly or indirectly in many ways towards our
B.Tech project work.
Anushka Chikhale (C22019111125)
Prajakta Deshpande (C22019111133)
Pragati Dound (C22019111138)
Savi Gandewar (C22019111141)
Abstract
The use of technology in the field of design has transformed how products are designed. Today a
designer can visualize and design spaces without being physically present in them. This shows how
new and constantly evolving technologies have simplified the design process to a great extent.
Text-to-image models use deep neural networks to translate a natural language description into
an image. In text-to-image models, a language model transforms the input text into a latent
representation, and a generative model produces an image based on that representation. The
most effective models are trained using large amounts of web-scraped image and text data.
Text-to-image models can be used by a designer to realize ideas, enabling designers to visualize
their ideas in real time.
Although the use of technology for simplifying the design process is fairly new, it is surely
going to become mainstream in the future.
There is a need to develop an interface that will act as a bridge between the users and the
pre-trained model. The interface will enable users to generate images from text prompts,
store the images in their workspace, and retrieve images when needed.
TABLE OF CONTENTS
1. INTRODUCTION
2. LITERATURE SURVEY
   2.1 Product Table
3. SPECIFICATIONS
4. METHODOLOGY
   4.1 DALL-E Mini
   4.2 Vector Quantized Generative Adversarial Network (VQGAN)
   4.3 Contrastive Language-Image Pre-Training (CLIP)
5. DETAIL DESIGN
   5.1 User Flow
   5.2 Database
       5.2.1 Schema for Users Database
       5.2.2 Schema for Images Database
   5.3 API Documentation
6. RESULTS
7. EVALUATION
8. CONCLUSION
9. FUTURE SCOPE
10. REFERENCES
1. Introduction
AI is a promising exploratory area that can greatly improve the user experience for designers
and gather relevant data during the development of specific applications. The result is a growing
appreciation for technology that simplifies complex systems and drives product innovation.
Naya Studio is a platform that aims to create more inclusive and sustainable products through an
adaptive platform built around co-creation and trust, while providing an excellent user
experience.
Building a tool that brings a product idea from the designer's mind to reality using only words
or sentences is a fascinating concept. AI/ML image generation models make exactly this possible.
DALL-E is a deep learning image generation model that produces images from natural language
descriptions called captions. It is a generative pre-trained transformer that can generate
images in multiple styles, including photorealistic imagery, paintings, and emoji. The model is
trained on millions of captioned images from the web. Some concepts are recalled from the
training data, but the model can also create images of things that do not exist.
Several models are combined to achieve these results: an image encoder that converts a raw image
into a sequence of tokens (with an associated decoder that reverses the mapping), a model that
converts text prompts into encoded images, and a model that judges the quality of the generated
images for better filtering.
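The following is a minimal sketch of how these pieces could fit together at inference time; the three callables are hypothetical stand-ins for the pre-trained components, not the actual DALL-E implementation:

def generate_images(prompt, text_to_image_tokens, vqgan_decode, clip_score,
                    n_candidates=8, n_best=4):
    """Generate candidate images for a prompt and keep the best CLIP-ranked ones.

    The three callables stand in for the components described above: a
    text-to-image-token model, a decoder that turns tokens into pixels, and a
    CLIP scorer that rates how well an image matches the prompt.
    """
    candidates = []
    for _ in range(n_candidates):
        tokens = text_to_image_tokens(prompt)        # text prompt -> image token sequence
        image = vqgan_decode(tokens)                 # image tokens -> raw image
        candidates.append((clip_score(prompt, image), image))
    # Rank candidates by how well they match the prompt and keep the best few.
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [image for _, image in candidates[:n_best]]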
It is necessary to provide an interface that serves as a link between users and the pre-trained
model. Through this interface, users will be able to create images from text prompts, store the
images in their workspaces, and retrieve them as needed.
2. Literature Survey
3. Specifications
HARDWARE SPECIFICATIONS:
Memory: 8 GB RAM
SOFTWARE SPECIFICATIONS:
Operating System: Windows 11 Home (64-bit)
Language: Python
4. Methodology
After the initial literature survey and research, the following models were found to be most
suitable for this project:
● DALL-E Mini
● Vector Quantized Generative Adversarial Network (VQGAN)
● Contrastive Language-Image Pre-Training (CLIP)
We can train the transformer to predict the next value in this sequence. In this way, the
transformer learns how distant parts of the image are related to each other.
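The following is a minimal sketch of this idea, assuming an image has already been encoded into a sequence of discrete token ids; the helper and the toy token values are purely illustrative:

def next_token_pairs(token_sequence):
    """Turn an encoded image (a list of token ids) into next-token training pairs.

    At each position the model sees the tokens so far and is trained to predict
    the token that follows, which is how the transformer learns relationships
    between distant parts of the image.
    """
    pairs = []
    for i in range(1, len(token_sequence)):
        context = token_sequence[:i]   # everything generated so far
        target = token_sequence[i]     # the token the model must predict next
        pairs.append((context, target))
    return pairs

# Example with a toy 3x3 grid of image tokens flattened into one sequence.
tokens = [17, 4, 99, 4, 63, 8, 21, 17, 5]
for context, target in next_token_pairs(tokens)[:3]:
    print(context, "->", target)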
Fig 2: Methodology
● After the literature survey and finalization of models, run the models and compare their
outputs for the same input.
● Compare the results of these models and select the best model for developing the software
feature.
● Make a user-flow diagram describing the flow of data as the user navigates the website.
● Design the website wireframe and prototype in Figma.
● Develop a front-end for the website and establish a connection between the model and the
front-end (see the API sketch after this list).
● Accept text input from the user in order to generate images.
● Fetch the images generated by the model and display them to the user.
● Enable the user to select an appropriate image.
● Save the image selected by the user to the database.
● Display the image in the user's workspace.
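The following is a minimal sketch of how the front-end could talk to the model and the database, assuming a Flask backend; the endpoint names, fields, and helper functions are illustrative assumptions, not the final API documented in section 5.3:

from flask import Flask, jsonify, request

app = Flask(__name__)

def generate_images(prompt):
    """Placeholder for the wrapper around the pre-trained model (hypothetical)."""
    return [f"image_for_{prompt}_{i}.png" for i in range(4)]

def save_image(user_id, image_id):
    """Placeholder for the database helper that stores a selected image (hypothetical)."""
    print(f"saving image {image_id} for user {user_id}")

@app.route("/generate", methods=["POST"])
def generate():
    """Accept a text prompt from the front-end and return candidate images."""
    prompt = request.get_json().get("prompt", "")
    images = generate_images(prompt)
    # In practice the images would be returned as URLs or base64 strings.
    return jsonify({"prompt": prompt, "images": images})

@app.route("/save", methods=["POST"])
def save():
    """Save the image the user selected into their workspace."""
    body = request.get_json()
    save_image(body["user_id"], body["image_id"])
    return jsonify({"status": "saved"})

if __name__ == "__main__":
    app.run(debug=True)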
5. Detail Design
5.2 Database:
A database is an organized collection of structured information or data, usually stored electronically
in a computer system. This project requires two databases, one to store the basic details of users
of the platform and another to store images generated by the model. The user database will store
information about the user’s email, full name, profession and password. After successful login,
every user will be assigned a unique user id for easy identification.
The second database will contain information about the user's previously saved images and any
new images that the user wants to save, along with each image's caption. Each image will also
have a unique image id for easy identification.
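The following is a minimal sketch of the two schemas described above, assuming a relational store via Python's built-in sqlite3; columns beyond those listed in the text (such as image_path and created_at) are assumptions:

import sqlite3

conn = sqlite3.connect("naya_feature.db")

# Users database: email, full name, profession, password, and a unique user id.
conn.execute("""
    CREATE TABLE IF NOT EXISTS users (
        user_id    INTEGER PRIMARY KEY AUTOINCREMENT,
        email      TEXT UNIQUE NOT NULL,
        full_name  TEXT NOT NULL,
        profession TEXT,
        password   TEXT NOT NULL   -- in practice, store a hash, never plain text
    )
""")

# Images database: saved images with their caption and a unique image id,
# linked back to the user who saved them.
conn.execute("""
    CREATE TABLE IF NOT EXISTS images (
        image_id   INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id    INTEGER NOT NULL REFERENCES users(user_id),
        caption    TEXT NOT NULL,
        image_path TEXT NOT NULL,   -- assumption: image files kept on disk or object storage
        created_at TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

conn.commit()
conn.close()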
6. Results
On giving the text input "Mushroom shaped chair", the images generated by the different models
are as follows:
Outputs:
We can see in the above figure (Fig. 4) that the image is blurry and does not even properly
outline the object.
Model: GLIDE
As the figure (Fig. 7) shows, this model provides the client with a wide variety of choices while
producing accurate results.
FIGMA PROTOTYPE
An outline of what the end project will look like (a wireframe).
This is the login page where users sign in, create accounts, and access their workspaces.
This is the user's workspace; it is visible to them, and it is where they can enter the prompt
for the images.
For example: the user enters the prompt "Mushroom shaped chair" and clicks on Create.
The pre-trained model generates all the possible images and displays them to the user on the page.
The user only needs to click on an image to save it, if they want to save any of the generated
images.
Once saved, the image appears in the user's workspace, which may also be accessed by clicking
"My Workspace" in the top-left corner of the page.
7. Evaluation
● The automated performance metric used internally is the CLIP score (see the sketch after
this list).
● CLIP scores have limitations; for example, they are ineffective at counting objects. Because of
such limitations, human evaluations are also used to assess image quality and caption similarity.
● From visual inspection, DALL-E Mini generates the image that best matches the caption.
● Fast image generation.
● High-quality scene images (resolution: 256 x 256 pixels).
● Speed: 40 seconds per image.
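The following is a minimal sketch of computing a CLIP score for a generated image against its prompt, assuming the Hugging Face transformers implementation of CLIP; the checkpoint name and image path are illustrative:

from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Load a pre-trained CLIP model and its input processor.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "Mushroom shaped chair"
image = Image.open("generated_chair.png")   # illustrative path to a generated image

# Embed the prompt and the image, then take their similarity as the CLIP score.
inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
clip_score = outputs.logits_per_image.item()   # higher means a better prompt-image match
print(f"CLIP score for '{prompt}': {clip_score:.2f}")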
8. Conclusion
● The DALL-E Mini model is the most suitable for building the feature, as it produces more
realistic images.
● This model has a lower computation time.
● DALL-E Mini also gives an image that more accurately matches the provided caption.
9. Future scope
● Creating a user interface (UI) for the chosen pre-trained model to make it easier to use
and more effective.
● Creating a backend that uses a database and API to store the user's images and keep
them accessible even after the user logs out of the session.
● Adding functionality that modifies a provided image based on the user's request.
10. References
[1] Chitwan Saharia, William Chan, et al., "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding", 2022
[2] Akanksha Singh, Ritika Shenoy, Sonam Anekar, Prof. Sainath Patil, "Text to Image using Deep Learning"
[3] Alec Radford et al., "Learning Transferable Visual Models From Natural Language Supervision", 2021
[4] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen, "Hierarchical Text-Conditional Image Generation with CLIP Latents", 2022
[5] Bahjat Kawar, Huiwen Chang, et al., "Text-Based Real Image Editing with Diffusion Models", 2022
[6] https://colab.research.google.com/drive/1wwyTCWYNqTZbV0KFhqbaIRLesmHEZtB4?usp=sharing
[7] https://colab.research.google.com/drive/1_hPc8DDGIwLPGLiM7LgC_0AUkQ1MmEWb?authuser=1