
Volume 7, Issue 2, February – 2022 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Image Captionbot for Assistive Technology


Arnold Abraham, Aby Alias, Vishnumaya
Department of Computer Science and Engineering
Ilahia College of Engineering and Technology
Muvattupuzha, Kerala, India

Abstract:- Because an image can have a variety of meanings in different languages, it is difficult to generate short descriptions of those meanings automatically. It is also difficult to extract context from images and use it to construct sentences, because images contain so many different types of information. A system that can do this would allow blind people to explore their surroundings independently. Deep learning, a new programming trend, can be used to create this type of system. This project uses VGG16, a top-performing CNN architecture, for image classification and feature extraction. In the text description process, an LSTM and an embedding layer are used. These two networks are combined to form an image caption generation network, which is then trained on the Flickr8k dataset. The model's output is converted to audio for the benefit of those who are visually impaired.

Keywords:- Deep Learning; Recurrent Neural Network; Convolutional Neural Network; VGG16; LSTM.

I. INTRODUCTION

Many people with disabilities still find it difficult to participate fully in society, even though they are a valuable and important part of it. As a result, their social and economic advancement is hampered, and they have little opportunity to contribute to our economic prosperity. Our goal is to help bridge this ever-widening gap, and technological advances such as the one described here can assist in achieving it.

A person without visual impairment can deduce the scene and content of an image, but blind members of our society do not have this ability. Providing descriptions of visual content in the form of naturally spoken sentences could therefore be extremely beneficial to the visually impaired: in a world where no one is limited by their visual abilities, everyone would have access to the visual medium without having to see the objects themselves. The goal of this work is to empower the visually impaired with an automated method that captures visual content and produces natural-language sentences.

Before recent advances in computer vision, this ability was one of the most difficult for a computer to achieve on its own. Image description is harder than object recognition and classification because a caption must capture more than just the objects themselves; both a visual model and a linguistic model are needed to represent and understand the scene.

II. RELATED WORKS

For the past few years, researchers have focused on the problem of translating visual content into natural-language descriptions. Existing approaches, however, remain fragile and offer a limited set of capabilities because of certain constraints.

In [1], a new image captioning model known as a "domain-specific image caption generator" (DSIG) replaces general words in a caption with words that are specific to the target domain. The caption generator was evaluated both qualitatively and quantitatively. The model does not, however, allow a semantic ontology to be integrated end to end.

In [2], Kurt Shuster and his colleagues proposed a model that understands an image's content and provides humans with engaging captions. Using recent advances in image and sentence encoding, they build generative and retrieval models that perform well on standard captioning tasks. A new retrieval architecture called TransResNet is developed, along with a new state of the art for caption generation on the COCO dataset. Controllable personality traits are used to increase the models' appeal to humans, and the models can be trained by collecting large amounts of data. In terms of relevance and engagement, the system performs similarly to a human, and efforts to improve the generative models, which have so far fallen short, are ongoing.

Soheyla Amirian and colleagues describe automatic image annotation, tagging, and indexing in [3]. Automatically generating metadata in the form of captions (i.e., producing sentences that express the content of the image) is known as image captioning. Image captions can be used to search for images in databases, online, and on personal devices. Deep learning has had considerable success with image captioning in recent years, but the accuracy, diversity, and emotional impact of the captions still need to be addressed. The proposed generative adversarial models make it possible to generate new and combinatorial samples, and the authors aim to improve image captions by experimenting with various autoencoders, unsupervised neural networks that learn to encode data on their own.

In [4], N. Komal Kumar, D. Vigneswari, A. Mohan, K. Laxman, and J. Yuvaraj proposed a deep learning approach for generating image captions with neural networks, conducting their experiments on the Flickr8k dataset. The proposed method produced more accurate captions than the image caption generators available at the time, and the authors note that caption generators could benefit from a hybrid model.

Using knowledge graphs, Yimin Zhou, Yiwei Sun, and Vasant Honavar [5] proposed CNet-NIC, a new approach to image captioning. The performance of image captioning systems was compared on several benchmark data sets, such as MS COCO, using CIDEr-D, a performance measure designed specifically for evaluating image captioning systems. According to their results, methods that supplement the image with knowledge graphs outperform those that rely on the image alone.

Feng Chen and his colleagues [6] noted that attribute-based CNN-RNN frameworks relying heavily on manually selected attributes improved performance. In their neural image captioning model, topic models are integrated into the CNN-RNN framework: each image is associated with a number of topics, each with its own probability distribution, and the approach is evaluated on the Microsoft COCO dataset. According to the results, the model outperforms competing methods, and the topic features show that images can convey complex semantic information.

Seung-Ho Han and Ho-Jin Choi's explainable image caption generator [7] explains why particular words are chosen as captions for an image. An explanation module computes image-sentence relevance, which in turn influences the training of the generation module. The explanation module uses an image's caption words and regions to build a weight matrix that shows how the image's regions and individual words are interconnected. The model produces descriptive captions and provides context for its results better than competing approaches, and its remaining flaws may be addressed over time.

Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, and Bernt Schiele [8] developed an adversarial model for generating diverse image captions. The generator is trained in an adversarial learning framework against a discriminator network designed to promote diversity. Compared with a simpler model, the adversarial model generates captions that are more diverse and human-like, and its broader vocabulary lets it produce more distinctive captions. Human evaluation shows that caption variety can be enhanced while accuracy is maintained.

Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin [9] proposed a new approach to creating image descriptions. Rather than relying solely on word-for-word matching, they use semantic relevance, naturalness, and diversity, characteristics that had previously been overlooked, to improve overall quality. A conditional GAN with policy gradients and early feedback is used to overcome the difficulties of training this formulation end to end. Compared with an MLE-based model on Flickr30k, the proposed method produced more natural, diverse, and semantically relevant descriptions; user studies and data retrieval applications support this claim, and the approach also yields a more human-like evaluator.

StyleNet, an innovative framework for creating captions of different styles for images and videos, was developed by Chuang Gan and colleagues [10]. A new component of the model, the factored LSTM, automatically learns style factors from monolingual text corpora. At run time, the caption generation process can be adjusted to produce attractive captions in the style of the user's choice. To accomplish this, a combination of stylized monolingual text (e.g., romantic and humorous sentences) and factual image/video captions is used. FlickrStyle10K, a new dataset of 10K Flickr images with humorous and romantic captions, is also introduced, and StyleNet outperforms existing approaches for generating visual captions with distinct styles.

In [11], Jiasen Lu, Caiming Xiong, Devi Parikh, and Richard Socher present an adaptive attention encoder-decoder framework with a fallback option for the decoder and a new LSTM extension called the "visual sentinel". The model outperforms competing image captioning methods, and the authors analyse when the decoder should attend to the image and when it should fall back on the visual sentinel. The approach can also be applied in contexts beyond image captioning.

III. PROPOSED SYSTEM

Fig. 1: System Architecture

A. OVERVIEW
VGG16, one of the best-performing CNN architectures for image classification, is used in the proposed method to extract features from images. Text descriptions are encoded using an embedding layer and an LSTM. These two networks are combined to create an image captioning network, which is then trained on the Flickr8k dataset. The trained model generates captions for the visually impaired, and the generated captions are converted to audio.

B. MODULES
The main modules in the proposed system are:

a) IMAGE FEATURE EXTRACTION
The VGG16 model, one of the strongest CNN architectures for image classification, is used to extract image features. We begin by extracting each image's features with this pre-trained model, saving the feature vectors produced by VGG16, and creating a mapping from image IDs to features.

b) TEXT PROCESSING
To begin, the text is lowercased and all punctuation and numbers within words are removed. A text vocabulary is then created and saved. Because a single image can have multiple descriptions, a mapping between images and their descriptions is created.
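As an illustration of modules (a) and (b), the following is a minimal sketch of how VGG16 feature extraction and caption cleaning might look with Keras/TensorFlow. The directory layout (Flicker8k_Dataset/), the helper names, and the pickle file are illustrative assumptions rather than the exact implementation used in this project.

```python
# Sketch of modules (a) and (b): VGG16 feature extraction and caption cleaning.
# Paths and file names are assumptions; adjust them to your own copy of Flickr8k.
import os
import pickle
import string

from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model


def extract_features(image_dir):
    """Map each image ID to the 4096-d vector from VGG16's penultimate layer."""
    vgg = VGG16(weights="imagenet")
    # Drop the final classification layer; keep the fc2 feature layer.
    extractor = Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)
    features = {}
    for name in os.listdir(image_dir):
        image = load_img(os.path.join(image_dir, name), target_size=(224, 224))
        array = img_to_array(image).reshape((1, 224, 224, 3))
        features[name.split(".")[0]] = extractor.predict(preprocess_input(array), verbose=0)
    return features


def clean_captions(captions):
    """Module (b): lowercase and strip punctuation/numbers.

    `captions` maps an image ID to a list of raw description strings.
    """
    table = str.maketrans("", "", string.punctuation)
    for image_id, lines in captions.items():
        cleaned = []
        for line in lines:
            words = [w.translate(table).lower() for w in line.split()]
            cleaned.append(" ".join(w for w in words if w.isalpha()))
        captions[image_id] = cleaned
    return captions


if __name__ == "__main__":
    feats = extract_features("Flicker8k_Dataset")
    pickle.dump(feats, open("features.pkl", "wb"))  # image-ID -> feature mapping
```

Saving the features once and reloading them from the pickle file avoids re-running VGG16 for every training epoch.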

c) TOKENIZATION
Start and end tokens are added to each description. The text is then tokenized, and the fitted tokenizer is saved. Each description is converted into a numerical sequence, and sequences of words are paired with the corresponding image for training.

d) ARCHITECTURE CREATION
The image features are represented by two dense layers, while the text descriptions use an embedding layer and an LSTM. These two networks are combined to create a network that automatically generates captions for image files.

e) TRAIN THE MODEL
We train the model in Google Colab and save the trained model.

f) IMAGE CAPTION GENERATION
Creating a human-readable description of a photograph is a difficult artificial intelligence problem: in addition to image understanding, it requires a model from the field of natural language processing. Features are first extracted from the input image with the pre-trained VGG16 model. The saved model and tokenizer are then used by a caption generation function to produce the caption, which is finally converted into an audio file.
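To make modules (c) and (d) concrete, the sketch below shows one common way to tokenize the cleaned captions and assemble the merge-style network described above (dense layers for the VGG16 features, an embedding layer plus LSTM for the text). The "startseq"/"endseq" markers, the 256-unit layers, and the 0.5 dropout rate are assumptions chosen for illustration; the paper does not specify exact hyperparameters.

```python
# Sketch of modules (c) and (d): tokenization and the caption network.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model


def build_tokenizer(descriptions):
    """Fit a tokenizer on all captions, each wrapped in start/end tokens."""
    lines = ["startseq " + d + " endseq" for caps in descriptions.values() for d in caps]
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    max_length = max(len(line.split()) for line in lines)
    return tokenizer, max_length


def define_model(vocab_size, max_length):
    """Merge model: image-feature branch + text branch -> next-word softmax."""
    # Image branch: the saved 4096-d VGG16 feature vector goes through dense layers.
    inputs1 = Input(shape=(4096,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation="relu")(fe1)

    # Text branch: embedding layer followed by an LSTM.
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)

    # Combine the two branches and predict the next word.
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation="relu")(decoder1)
    outputs = Dense(vocab_size, activation="softmax")(decoder2)

    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```

For module (e), `model.fit` can then be run in Google Colab on (image feature, partial word sequence) pairs labelled with the next word, and the trained network stored with `model.save`.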

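Module (f) can likewise be sketched as a greedy decoding loop followed by text-to-speech conversion. The paper does not name a specific text-to-speech library, so the use of gTTS here (and greedy rather than beam-search decoding) is an assumption for illustration.

```python
# Sketch of module (f): generate a caption for one image and convert it to audio.
# Greedy decoding and gTTS are illustrative choices; the paper does not fix either.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from gtts import gTTS  # pip install gTTS


def generate_caption(model, tokenizer, photo_feature, max_length):
    """Greedy decoding: repeatedly predict the next word until 'endseq'."""
    index_word = {i: w for w, i in tokenizer.word_index.items()}
    text = "startseq"
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = np.argmax(model.predict([photo_feature, seq], verbose=0))
        word = index_word.get(yhat)
        if word is None or word == "endseq":
            break
        text += " " + word
    return text.replace("startseq", "").strip()


def caption_to_audio(caption, out_path="caption.mp3"):
    """Convert the generated caption to speech for visually impaired users."""
    gTTS(text=caption, lang="en").save(out_path)
```

Here `photo_feature` is the saved (1, 4096) VGG16 vector for the input image, loaded from the feature mapping produced in module (a).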
IV. CONCLUSION

A system was thus created to assist the blind and visually impaired in achieving their goals and contributing more to society as a whole. To create a mapping between images and sentences, a VGG16 network is first used to encode the images, followed by an LSTM network that generates the sentences. The quantitative validation of our model yielded promising results.

The model's precision and efficiency will be improved in the future. A server-client model could also be added so that the blind can use the system in any environment. As the size of the dataset grows, overfitting becomes less of a problem. Furthermore, we believe that our research could pave the way for a more general form of AI.

REFERENCES

[1.] Seung-Ho Han and Ho-Jin Choi, "Domain-Specific Image Caption Generator with Semantic Ontology," IEEE 2020.
[2.] Kurt Shuster, Samuel Humeau, Hexiang Hu, Antoine Bordes, Jason Weston, "Engaging Image Captioning via Personality," IEEE 2019.
[3.] Soheyla Amirian, Khaled Rasheed, Thiab R. Taha, Hamid R. Arabnia, "Image Captioning with Generative Adversarial Network," IEEE 2019.
[4.] N. Komal Kumar, D. Vigneswari, A. Mohan, K. Laxman, J. Yuvaraj, "Detection and Recognition of Objects in Image Caption Generator System: A Deep Learning Approach," IEEE 2019.
[5.] Yimin Zhou, Yiwei Sun, Vasant Honavar, "Improving Image Captioning by Leveraging Knowledge Graphs," IEEE 2019.
[6.] Feng Chen, Songxian Xie, Xinyi Li, Shasha Li, Jintao Tang, Ting Wang, "What Topics Do Images Say: A Neural Image Captioning Model with Topic Representation," IEEE 2019.
[7.] Seung-Ho Han and Ho-Jin Choi, "Explainable Image Caption Generator Using Attention and Bayesian Inference," IEEE 2018.
[8.] Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, Bernt Schiele, "Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training," IEEE 2017.
[9.] Bo Dai, Sanja Fidler, Raquel Urtasun, Dahua Lin, "Towards Diverse and Natural Image Descriptions via a Conditional GAN," IEEE 2017.
[10.] Chuang Gan, Zhe Gan, Xiaodong He, Jianfeng Gao, Li Deng, "StyleNet: Generating Attractive Visual Captions with Styles," IEEE 2017.
[11.] Jiasen Lu, Caiming Xiong, Devi Parikh, Richard Socher, "Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning," IEEE 2017.
