
2023 Third International Conference on Ubiquitous Computing and Intelligent Information Systems (ICUIS)
979-8-3503-0698-9/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICUIS60567.2023.00012

A Novel Approach of Image Caption Generator using Deep Learning

DilipKumar Jang Bahadur Saini, Department of Computer Science and Engineering, Pimpri Chinchwad University, Pune 412106, India. [email protected]
Sunil Kumar, Department of Computer Science and Engineering, Meerut Institute of Engineering and Technology, Meerut (U.P.), India. [email protected]
Kapil Joshi, Department of CSE, Uttaranchal Institute of Technology, Uttaranchal University, Dehradun, India. [email protected]
Abhishek Kumar Pathak, Assistant Professor, UPES, Dehradun, India. [email protected]
Saksham Jain, Department of Information Technology, Meerut Institute of Engineering and Technology, Meerut (U.P.), India. [email protected]
Anupam Singh, Associate Professor, Department of Computer Science and Engineering, Graphic Era Hill University, Dehradun. [email protected]

Abstract—Image caption generation is an emerging field of study that mainly focuses on developing systems that can generate captions for an image. In today's world, image captioning is a very useful tool. Many systems use machine learning models, such as deep learning models including CNNs and RNNs, to analyze images and generate captions. Recent developments in caption generation have focused on transfer learning, reinforcement learning, and multimodal approaches. The proposed system has five phases: data cleaning, extraction, layering, training, and testing. The proposed model is tested on the Flickr8k dataset for image caption generation and is implemented in Python.

Keywords—Image, Caption, Xception, Recurrent neural network (RNN), Long short-term memory (LSTM), Convolutional neural networks (CNN), Deep learning, Computer vision (CV).

I. INTRODUCTION

An image caption generator is a type of natural language processing system that generates textual narrations for images. It is a process of understanding an image's context and explaining it with appropriate captions using deep learning techniques, a task long considered infeasible by computer vision researchers. Image caption generation models are commonly based on deep learning techniques, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), trained on large image datasets and their corresponding captions. CNNs are used to understand and extract the features of an image, and RNNs are used to generate captions for that image. To test our model, we measure its performance on the Flickr8k dataset, which contains approximately 8,000 images, each with five captions. Image captioning has many applications, for example in healthcare, where it helps visually impaired patients understand visual content. Recently, computer vision in the image processing area has shown a significant amount of progress. Although the model has many benefits, there are also some limitations. As research in the field continues to advance, image captioning has ever more potential to make visual content understandable. The purpose of this article is to produce captions that are informative, expressive, and easily understandable by humans.

II. RELATED WORK

In order to better understand how we will develop a new CNN structure in our study, we examine some prior research that emphasizes the importance of image caption generation. A CNN's organizational structure has an impact on the performance of recognition or prediction [1].

Image caption generation (ICG) is a challenging task that involves creating a textual description of an image. ICG has applications in various domains, including computer vision, natural language processing, and assistive technologies for visually impaired individuals [2]. In this literature review, we discuss the most prominent approaches and their performance.

1. Encoder-decoder-based methods: These methods consist of two main components, an encoder and a decoder. The encoder takes an image as input and extracts high-level features using a CNN. The decoder generates a textual description of the image using an RNN. The most popular encoder-decoder-based methods are Show and Tell; Show, Attend and Tell; and Up-Down.
Show and Tell: Show and Tell was introduced by Vinyals et al. in 2015. It was the first model to use an end-to-end architecture for ICG. The CNN extracts image features, which are then fed into the LSTM to generate captions for an image.
Show, Attend, and Tell: This model was proposed by Xu et al. in 2016. It is an extension of Show and Tell that incorporates an attention mechanism.

The attention mechanism allows the decoder to focus on specific image regions when generating each word.
Up-Down: Up-Down was introduced by Anderson et al. in 2018. It uses a two-stage attention mechanism to generate image captions. In the first stage, the model generates a set of attention maps, which indicate the salient regions of the image. In the second stage, the model generates the caption by attending to the attention maps and the image features.

2. Transformer-based methods: Transformer-based methods have gained popularity in recent years due to their superior performance in various natural language processing tasks. The most popular transformer-based methods for ICG are ViLBERT and LXMERT.
ViLBERT: ViLBERT was proposed by Lu et al. in 2019. It is a multi-modal transformer-based model that can jointly process visual and textual inputs. ViLBERT uses two separate transformers for visual and textual inputs, which are then fused together to generate the image caption.
LXMERT: LXMERT was introduced by Tan and Bansal in 2019. It is a large-scale transformer-based model that can process multiple modalities, including text, image, and knowledge graph. LXMERT uses a cross-modal transformer to encode visual and textual inputs and a graph attention mechanism to incorporate external knowledge.

3. Hybrid models: Hybrid models combine the strengths of encoder-decoder and transformer-based methods. These models use a CNN as an encoder and a transformer-based architecture as a decoder. The most popular hybrid models are Oscar and UNITER.
Oscar: Oscar was introduced by Li et al. in 2020. It uses a hybrid architecture that combines the power of CNN and transformer-based models. The model uses a CNN as an encoder and a transformer-based decoder that incorporates both positional and visual embeddings.
UNITER: UNITER was proposed by Chen et al. in 2020. It is a transformer-based model that can process both text and image inputs. UNITER uses a cross-modal transformer to encode the inputs and a region-to-token attention mechanism to generate the caption.

In conclusion, the literature review highlights the evolution of image caption generation techniques from early approaches to the incorporation of attention mechanisms, transfer learning, reinforcement learning, and evaluation metrics. The encoder-decoder-based methods were the initial approaches, followed by the emergence of transformer-based models [3]. Hybrid models have also gained attention due to their ability to leverage the strengths of both encoder-decoder and transformer-based architectures. These advancements have significantly improved the performance of image caption generators, leading to more accurate and contextually relevant captions [4].

III. PROPOSED METHODOLOGY

The following steps are used for the image caption generator; illustrative sketches of several of these steps are given after the list.
1. Dataset Preparation: The first step is to gather a suitable dataset for training the image caption generator. This dataset should consist of paired images and their corresponding captions. Popular datasets used in the field include MSCOCO, Flickr8k, and Flickr30k. The dataset should be pre-processed by resizing the images to a consistent size and tokenizing the captions into individual words.
2. Pre-trained Image Encoder: Utilize a pre-trained convolutional neural network (CNN) as an image encoder to extract high-level features from the input images. Common choices for the CNN architecture include VGG16, ResNet, and Inception. The CNN is typically pre-trained on a large-scale image classification task, such as ImageNet, to capture general image representations.
3. Text Pre-processing: Perform text pre-processing on the caption data. This involves tokenizing the captions into individual words, removing punctuation, converting words to lowercase, and creating a vocabulary mapping that assigns a unique index to each word in the dataset.
4. Sequence Generation Model: Use a sequence generation model to generate captions given the encoded image features. One popular choice is a recurrent neural network (RNN) with long short-term memory (LSTM) or gated recurrent unit (GRU) units.
5. Training: Train the image caption generator by optimizing it to minimize the discrepancy between the generated captions and the ground truth captions from the dataset. The training involves feeding the encoded image features into the RNN, generating a sequence of words, and comparing it to the ground truth caption using a loss function such as cross-entropy loss.
6. Attention Mechanism: Incorporate an attention mechanism into the image caption generator to focus on different regions of the image while generating each word in the caption. The attention mechanism helps the model align relevant image regions with the corresponding words in the caption, resulting in more accurate and contextually relevant descriptions.
7. Beam Search: During caption generation, utilize beam search instead of a greedy approach to improve the quality of captions. It maintains a set of multiple candidate captions and selects the most likely ones based on a scoring criterion, considering both the generated words and their corresponding attention weights.
8. Evaluation: Evaluate the performance of the image caption generator using suitable metrics such as BLEU, METEOR, and CIDEr.
9. Fine-tuning and Transfer Learning: Fine-tune the pre-trained image encoder and sequence generation model on the specific image captioning task to improve performance. This can involve freezing certain layers and updating others to adapt to the specific dataset and task requirements. Transfer learning techniques can also be applied by initializing the model with weights pre-trained on a similar task or dataset [14, 15].
10. Inference: In the inference phase, given a new image, extract its features using the pre-trained image encoder. Feed the encoded features into the trained sequence generation model and use beam search to generate captions for the image. Post-process the generated caption by converting the word indices back into their corresponding words.
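To make steps 1 and 2 concrete, the following is a minimal sketch of image feature extraction with a pre-trained encoder. It is not the authors' code; it assumes TensorFlow/Keras, uses the Xception network named in the keywords, and the directory name is only illustrative.

```python
# Sketch: extract image features with a pre-trained CNN encoder (steps 1-2).
# Assumes TensorFlow/Keras; the folder layout is hypothetical.
import os
import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# Xception without its classification head; global pooling gives a 2048-d vector.
encoder = Xception(weights="imagenet", include_top=False, pooling="avg")

def extract_features(image_dir):
    """Map each image file name to a fixed-length feature vector."""
    features = {}
    for name in os.listdir(image_dir):
        img = load_img(os.path.join(image_dir, name), target_size=(299, 299))
        x = img_to_array(img)                              # (299, 299, 3)
        x = preprocess_input(np.expand_dims(x, axis=0))    # scale to Xception's input range
        features[name] = encoder.predict(x, verbose=0)[0]  # (2048,)
    return features

features = extract_features("Flicker8k_Dataset")  # illustrative folder name
```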

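Step 3 (text pre-processing) can be sketched as follows; the cleaning rules mirror the description above (lowercasing, punctuation removal, a word-to-index vocabulary), and the small captions dictionary is only a stand-in for captions loaded from the Flickr8k annotation file.

```python
# Sketch: clean captions and build a vocabulary mapping (step 3).
import string

captions = {  # stand-in for {image_name: [caption, ...]} parsed from the annotation file
    "example1.jpg": ["A dog runs through the grass.", "A brown dog is running outside."],
}

def clean_caption(text):
    table = str.maketrans("", "", string.punctuation)
    words = [w for w in text.lower().translate(table).split() if w.isalpha()]
    return "startseq " + " ".join(words) + " endseq"   # sequence boundary tokens

def build_vocab(raw_captions):
    cleaned = {img: [clean_caption(c) for c in caps] for img, caps in raw_captions.items()}
    words = {w for caps in cleaned.values() for c in caps for w in c.split()}
    word_to_index = {w: i + 1 for i, w in enumerate(sorted(words))}  # 0 reserved for padding
    return cleaned, word_to_index

cleaned_captions, word_to_index = build_vocab(captions)
index_to_word = {i: w for w, i in word_to_index.items()}
vocab_size = len(word_to_index) + 1
```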

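Steps 7 and 10 (beam-search decoding at inference time) can be sketched generically as below. The model interface, the startseq/endseq tokens, and the index mappings are assumptions carried over from the previous sketches, not details taken from the paper.

```python
# Sketch: beam-search caption decoding (steps 7 and 10).
# Assumes a trained Keras model mapping (image feature, padded word-index sequence)
# to a probability distribution over the next word; all names are illustrative.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def beam_search_caption(model, feature, word_to_index, index_to_word,
                        max_len=34, beam_width=3):
    start, end = word_to_index["startseq"], word_to_index["endseq"]
    beams = [([start], 0.0)]                      # (token sequence, log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:                    # finished captions are kept as-is
                candidates.append((seq, score))
                continue
            padded = pad_sequences([seq], maxlen=max_len)
            probs = model.predict([np.array([feature]), padded], verbose=0)[0]
            for idx in np.argsort(probs)[-beam_width:]:   # expand the top-k next words
                candidates.append((seq + [int(idx)], score + np.log(probs[idx] + 1e-12)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    best = beams[0][0]
    words = [index_to_word[i] for i in best if i not in (start, end)]
    return " ".join(words)
```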
By following this proposed methodology, one can develop an image caption generator that effectively generates descriptive and contextually relevant captions for input images. Experiment with different architectures, hyperparameters, and training strategies to optimize the performance of the model.
11. Defining the CNN-LSTM model: To build our model, we merge these two architectures; the result is also known as the CNN-RNN model.
• CNN (Convolutional Neural Network) is used in an image caption generator to extract visual features from images, enabling the model to understand the content and context of the image and generate relevant and descriptive captions.
• LSTM (Long Short-Term Memory) is used in an image caption generator to generate coherent and contextually aware captions by modeling the sequential dependencies between words in the generated text [13].

Fig. 1: Image caption generator model

With the model defined and trained, the steps involved in an image caption generator based on deep learning models are as follows:
• Pre-processing and Feature Extraction: The CNN extracts high-level visual features from the image, encoding its content into a fixed-length feature vector.
• Caption Generation: The feature vector from the CNN is fed as input to an LSTM-based language model.
• Evaluation and Refinement: The generated captions are evaluated using metrics like BLEU or CIDEr to assess their quality and similarity to reference captions.

Fig. 2: Caption Generator deep learning model

Aim of the Proposed System
The aim of the proposed image caption generator is to automatically generate identifying and accurate captions for an image [5]. The system uses a machine learning technique to understand the relationship between the images and the generated captions. Illustrative sketches of the model definition, its training, and its evaluation follow.
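Below is a minimal sketch of the CNN-LSTM merge model outlined above, assuming Keras, 2048-dimensional Xception features, and the vocabulary built earlier; the layer widths and example sizes are illustrative choices, not the configuration reported by the authors.

```python
# Sketch: CNN-LSTM "merge" caption model (item 11 / Fig. 2).
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

def define_model(vocab_size, max_len, feature_dim=2048):
    # Image branch: compress the pre-extracted CNN feature vector.
    img_in = Input(shape=(feature_dim,))
    img_vec = Dense(256, activation="relu")(Dropout(0.5)(img_in))

    # Text branch: embed the partial caption and run it through an LSTM.
    txt_in = Input(shape=(max_len,))
    txt_emb = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
    txt_vec = LSTM(256)(Dropout(0.5)(txt_emb))

    # Merge both branches and predict the next word.
    merged = Dense(256, activation="relu")(add([img_vec, txt_vec]))
    output = Dense(vocab_size, activation="softmax")(merged)

    model = Model(inputs=[img_in, txt_in], outputs=output)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model

model = define_model(vocab_size=vocab_size, max_len=34)  # max_len is an assumed value
```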

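Training (step 5) pairs each image feature with every prefix of its caption and the next ground-truth word, optimized with cross-entropy. The generator below is a simplified sketch under the same assumptions as the previous blocks; batching one image at a time is a simplification rather than the authors' setup.

```python
# Sketch: training-pair construction and model fitting (step 5).
# Builds (image feature, caption prefix) -> next-word one-hot targets.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(cleaned_captions, features, word_to_index, max_len, vocab_size):
    while True:  # Keras generators loop forever; steps_per_epoch bounds an epoch
        for name, caps in cleaned_captions.items():
            feature = features[name]
            X_img, X_seq, y = [], [], []
            for cap in caps:
                seq = [word_to_index[w] for w in cap.split() if w in word_to_index]
                for i in range(1, len(seq)):
                    X_img.append(feature)
                    X_seq.append(pad_sequences([seq[:i]], maxlen=max_len)[0])
                    y.append(to_categorical(seq[i], num_classes=vocab_size))
            yield (np.array(X_img), np.array(X_seq)), np.array(y)

steps = len(cleaned_captions)
model.fit(data_generator(cleaned_captions, features, word_to_index, 34, vocab_size),
          epochs=20, steps_per_epoch=steps, verbose=1)
```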

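Evaluation (step 8) can be sketched with corpus-level BLEU from NLTK, reusing the beam-search decoder defined earlier; the test-split dictionary is assumed to have the same structure as the training captions.

```python
# Sketch: corpus BLEU evaluation (step 8) with NLTK.
from nltk.translate.bleu_score import corpus_bleu

def evaluate(model, test_captions, features, word_to_index, index_to_word, max_len=34):
    references, hypotheses = [], []
    for name, caps in test_captions.items():
        pred = beam_search_caption(model, features[name],
                                   word_to_index, index_to_word, max_len)
        # Each image has several reference captions; strip the startseq/endseq tokens.
        references.append([c.split()[1:-1] for c in caps])
        hypotheses.append(pred.split())
    print("BLEU-1:", corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0)))
    print("BLEU-4:", corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25)))
```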
Convolutional Neural Network
Convolutional Neural Networks play an important role in image caption generation. CNNs are used as the encoder in the encoder–decoder model for generating image captions. CNNs are responsible for extracting features from an image, and the output of the CNN is passed to the decoder, which generates captions using an RNN [6]. In the convolution process the network applies a set of filters to the image, which helps to identify specific features such as edges and corners. This allows the CNN to learn about the different objects present in the input image and to differentiate one image from another [12]. The output of the convolutional layer is passed to the pooling layer, which helps to retain its characteristics. The output of these layers is then passed through fully connected layers, which extract the final features of the image. Moreover, a distinguishing feature of a Convolutional Neural Network (CNN) that makes it different from other machine learning algorithms is its capability to pre-process the data by itself [7]. Thus, you do not have to devote many resources to data pre-processing. The figure below shows the working of a deep CNN.

Fig. 3: Working of a deep convolutional neural network

Long short-term memory
LSTM is a kind of RNN that has the ability to learn long-term dependencies. LSTMs perform exceptionally well on a wide range of sequence modeling problems, and they are now frequently used. LSTMs are designed to avoid the problems that arise from long-term dependencies [7]. They possess the property of remembering information over a long period of time. Long short-term memory (LSTM) networks are generally used in image caption generators. LSTMs are well suited for tasks that involve sequential data, such as generating captions for an image. In image caption generators, LSTMs are used as the decoder to generate the relevant captions. Throughout the processing of inputs, the LSTM carries forward the relevant information and discards non-relevant information. The memory cell serves as the memory of the LSTM [8, 13].

• Forget gate: The forget gate is a component that is used to selectively forget information from previous hidden states. The forget gate takes as input the concatenation of the previous hidden state (h_{t-1}) and the current input (x_t). The computation of the forget gate is:
f_t = σ(W_f * [h_{t-1}, x_t] + b_f)

Fig. 4: Forget gate

• Input gate: The input gate in an image caption generator controls the amount of image information used in generating captions by selectively allowing or restricting the flow of image features to subsequent model stages, ensuring relevant, image-contextualized captions [9].

Fig. 5: Input gate

• Output gate: The output gate in an image caption generator regulates the amount of generated caption information to be output by selectively allowing or restricting the flow of hidden-state information, ensuring the production of relevant and coherent captions based on the image input [11].

Fig. 6: Output gate

The final LSTM cell structure is shown below.

Fig. 7: LSTM cell structure
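For completeness, the remaining gate computations of a standard LSTM cell, written to match the forget-gate formula above, are listed below; this is the textbook formulation and is assumed rather than quoted from the paper.

```latex
% Standard LSTM gate equations (assumed; consistent with f_t above)
\begin{align*}
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) && \text{(input gate)}\\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) && \text{(output gate)}\\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) && \text{(candidate cell state)}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell state update)}\\
h_t &= o_t \odot \tanh(C_t) && \text{(hidden state / output)}
\end{align*}
```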

The ImageNet dataset
In this deep learning project, we have made use of the ImageNet dataset. This dataset is a benchmark for different pictures and it also includes a large number of real-world images [19]. The images for this project have been taken from the Flickr8k dataset, which has a total of 8,000 images and a memory size of about 1 GB. ImageNet has been used as a standard dataset [20] for a wide range of computer vision tasks such as image classification and object detection [10].

Fig. 8: Examples of images present in the dataset

IV. RESULTS AND DISCUSSION

The results for the respective inputs are shown below. In recent years, there have been several advances in the field of image captioning with the use of deep learning models such as CNNs and LSTMs. Many state-of-the-art models use an encoder–decoder architecture, where CNNs are used as the encoder and LSTMs are used as the decoder. We have increased the amount of data used for training our model to improve its accuracy and performance. Some of the outputs are given below:

1. Input image: Output: Man is standing on rock overlooking the mountains.
2. Input image: Output: Two men on the phone walking down a busy street.
3. Input image: Output: Two girls are playing in the grass.

V. CONCLUSION AND FUTURE SCOPE

In conclusion, image caption generators have made significant advancements in generating descriptions that accurately capture the content of images. Image captioning methods have made significant progress in recent years. They enable the automatic generation of descriptive captions for images, improving accessibility and understanding of visual content.

However, challenges remain in improving caption quality, fine-grained image understanding, multimodal approaches, transfer learning, evaluation metrics, and ethical considerations. The methodology includes grouped approaches in which deep learning is the prime component of the designs used in this model. Future research can focus on enhancing caption quality by reducing errors and improving language fluency. Additionally, developing models that understand fine-grained details in images would result in more informative captions.

REFERENCES

[1] Jianhui Chen, Wenqiang Dong, and Minchen Li, "Image Caption Generator Based on Deep Neural Networks," ACM, 2014.
[2] Sreejith S P and Vijayakumar A, "Image Captioning Generator using Deep Machine Learning," 2021.
[3] Ali Ashraf Mohamed, "Image caption using CNN and LSTM," 2020.
[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[5] Aghasi Poghosyan, "Long Short-Term Memory with Read-only Unit in Neural Image Caption Generator," IEEE, 2017.
[6] Sunil Kumar, Aanjey Mani Tripathi, Hanshika Bhatia, Gurneet Kaur, Daksh Aggarwal, and Divyansh Chauhan, "Design and Implementation of e-learning Platform Using Data Analysis," in Mahapatra, R.P., Peddoju, S.K., Roy, S., Parwekar, P., Goel, L. (eds.), Proceedings of International Conference on Recent Trends in Computing, Lecture Notes in Networks and Systems, vol. 341, pp. 81-89, Springer, Singapore, 2021. https://doi.org/10.1007/978-981-16-7118-0_7
[7] Akshat Singhal and Sunil Kumar, "Mobile Application on Drowsiness Detection When Driving Car," in Mishra, B., Tiwari, M. (eds.), VLSI, Microwave and Wireless Technologies, Lecture Notes in Electrical Engineering, vol. 877, pp. 337-345, Springer, Singapore, 2021. https://doi.org/10.1007/978-981-19-0312-0_34
[8] Saad Albawi, Tareq Abed Mohammed, and Saad Al-Zawi, "Understanding of a Convolutional Neural Network," IEEE, 2017.
[9] A. Graves, A. Mohamed, and G. E. Hinton, "Speech Recognition with Deep Recurrent Neural Networks," pp. 6645-6649, 2013.
[10] Dilip Kumar Jang Bahadur Saini, Shailesh D. Kamble, Ravi Shankar, M. Ranjith Kumar, Dhiraj Kapila, Durga Prasad Tripathi, and Arunava De, "Fractal video compression for IoT-based smart cities applications using motion vector estimation," Measurement: Sensors, 100698, ISSN 2665-9174, 2023. https://doi.org/10.1016/j.measen.2023.100698
[11] Shailesh Kamble, Dilip Kumar J. Saini, Vinay Kumar, Arun Kumar Gautam, Shikha Verma, Ashish Tiwari, and Dinesh Goyal, "Detection and tracking of moving cloud services from video using saliency map model," Journal of Discrete Mathematical Sciences and Cryptography, vol. 25, no. 4, pp. 1083-1092, 2022. DOI: 10.1080/09720529.2022.2072436
[12] C. Chen, X. Zhang, Q. You, C. Fang, Z. Wang, H. Jin, and J. Luo, "Generative adversarial transformer for image captioning," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 706-726, 2020.
[13] Piyush Ram, Amarjeet Veer, Anubhav Sharma, Sunil Kumar, and Nighat Naaz Ansari, "Stock Price Prediction Using Machine Learning," in Mahapatra, R.P., Peddoju, S.K., Roy, S., Parwekar, P. (eds.), Proceedings of International Conference on Recent Trends in Computing, Lecture Notes in Networks and Systems, vol. 600, pp. 79-87, Springer, Singapore, 2022. https://doi.org/10.1007/978-981-19-8825-7_8
[14] Vatsal Bhardwaj, Akash Rastogi, Ankit Chauhan, Ajay Kumar Singh, and Sunil Kumar, "Frost-The Real Assistant," 2022 Second International Conference on Computer Science, Engineering and Applications (ICCSEA), Gunupur, India, pp. 1-6, 2022. DOI: 10.1109/ICCSEA54677.2022.9936248
[15] J. Huang, Q. Chen, J. Yuan, and D. N. Metaxas, "Towards detailed image captioning by learning visual and semantic representations," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2501-2511, 2021.
