
Review on Image Caption Generation

Aishwarya Mark1, Sakshi Adokar2, Vageshwari Pandit3, Rutuja Hambarde4, Prof. Swapnil Patil5
CSE Students, SKN College of Engg, Pune, India1,2,3,4
Guide, SKN College of Engg, Pune, India5



Abstract— With the rapid development of deep learning and AI, along with computer vision and natural language processing, image captioning has become an interesting and complex task. Image caption generation is the process of generating a textual description of a given image, and it is a challenging task because it requires the apprehension of objects. If a machine can be programmed to accurately describe an image or environment the way human vision does, it will be highly beneficial for robotic vision, business and many other domains. In order to generate an effective description of an image, the machine needs to detect and recognize objects as well as understand the scene type or location, object properties, their relationships and their interactions with each other. In this paper, we focus on advanced image captioning techniques such as the CNN (Convolutional Neural Network)-LSTM (Long Short-Term Memory) model to generate meaningful captions, and the advantages and limitations of each method are discussed.

Keywords— AI, Deep learning, CNN, LSTM

I. Introduction

Every day we encounter images in many ways, e.g., on the Internet, in news articles, document diagrams and advertisements. Humans usually find it easy to interpret these images and give a textual description. However, if machines need to give a textual description of an image, they need to understand the semantics and the context of the image. A long-standing goal in the field of Artificial Intelligence is to enable machines to see and understand the images of our surroundings. Most photo posts on social networks like Facebook and Instagram hardly contain any description or caption; hence, lots of opinions and emotions are conveyed through visual content only. Today, social networks have grown to be one of the most important sources for people to acquire information on all aspects of their lives. Social media images provide a potentially rich source for understanding public opinions and sentiments. Such an understanding of images may in turn benefit or even enable many real-world applications such as advertisement, product-based recommendation, marketing and health care.

Fig: Example of image captioning. Input: an image of a skateboarder; output: "man is skateboarding on ramp".

In the past few years, computer vision in the image processing field has made significant progress, for example in image classification [1] and object detection [2]. Due to this, it has become possible to automatically generate one or more sentences describing the visual content of an image, the problem known as image captioning. Automatically generating complete natural image descriptions has many potential impacts, such as titles attached to news images, information associated with medical images, text-based image retrieval, information access for blind users, and human-robot interaction. These applications of image captioning have important theoretical and practical research value. Therefore, image captioning is a complex but meaningful task in the age of artificial intelligence. Mimicking the human ability to describe images with a machine is itself an impressive step along the line of Artificial Intelligence. The major challenge of this task is to capture how objects relate to each other in the image and to express these relations in natural language. Basically, this model takes an image as input and gives a caption for it. With the advancement of technology, the efficiency of image caption generation is also increasing.

The organizational structure of this paper is as follows: the second part focuses on the literature survey; the third part mainly introduces and analyzes the CNN and LSTM model of image captioning and its design ideas; finally, the proposed model for image caption generation and the conclusion are presented.

II. CNN and LSTM

This section mainly introduces the theoretical concepts of CNN and LSTM for image caption generation.

A. CNN for extracting features

Image captioning techniques are mainly categorized into two types: one based on the template method and the other based on the encoder-decoder structure. A Convolutional Neural Network (CNN) is a deep learning neural network that is specifically designed for processing structured arrays of data such as images. A CNN is very good at identifying key features and patterns in the input image, such as lines, circles, and even eyes and faces. Another property that makes CNNs so powerful is that they can work directly on a raw image without preprocessing. They have many applications, such as photo and video recognition, image classification, medical image analysis, computer vision and natural language processing (NLP).

The mathematical function of convolution, from which the Convolutional Neural Network takes its name, is a special kind of linear operation in which two functions are combined to produce a third function that expresses how the shape of one function is modified by the other. A convolutional neural network is a feed-forward neural network, often with up to 20 or 30 layers. With three or four convolution layers it is possible to recognize handwritten digits, and with about 25 layers it is possible to distinguish human faces. The basic layers of a CNN architecture are the Convolution Layer, the Pooling Layer and the Fully Connected Layer. In addition to these three layers, there are two more important components: the dropout layer and the activation function.
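To make the sliding-filter operation just described concrete, here is a minimal NumPy sketch (illustrative, not from the paper): the M×M filter slides across the image and a dot product is taken at each position, producing the feature map.

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid 'convolution' as used in CNNs (strictly, cross-correlation):
    slide the kernel over the image, taking a dot product at each position."""
    m, n = kernel.shape
    h, w = image.shape
    out = np.zeros((h - m + 1, w - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product between the kernel and the image patch it covers
            out[i, j] = np.sum(image[i:i + m, j:j + n] * kernel)
    return out

# Example: a 3x3 vertical-edge filter responds to edges and corners
image = np.random.rand(8, 8)
edge_filter = np.array([[-1.0, 0.0, 1.0],
                        [-2.0, 0.0, 2.0],
                        [-1.0, 0.0, 1.0]])
feature_map = conv2d(image, edge_filter)  # shape (6, 6)
print(feature_map.shape)
```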

Fig 1: CNN Architecture

The Convolutional Layer is the first layer and mainly extracts the various features and characteristics from the input images. In this layer, the mathematical convolution operation is carried out between the input image and a filter of a specific size M×M. The dot product is taken between the filter and the sections of the input image it covers as it slides across the image. The outcome is the feature map, which contains information about the image such as its corners and edges. This feature map is then given as input to further layers, which learn a range of other features from the input image.

A Pooling Layer is usually applied after the Convolutional Layer. The main goal of this layer is to reduce the size of the feature map in order to reduce computational costs. This is achieved by reducing the connections between layers and operating independently on each feature map. There are different types of pooling procedures, depending on the mechanism used.

Fully Connected (FC) layers consist of neurons together with their weights and biases, and are used to connect the neurons between two different layers. These layers are usually placed before the output and form the last layers of the CNN architecture. Here, the output of the previous layers is flattened and fed to the FC layer. The flattened vector then passes through some additional FC layers, where the mathematical operations usually take place. The classification process begins at this stage.

Dropout: when all the features are connected to the FC layer, overfitting on the training dataset can occur. Overfitting happens when a model works so well on the training data that its performance degrades when it is used on new data. To conquer this problem, a dropout layer is applied, in which some neurons are dropped from the neural network during the training process, resulting in a smaller model. With a dropout rate of 0.3, 30% of the nodes are dropped out randomly from the neural network.

Finally, one of the most important components of the CNN model is the activation function. Activation functions are used to learn and approximate any kind of continuous and complex relationship between the variables of the network. In simple words, they determine which information flows forward through the network and which does not, and they add non-linearity to the network. There are several commonly used activation functions, such as ReLU, Softmax, tanh and Sigmoid, and each of them has a specific purpose. For a binary classification CNN model, the sigmoid and softmax functions are recommended; for multi-class classification, softmax is generally used.
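A minimal Keras sketch (an illustrative stack, not the paper's exact network) showing how these building blocks fit together in the order just described: convolution, pooling, flattening, a fully connected layer, dropout of 0.3 and a softmax output for multi-class classification.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical 10-class image classifier built from the layers described above
model = keras.Sequential([
    keras.Input(shape=(64, 64, 3)),
    layers.Conv2D(32, (3, 3), activation="relu"),  # convolution + ReLU
    layers.MaxPooling2D((2, 2)),                   # pooling cuts feature-map size
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                              # flatten for the FC layers
    layers.Dense(128, activation="relu"),          # fully connected layer
    layers.Dropout(0.3),                           # randomly drop 30% of nodes
    layers.Dense(10, activation="softmax"),        # softmax multi-class output
])
model.summary()
```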
B. LSTM for generating captions

The main drawback of the RNN is that the vanishing/exploding gradient effect can occur if the sequence is very long or if the neural network has more than one hidden layer, due to backpropagation. To overcome these issues, Long Short-Term Memory (LSTM) was developed. LSTM is a type of RNN architecture that addresses the vanishing/exploding gradients and allows learning of long-term dependencies. LSTM has risen to prominence with state-of-the-art performance in speech recognition, language modeling, translation and image captioning. LSTM can preserve information for longer periods than an RNN. It mainly uses long-term memories (information collected a long time back) and short-term memories (information collected a few timestamps back), along with the current event, to generate a new modified long-term memory. It does this by trying to "remember" all the past knowledge that the network has seen so far and by "forgetting" irrelevant data. In simple words, at each time step it filters the memory that needs to be passed to the next time step.

LSTM Architecture:

Fig 2: LSTM Architecture

There are mainly two outputs from one LSTM unit, c_t and h_t. The hidden state h_t is the short-term memory obtained from the immediately previous steps, and the vector c_t is the cell state, which is responsible for storing long-term memory events. LSTMs make use of a mechanism called gates to add and remove information in this cell state. An LSTM network mainly consists of four different gates with different purposes, as described below:
1. Forget Gate: determines which information from the previous data should be discarded.
2. Input Gate: determines what information can be written onto the cell state from the current input.
3. Remember (input modulation) Gate: modulates the information that the input gate will write onto the internal cell state.
4. Output Gate: determines what output (the next hidden state) to generate from the current internal cell state.

Working of an LSTM recurrent unit: the unit takes the current input, the previous hidden state and the previous internal cell state, and then calculates the values of the four gates as follows:
1. Calculate the parameterized vectors for the current input and the previous hidden state by multiplying each, element-wise, with the respective weights for each gate.
2. Apply the respective activation function for each gate element-wise on the parameterized vectors.
3. Calculate the current internal cell state by first computing the element-wise product of the input gate and the input modulation gate, then computing the element-wise product of the forget gate and the previous internal cell state, and finally adding the two vectors: $c_t = i \odot g + f \odot c_{t-1}$.
4. Lastly, calculate the current hidden state by taking the element-wise hyperbolic tangent of the current internal cell state vector and multiplying it element-wise with the output gate.
Some of the drawbacks of LSTMs are longer training times, large memory requirements and the inability to parallelize training.
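Writing the gate computations of steps 1-3 out explicitly, the standard LSTM update equations (a conventional formulation consistent with the description above; W and b denote the learned weight matrices and biases, and sigma the sigmoid function) are:

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
g_t &= \tanh\left(W_g\,[h_{t-1}, x_t] + b_g\right) && \text{(input modulation gate)} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
c_t &= i_t \odot g_t + f_t \odot c_{t-1} && \text{(current cell state)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(current hidden state)}
\end{aligned}
```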
application prospect. This Image Captioning deep learning
model is very useful to inspect the large amount of
III. PROPOSED MODEL unstructured and unlabeled data to detect the patterns in
those images for guiding the Self driving cars, for building
Our model includes use of deep learning for image the software to guide blind people.
captioning. We are using two techniques mainly CNN and
LSTM for image classification. So, to make our image
caption generator model, we will be merging these
architectures i.e. CNN-LSTM model. It is also known as
encoder-decoder model. The neural network-based image
captioning methods work as just simple end to end manner.
These methods are very similar to the encoder-decoder
framework-based neural machine translation. In this
network, global image features are extracted from the
hidden activations of CNN and then fed them into an LSTM
to generate a sequence of words.
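As a sketch of the encoder step ("global image features extracted from the hidden activations of a CNN"), one common approach, assumed here purely for illustration, is to take a CNN pretrained on ImageNet, drop its classification head and read off the pooled activations as a fixed-length feature vector:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

# Pretrained encoder: no classification head, global average pooling instead
encoder = InceptionV3(weights="imagenet", include_top=False, pooling="avg")

def extract_features(path: str) -> np.ndarray:
    """Return a 2048-dim global feature vector for one image."""
    img = tf.keras.utils.load_img(path, target_size=(299, 299))
    x = tf.keras.utils.img_to_array(img)[np.newaxis, ...]
    return encoder.predict(preprocess_input(x), verbose=0)[0]

# features = extract_features("example.jpg")  # hidden activations, shape (2048,)
```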

CNN-RNN model. Object detection using CNN: the CNN provides optimistic results for object detection and is well suited for image captioning. RNN-LSTM for generating captions: the RNN-LSTM is used to generate meaningful captions from the image and the object detection features; the input is the object detection output, and the output is the caption for the particular image.

In the past few years, image captioning has made significant improvement. The neural image caption generator gives a beneficial framework for learning to map various images to human-level image captions. Neural networks can handle these issues by generating suitable, expressive and highly fluent captions using TensorFlow. The efficiency of content-based image retrieval can be enhanced by textual descriptions of images, expanding the application scope of visual understanding in science, security, defense and other fields, which gives the approach wide application prospects. This image captioning deep learning model is very useful for inspecting large amounts of unstructured and unlabeled data and detecting patterns in those images, for example for guiding self-driving cars or for building software that guides blind people.

The CNN-LSTM architecture is built by using CNN layers for feature extraction on the input data, combined with LSTMs to support sequence prediction. This model is specifically designed for sequence prediction problems with spatial inputs, like images or videos, and is widely used in activity recognition, image description, video description and many other tasks. CNN-LSTMs are generally used when the inputs have spatial structure, such as the 2D structure of pixels in an image or the 1D structure of words in a sentence, paragraph or document, and also have a temporal structure, such as the order of images in a video or of words in text, or require the generation of output with temporal structure, such as the words in a textual description.

As we are using an LSTM instead of a plain RNN, we introduce more controlling knobs, which control the flow and mixing of inputs according to the trained weights, and thus bring more flexibility in controlling the outputs. LSTM therefore gives us the most controllability and thus better results.
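A minimal Keras sketch of this encoder-decoder idea, in the common "merge" form (an illustrative configuration with assumed sizes: 2048-dim image features as produced above, a 5000-word vocabulary and captions of at most 34 tokens; not the paper's exact settings):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, FEAT_DIM = 5000, 34, 2048  # assumed sizes

# Encoder branch: global CNN feature vector of the image
img_in = keras.Input(shape=(FEAT_DIM,))
img_emb = layers.Dense(256, activation="relu")(layers.Dropout(0.3)(img_in))

# Decoder branch: the partial caption generated so far
txt_in = keras.Input(shape=(MAX_LEN,))
txt_emb = layers.Embedding(VOCAB_SIZE, 256, mask_zero=True)(txt_in)
txt_feat = layers.LSTM(256)(layers.Dropout(0.3)(txt_emb))

# Merge image and text features, then predict the next word of the caption
merged = layers.add([img_emb, txt_feat])
hidden = layers.Dense(256, activation="relu")(merged)
out = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

model = keras.Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")
model.summary()
```

At inference time the caption is generated word by word: starting from a start token, the model repeatedly predicts the next word and appends it until an end token or MAX_LEN is reached.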
DATASET:

We are using the Flickr8k dataset as a standard benchmark dataset for sentence descriptions of images. This dataset consists of 8000 images with five captions for each image. Each caption provides a clear description of the entities and events present in the image. The dataset represents a diversity of scenarios and events and does not contain images of well-known people or places, which keeps the dataset generic. It is divided into 6000 images for the training set, 1000 images for the development set and 1000 images for the test set.
The advantages of using this dataset for this project are:
• A single image is mapped to multiple captions, which makes the model more generic and helps avoid overfitting.
• The various categories of training images allow the image captioning model to work for multiple categories of images, and hence make the model more robust.
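For reference, a short sketch of how this split might be loaded; the file names (Flickr8k.token.txt, Flickr_8k.trainImages.txt, etc.) follow the dataset's commonly distributed layout and are assumptions, not something specified in this paper:

```python
from collections import defaultdict

# Captions are distributed as "<image>.jpg#<0-4>\t<caption>" lines (assumed layout)
captions = defaultdict(list)
with open("Flickr8k.token.txt", encoding="utf-8") as f:
    for line in f:
        img_id, caption = line.strip().split("\t")
        captions[img_id.split("#")[0]].append(caption.lower())

def load_split(path: str) -> set:
    """Read one split file: one image file name per line."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

train = load_split("Flickr_8k.trainImages.txt")  # 6000 images
dev = load_split("Flickr_8k.devImages.txt")      # 1000 images
test = load_split("Flickr_8k.testImages.txt")    # 1000 images
print(len(train), len(dev), len(test), len(captions))  # 5 captions per image
```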

IV. CONCLUSION

In this paper, we have presented a deep learning approach for generating captions for images. Our described model is based on a CNN feature extraction model that encodes an image into a vector representation, followed by an LSTM decoder model that generates the corresponding sentence from the learned image features.

REFERENCES

[1] Philip Kinghorn, Li Zhang, Ling Shao, "A region-based image caption generator with refined descriptions", Elsevier B.V., 6 July 2017, University of Northumbria, Newcastle NE1, United Kingdom.

[2] Priyanka Raut, Rushali A. Deshmukh, "An Advanced Image Captioning using combination of CNN and LSTM", Turkish Journal of Computer and Mathematics Education, 5 April 2021, Savitribai Phule Pune University, Maharashtra, India.

[3] Shuang Liu, Liang Bai, Yanli Hu, Haoran Wang, "Image Captioning Based on Deep Neural Networks", MATEC Web of Conferences, 2018, College of Systems Engineering, National University of Defense Technology, 410073 Changsha, China.

[4] Raj Kadam, Uday Kumbhar, Onkar Gulik, Dr. Makrand Shahade, "Object Detection and Automatic Image Captioning Using TensorFlow", International Journal of Future Generation Communication and Networking, 2020, Department of Computer Engineering, JSPM's RSCOE, Pune.

[5] Priyanka Kalena, Aromal Nair, Nishi Malde, Saurabh Parkar, "Visual Image Caption Generator Using Deep Learning", ICAST-2019, K.J. Somaiya College of Engineering, Mumbai.
