Image caption using CNN and LSTM
Ali Ashraf Mohamed
1. Introduction
2. Related Work
In this section we discuss related work whose experimental results were obtained on the MSCOCO dataset. Within the encoder/decoder framework, the authors added a component called a guiding network to their proposed model. The guiding network learns a guiding vector from the set of annotation vectors A with a neural network, v = g(A).
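Since the text only states v = g(A), the following is a minimal sketch of one plausible reading, in which g is a small Keras network that pools the set of annotation vectors A into a single guiding vector v; the layer sizes and names here are illustrative assumptions, not the cited authors' implementation.

    import tensorflow as tf
    from tensorflow.keras import layers

    # A: set of annotation vectors, e.g. 64 spatial locations x 512 channels
    # from a CNN encoder (shapes are illustrative assumptions).
    annotations = layers.Input(shape=(64, 512), name="A")

    # g(A): pool the annotation set, then map it through a dense layer
    # to obtain the guiding vector v.
    pooled = layers.GlobalAveragePooling1D()(annotations)
    v = layers.Dense(512, activation="tanh", name="guiding_vector")(pooled)

    guiding_network = tf.keras.Model(annotations, v, name="g")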
3. Methodology
Here we use a CNN and an LSTM to achieve our goal of an image caption generator. We start with what a CNN is and how we can benefit from it in our problem. A CNN image classifier takes an input image, processes it, and classifies it under certain categories (e.g., dog, cat, etc.). It scans the image from left to right and top to bottom to pull out important features, then combines those features to classify the image.
Figure 1
LSTM stands for Long Short-Term Memory; LSTMs are a type of RNN (recurrent neural network) well suited to sequence prediction problems. Based on the previous text, we can predict what the next word will be. LSTMs have proven more effective than traditional RNNs by overcoming their short-term memory limitation: an LSTM can carry relevant information throughout the processing of the input sequence and, with a forget gate, it discards non-relevant information.
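Concretely, the gates referred to here are those of the standard LSTM cell update (notation as in the Understanding-LSTMs post cited in the references); this summary is added for completeness and follows the standard formulation:

    f_t = σ(W_f · [h_{t-1}, x_t] + b_f)      (forget gate)
    i_t = σ(W_i · [h_{t-1}, x_t] + b_i)      (input gate)
    C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)   (candidate cell state)
    C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t          (cell state update)
    o_t = σ(W_o · [h_{t-1}, x_t] + b_o)      (output gate)
    h_t = o_t ⊙ tanh(C_t)                    (hidden state)

The forget gate f_t is what lets the cell discard non-relevant information, while the cell state C_t carries relevant information across time steps.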
We merge these two models into a single CNN-RNN model. In general, our approach draws on the success of the top-down image captioning models listed above. We use a deep convolutional neural network to extract the visual image features, while semantic features are extracted from a semantic tagging model. The visual features from the CNN and the semantic features from the tagging model are concatenated and fed as input to a Long Short-Term Memory (LSTM) network, which then generates the caption.
3.1 Our Model
A. Image Feature Extraction: The features of the images from the Flickr8k dataset are extracted using the Xception model, chosen for its performance in object identification; it is more accurate than VGG16, as we will see later. Xception is a convolutional neural network consisting of 36 convolutional layers, and this model configuration learns very quickly. The extracted features are processed by a Dense layer to produce a 2048-element vector representation of the photo, which is passed on to the LSTM layer.
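A hedged sketch of this step using the Keras Xception application (the pooling="avg" option, which yields the 2048-element vector, and the helper name extract_features are our assumptions; the text does not give the exact configuration):

    import numpy as np
    from tensorflow.keras.applications.xception import Xception, preprocess_input
    from tensorflow.keras.preprocessing.image import load_img, img_to_array

    # Xception without its classification head; global average pooling
    # turns the final feature maps into a single 2048-element vector.
    feature_extractor = Xception(weights="imagenet", include_top=False, pooling="avg")

    def extract_features(image_path):
        image = load_img(image_path, target_size=(299, 299))  # Xception input size
        x = img_to_array(image)
        x = np.expand_dims(x, axis=0)        # add the batch dimension
        x = preprocess_input(x)              # scale pixels to [-1, 1]
        return feature_extractor.predict(x)  # shape: (1, 2048)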
B. Sequence Processor: The sequence processor handles the text input by acting as a word embedding layer. The embedding layer contains the rules to extract the required features of the text and includes a mask to ignore padded values. The network is then connected to an LSTM for the final phase of the image captioning.
C. Decoder: The final phase of the model combines the input from the image extractor phase and the sequence processor phase using an addition operation; the result is fed to a 256-neuron layer and then to a final output Dense layer that produces a softmax prediction of the next word in the caption over the entire vocabulary, which was formed from the text data processed in the sequence processor phase. The structure of the network, showing the flow of images and text, is given in Figure 2.
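Putting parts A, B, and C together, the following is a minimal sketch of the merge model described above (the 34-token maximum caption length, the dropout rates, and the embedding size are assumptions; the add merge reflects our reading of the addition operation):

    from tensorflow.keras import layers, Model

    vocab_size = 8464   # vocabulary size used in this paper
    max_length = 34     # assumed maximum caption length

    # A. Image feature extractor input: the 2048-element Xception vector.
    image_input = layers.Input(shape=(2048,))
    image_features = layers.Dense(256, activation="relu")(
        layers.Dropout(0.5)(image_input))

    # B. Sequence processor: word embedding (masking padded values) + LSTM.
    caption_input = layers.Input(shape=(max_length,))
    embedded = layers.Embedding(vocab_size, 256, mask_zero=True)(caption_input)
    sequence_features = layers.LSTM(256)(layers.Dropout(0.5)(embedded))

    # C. Decoder: merge by addition, then a 256-neuron layer and a softmax
    # over the whole vocabulary to predict the next word.
    merged = layers.add([image_features, sequence_features])
    hidden = layers.Dense(256, activation="relu")(merged)
    output = layers.Dense(vocab_size, activation="softmax")(hidden)

    model = Model(inputs=[image_input, caption_input], outputs=output)
    model.compile(loss="categorical_crossentropy", optimizer="adam")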
3.1.1 Model Architecture
Input_3: the input here is the feature vector from the pretrained model (VGG16 or Xception). The input shape is 4096 with VGG16 but 2048 with Xception.
4. Experiments
4.1 Data
For the text data, we first removed punctuation, then applied tokenization to our dataset, using a fixed vocabulary size of 8,464 words.
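A minimal sketch of this step with the Keras Tokenizer (the two example captions are placeholders; 8,464 is the vocabulary size reported above):

    import string
    from tensorflow.keras.preprocessing.text import Tokenizer

    # Placeholder training captions; in practice these come from Flickr8k.
    captions = ["A dog runs on the grass.", "A child plays in the park."]

    # 1. Remove punctuation and lower-case the text.
    table = str.maketrans("", "", string.punctuation)
    cleaned = [c.translate(table).lower() for c in captions]

    # 2. Tokenize with the fixed vocabulary size used in the paper.
    tokenizer = Tokenizer(num_words=8464)
    tokenizer.fit_on_texts(cleaned)
    sequences = tokenizer.texts_to_sequences(cleaned)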
We know that we cannot feed images directly to our model, so we apply some preprocessing before feeding them in (a minimal sketch follows this list):
1. Resize each image to 299 × 299 for the Xception model or 224 × 224 for VGG16.
2. Flatten it into an array.
3. Scale the image pixels (normalization).
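A minimal sketch of these steps (we interpret "flatten" as converting the image to an array with a batch dimension, since the CNN expects a 4-D tensor, and we show a simple division by 255 for normalization; each Keras application also ships its own preprocess_input helper):

    import numpy as np
    from tensorflow.keras.preprocessing.image import load_img, img_to_array

    def preprocess(image_path, model_name="xception"):
        # 1. Resize to the input size the chosen pretrained model expects.
        size = (299, 299) if model_name == "xception" else (224, 224)  # VGG16
        x = img_to_array(load_img(image_path, target_size=size))
        # 2. Add a batch dimension: shape becomes (1, H, W, 3).
        x = np.expand_dims(x, axis=0)
        # 3. Normalize pixel values from [0, 255] to [0, 1].
        return x / 255.0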
4.4 Evaluation
The second challenge is the single-model versus ensemble comparison. While other methods have reported performance boosts from ensembling, we report the performance of a single model. In our evaluation, we compare directly only against results that use the comparable Xception and VGG16 features, measured by BLEU score.
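BLEU scores can be computed with NLTK's corpus_bleu; a minimal sketch with placeholder captions (the references and candidate below are illustrative, not results from the paper):

    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    # Each image has several reference captions; the model emits one candidate.
    references = [[["a", "dog", "runs", "on", "grass"],
                   ["the", "dog", "is", "running"]]]
    candidates = [["a", "dog", "is", "running", "on", "grass"]]

    # BLEU-1 through BLEU-4, as commonly reported for image captioning.
    smooth = SmoothingFunction().method1
    weights = [(1, 0, 0, 0), (0.5, 0.5, 0, 0),
               (1/3, 1/3, 1/3, 0), (0.25, 0.25, 0.25, 0.25)]
    for n, w in enumerate(weights, start=1):
        score = corpus_bleu(references, candidates,
                            weights=w, smoothing_function=smooth)
        print(f"BLEU-{n}: {score:.3f}")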
5. Implementation
The model was implemented in Python. Keras 2.0 was used to build the deep learning model, with the TensorFlow library installed as the backend for the Keras framework for creating and training deep neural networks. TensorFlow is a deep learning library developed by Google. The neural network was trained on Google Colab.
6. Results
8. References
• https://www.groundai.com/project/learning-to-evaluate-image-captioning/1
• https://cs231n.github.io/understanding-cnn/
• https://colah.github.io/posts/2015-08-Understanding-LSTMs/
• https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/
• https://arxiv.org/abs/1909.09586
• https://arxiv.org/pdf/1603.05201.pdf
• https://arxiv.org/pdf/1612.07600.pdf
• https://arxiv.org/abs/1703.09137
• https://arxiv.org/abs/1601.03896
• https://www.researchgate.net/publication/321787151_Deep_learning_in_big_data_Analytics_A_comparative_study
• http://proceedings.mlr.press/v37/xuc15.pdf
• https://cs231n.github.io/transfer-learning/
• https://keras.io/api/applications/vgg/#vgg16-function
• https://keras.io/api/applications/xception/
• https://keras.io/api/applications/vgg/#vgg19-function
• https://github.com/hlamba28/Automatic-Image-Captioning/blob/master/Automatic%20Image%20Captioning.ipynb