


Image Caption using CNN & LSTM
Ali Ashraf Mohamed, Helwan University · May 2020

1. Introduction

Machine learning is currently a major trend in artificial intelligence, and it is increasingly applied to build high-performing, intelligent systems. Deep learning, a subset of machine learning, delivers high accuracy and strong performance, and in this paper it is applied to image description. Image description is the task of describing the content of an image; the idea is to detect the objects in the input image and the actions taking place in it. There are two main approaches to image description: bottom-up and top-down. Bottom-up approaches first generate content words for an input image and then combine them into a caption. Top-down approaches generate a semantic representation of the input image, which is then decoded into a caption using architectures such as recurrent neural networks. Image description has many potential benefits, for instance helping visually impaired people better understand the content of images on the web.

To see what the task involves, consider the picture below. What do you see in it?

Some of you might say "a black dog in a grassy area", some may say "a black dog with white spots", and others might say "a dog on grass with some yellow flowers". All of these captions are relevant for this image, and there may be others too. The point is that it is easy for us, as humans, to glance at a picture and describe it. But can a computer program produce an equally relevant caption?

2. Related Work

In this section, we discuss prior work whose experimental results were obtained on the MSCOCO dataset. Within an encoder/decoder framework, the authors added a component called a guiding network to their proposed model. The guiding network learns a guiding vector from the annotation vectors via a neural network, v = g(A), where A is the set of annotation vectors.

Generating natural language descriptions from visual data is an important problem that has long been studied in computer vision. This has led to complex systems consisting of visual primitive recognizers combined with structured formal languages such as And-Or graphs or logic systems. Recently, the problem of describing still images with natural text has gained huge interest.

3. Methodology

We use a CNN and an LSTM to achieve our goal of an image caption generator.

We start with what a CNN is and how it helps in our problem. A Convolutional Neural Network (CNN) is a deep artificial neural network used for image classification, computer vision, image recognition and object detection. For image classification, a CNN takes an input image, processes it and classifies it under a certain category (e.g. dog, cat, etc.). It scans the image from left to right and top to bottom to pull out important features and combines them to classify the image.

Figure 1

Secondly, what is an LSTM? LSTM stands for Long Short-Term Memory; LSTMs are a type of recurrent neural network (RNN) well suited to sequence prediction problems. Based on the previous text, an LSTM can predict what the next word will be. It has proven more effective than the traditional RNN by overcoming the RNN's short-term memory limitation: an LSTM can carry relevant information throughout the processing of the input and, with a forget gate, it discards non-relevant information.

We merge these two models into one model, called a CNN-RNN model. In general, our approach draws on the success of the top-down image captioning models mentioned above. We use a deep convolutional neural network to extract visual image features, and semantic features are extracted from a semantic tagging model. The visual features from the CNN and the semantic features from the tagging model are concatenated and fed as the input to a Long Short-Term Memory (LSTM) network, which then generates the captions.
3.1 Our Model

Our model consists of three main phases:

A. Image feature extraction: The features of the images from the Flickr8k dataset are extracted using the Xception model because of its performance in object identification; it is more accurate than VGG16, as we will see later. Xception is a convolutional neural network consisting of 36 convolutional layers, and this configuration learns very fast. The extracted features are processed by a Dense layer to produce a 2048-element vector representation of the photo, which is passed on to the LSTM layer.

B. Sequence processor: The sequence processor handles the text input by acting as a word embedding layer. The embedding layer contains rules to extract the required features of the text and uses a mask to ignore padded values. The network is then connected to an LSTM for the final phase of the image captioning.

C. Decoder: The final phase of the model combines the input from the image feature extraction phase and the sequence processor phase using an addition operation; the result is fed to a 256-neuron layer and then to a final output Dense layer, which produces a softmax prediction of the next word in the caption over the entire vocabulary formed from the text data processed in the sequence processor phase. The structure of the network, showing the flow of images and text, is given in Figure 2.
3.1.1 Model Architecture

Xception model architecture (Figure 2)

Input_3: the feature vector from the pretrained model (VGG16 or Xception); the input shape is 4096 with VGG16 and 2048 with Xception.

Input_4: the sequence of words (the caption).

Add_2: the decoder phase.
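To make the architecture of Figure 2 concrete, here is a minimal Keras sketch of a merge-style CNN-LSTM model of this kind. It is not the author's exact code: the embedding size, dropout rates, LSTM width and the max_length value are assumptions, while the 2048-element feature input, the 256-neuron decoder layer, the addition operation and the softmax over the 8,464-word vocabulary follow the description above.

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size = 8464     # fixed vocabulary size reported in Section 4.2
max_length = 34       # assumption: longest cleaned caption in Flickr8k
feature_dim = 2048    # Xception bottleneck features (4096 for VGG16)

# Image feature branch (Input_3 in Figure 2)
inputs1 = Input(shape=(feature_dim,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

# Sequence processor branch (Input_4 in Figure 2)
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# Decoder (Add_2 in Figure 2): merge both branches and predict the next word
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```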

4. Experiments

4.1 Data

There was a challenge in choosing the dataset on which to train our model. More than one dataset could fulfill our purpose, including MS-COCO (about 180k images), Flickr30k (about 30k images) and Flickr8k (about 8k images), but we chose Flickr8k because we were constrained by time and by the computing resources available to us. The dataset is divided into 6,000 images for training, 1,000 for validation and 1,000 for testing.

4.2 Preprocessing of captions (image descriptions)

Each image has five descriptions (captions). The main function here is a clean() function, which takes all descriptions and performs a basic data clean:

1. Removing punctuation
2. Removing words that contain numbers
3. Converting all descriptions to lowercase
4. Removing special tokens (like '%', '$', '#', etc.)

A minimal sketch of such a routine is given below. We then applied tokenization to our dataset and used a fixed vocabulary size of 8,464.
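The following sketch illustrates the four cleaning steps above, assuming the captions are held in a dictionary mapping each image id to its list of caption strings; the function name and data layout are illustrative, not the author's code.

```python
import string

def clean_descriptions(descriptions):
    """Apply the basic cleaning steps to every caption.
    `descriptions` maps an image id to a list of raw caption strings."""
    table = str.maketrans('', '', string.punctuation)
    for img_id, caption_list in descriptions.items():
        for i, caption in enumerate(caption_list):
            words = caption.split()
            words = [w.lower() for w in words]            # 3) lowercase
            words = [w.translate(table) for w in words]   # 1) & 4) strip punctuation and special tokens
            words = [w for w in words if w.isalpha()]     # 2) drop words containing numbers
            caption_list[i] = ' '.join(words)
    return descriptions
```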

4.3 Preprocessing of images

We know that we cannot feed an image directly to our model, so we have to do some preprocessing before feeding it in (a sketch follows the list):

1. Resize each image to 299 × 299 for the Xception model or 224 × 224 for VGG16
2. Flatten it
3. Scale the image pixels (normalization)
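As a rough illustration of steps 1 and 3 with the standard Keras helpers (the flattening of step 2 is not shown here, and the exact pipeline may differ from the author's):

```python
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.applications.xception import preprocess_input

def preprocess_image(path, target_size=(299, 299)):
    """Resize and scale an image before feeding it to the Xception extractor.
    For VGG16, use target_size=(224, 224) and the VGG16 preprocess_input."""
    image = load_img(path, target_size=target_size)  # 1) resize
    image = img_to_array(image)                      # to a (299, 299, 3) array
    image = np.expand_dims(image, axis=0)            # add the batch dimension
    return preprocess_input(image)                   # 3) scale pixels for Xception
```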

4.4 Evaluation

There were many challenges. The first challenge is the choice of convolutional feature extractor. For identical decoder architectures we use VGG16 and Xception. These two models were trained on the ImageNet dataset to classify images into 1,000 different classes. However, our purpose here is not to classify the image but simply to obtain a fixed-length informative vector for each image; this process is called automatic feature engineering. We therefore remove the last softmax layer from the two models and extract a fixed-length bottleneck feature vector for every image (2048 elements for Xception, 4096 for VGG16), as sketched below.
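A minimal sketch of this feature extraction, using the Keras applications API (include_top=False with average pooling) rather than literally popping the softmax layer; the effect is the same 2048-element bottleneck vector per image.

```python
from tensorflow.keras.applications.xception import Xception

# Xception pre-trained on ImageNet, without its 1000-way softmax head;
# pooling='avg' yields a fixed-length 2048-element vector per image.
feature_extractor = Xception(weights='imagenet', include_top=False, pooling='avg')

# Example (using preprocess_image from the sketch above):
# features = feature_extractor.predict(preprocess_image('dog.jpg'))
# features.shape -> (1, 2048)

# For VGG16, the 4096-element vector comes from the fc2 layer instead, e.g.:
# from tensorflow.keras.applications.vgg16 import VGG16
# from tensorflow.keras.models import Model
# vgg = VGG16(weights='imagenet')
# vgg_extractor = Model(vgg.input, vgg.get_layer('fc2').output)
```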

The second challenge is the single model versus ensemble comparison. While other methods have reported performance boosts from ensembling, we report single-model performance in our results. In our evaluation, we compare directly only with results that use the comparable Xception and VGG16 features, via the BLEU score.

5. Implementation

The model was implemented in Python. Keras 2.0 was used to implement the deep learning model, with the TensorFlow library installed as the backend of the Keras framework for creating and training deep neural networks. TensorFlow is a deep learning library developed by Google. The neural network was trained on Google Colab.

We also used these APIs (a usage sketch for items 2 and 3 follows the list):

1. Keras Model API
2. Keras pad_sequences() API
3. Keras Tokenizer API
4. Keras VGG16 API
5. Keras Xception API
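The sketch below shows how the Tokenizer and pad_sequences APIs are typically combined to build (image feature, partial caption, next word) training pairs for a next-word model like ours; the helper name, the all_captions list and the max_length value are assumptions for the sketch, not the author's code.

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Fit the tokenizer on all cleaned captions (all_captions is an assumed list of strings).
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_captions)
vocab_size = len(tokenizer.word_index) + 1

def caption_to_pairs(caption, photo_feature, max_length):
    """Turn one caption into (photo feature, padded partial sequence, next word) pairs."""
    X1, X2, y = [], [], []
    seq = tokenizer.texts_to_sequences([caption])[0]
    for i in range(1, len(seq)):
        in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]   # caption so far
        out_word = to_categorical([seq[i]], num_classes=vocab_size)[0]  # next word, one-hot
        X1.append(photo_feature)
        X2.append(in_seq)
        y.append(out_word)
    return np.array(X1), np.array(X2), np.array(y)
```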

6. Results

We use the BLEU score, an algorithm that has been used to evaluate the quality of machine-translated text; we can use it to check the quality of our generated captions. BLEU is language independent, lies in [0, 1], and the higher the score, the better the quality of the caption. A sketch of how the scores can be computed is given below, followed by our results.
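BLEU-1 through BLEU-4 can be computed, for example, with NLTK's corpus_bleu, as in this illustrative sketch (the exact implementation used is not specified here); references and candidates are assumed to be tokenized word lists, with each test image contributing its five reference captions and the single generated caption.

```python
from nltk.translate.bleu_score import corpus_bleu

def report_bleu(references, candidates):
    """references: per image, a list of its 5 tokenized ground-truth captions.
    candidates: per image, the tokenized caption generated by the model."""
    print('BLEU-1: %f' % corpus_bleu(references, candidates, weights=(1.0, 0, 0, 0)))
    print('BLEU-2: %f' % corpus_bleu(references, candidates, weights=(0.5, 0.5, 0, 0)))
    print('BLEU-3: %f' % corpus_bleu(references, candidates, weights=(0.33, 0.33, 0.33, 0)))
    print('BLEU-4: %f' % corpus_bleu(references, candidates, weights=(0.25, 0.25, 0.25, 0.25)))
```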

Metric    Xception    VGG16
BLEU-1    0.550179    0.460179
BLEU-2    0.350132    0.320112
BLEU-3    0.150275    0.100245
BLEU-4    0.050874    0.040572

6.1 Tests with VGG16 and Xception

Example captions were generated with both VGG16 and Xception; from these we can see the difference between VGG16 and Xception.

7. Conclusion and Future Work

In this paper we have implemented a deep learning approach to the captioning of images. The sequential API of Keras was used with TensorFlow as a backend to implement the deep learning architecture, achieving an effective BLEU score of 55.01% with the Xception model.

We can improve our results with a number of modifications, for example:

1. Using a larger dataset.
2. Changing the model architecture, e.g. including an attention module.
3. Doing more hyperparameter tuning (learning rate, batch size, number of layers, number of units, dropout rate, batch normalization, etc.).
4. Using the cross-validation set to understand overfitting.
5. Using beam search instead of greedy search during inference (see the sketch after this list).
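For reference, a minimal greedy decoding loop of the kind item 5 would replace might look like the sketch below; the 'startseq'/'endseq' sentinel tokens and the tokenizer/max_length objects are assumptions carried over from the earlier sketches, not the exact implementation. Beam search would instead keep the k most probable partial captions at each step rather than only the single best one.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_feature, max_length):
    """Greedy decoding: repeatedly feed the caption so far and keep the single
    most probable next word until 'endseq' or max_length is reached.
    photo_feature is the (1, 2048) bottleneck vector of the image."""
    in_text = 'startseq'  # assumed start-of-caption sentinel token
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([in_text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = model.predict([photo_feature, seq], verbose=0)
        word = tokenizer.index_word.get(int(np.argmax(yhat)))
        if word is None or word == 'endseq':  # assumed end-of-caption sentinel token
            break
        in_text += ' ' + word
    return in_text.replace('startseq', '').strip()
```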

8. References
• https://www.groundai.com/project/learning-to-evaluate-image-captioning/1
• https://cs231n.github.io/understanding-cnn/
• https://colah.github.io/posts/2015-08-Understanding-LSTMs/
• https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/
• https://arxiv.org/abs/1909.09586
• https://arxiv.org/pdf/1603.05201.pdf
• https://arxiv.org/pdf/1612.07600.pdf
• https://arxiv.org/abs/1703.09137
• https://arxiv.org/abs/1601.03896
• https://www.researchgate.net/publication/321787151_Deep_learning_in_big_data_Analytics_A_comparative_study
• https://proceedings.mlr.press/v37/xuc15.pdf
• https://cs231n.github.io/transfer-learning/
• https://keras.io/api/applications/vgg/#vgg16-function
• https://keras.io/api/applications/xception/
• https://keras.io/api/applications/vgg/#vgg19-function
• https://github.com/hlamba28/Automatic-Image-Captioning/blob/master/Automatic%20Image%20Captioning.ipynb
