RBBA: ResNet - BERT - Bahdanau Attention for Image Caption Generator
Abstract—In recent years, the topic of image caption generators has gained significant attention, and several successful projects have emerged in this field, showcasing notable advancements. Image caption generators automatically generate descriptive captions for images through encoder and decoder mechanisms. The encoder leverages computer vision models, while the decoder utilizes natural language processing models. In this study, we assess a comprehensive set of seven distinct methodologies, including six existing methods from prior research and one newly proposed. These methods are trained and evaluated with the bilingual evaluation understudy (BLEU) metric on the Flickr8K dataset. In our experiments, the proposed ResNet50 – BERT – Bahdanau Attention model outperforms the other models, achieving a BLEU-1 score of 0.532143 and a BLEU-4 score of 0.126316.

Index Terms—Deep learning, Natural language processing, Encoder-Decoder, Flickr8K, BLEU, Image Caption.

∗ Corresponding author: Duc Ngoc Minh Dang ([email protected])

I. INTRODUCTION

Generating textual descriptions or captions for images is one of the most challenging tasks for artificial intelligence (AI). Despite the difficulty, image caption generators have a wide variety of uses, from providing automatic image descriptions for the blind to enhancing image search outcomes and producing more interesting social media posts. The development of more precise and sophisticated image caption generators has advanced significantly in recent years, and they are now widely used across a variety of industries.

Image caption generators analyze the contents of an image using deep learning techniques and generate a description of what is happening in the image. Typically, image caption generators employ a combination of computer vision techniques to process the images and natural language processing (NLP) algorithms to generate the descriptions. These algorithms can be trained on enormous datasets composed of photos and their captions, teaching the system how to correlate various visual cues with relevant descriptions.

II. RELATED WORKS

A. Convolutional neural networks

Convolutional neural networks (CNNs) play a crucial role in computer vision. The emergence of architectures such as VGGNet [1] and ResNet [2] has significantly enhanced the efficacy of computer vision models across various tasks.

VGGNet [1] is a well-known convolutional neural network architecture consisting of multiple layers with small convolutional filters. It gained popularity for its simplicity and effectiveness in image classification tasks. The network typically follows a consistent pattern of stacking convolutional layers with 3x3 filters, followed by max-pooling layers to reduce the spatial dimensions. The VGGNet architecture offers various configurations, commonly known as VGG16 and VGG19, depending on the depth of the network. It has demonstrated impressive performance on various image classification benchmarks, establishing itself as a reliable and effective choice for deep-learning tasks.
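As a concrete illustration of this stacked 3x3-convolution-plus-pooling pattern, the sketch below builds a couple of VGG-style blocks with Keras; it is an illustrative example only, with assumed filter counts, not a configuration taken from this paper.

```python
# Minimal sketch of VGG-style blocks: a few 3x3 convolutions followed by
# 2x2 max pooling to halve the spatial dimensions. Assumes TensorFlow/Keras;
# illustrative only.
import tensorflow as tf
from tensorflow.keras import layers

def vgg_block(x, filters, convs_per_block=2):
    """Stack `convs_per_block` 3x3 convolutions, then downsample with max pooling."""
    for _ in range(convs_per_block):
        x = layers.Conv2D(filters, kernel_size=3, padding="same", activation="relu")(x)
    return layers.MaxPooling2D(pool_size=2, strides=2)(x)

inputs = tf.keras.Input(shape=(224, 224, 3))
x = vgg_block(inputs, 64)    # 224x224 -> 112x112
x = vgg_block(x, 128)        # 112x112 -> 56x56
```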
ResNet [2] is a groundbreaking convolutional neural network architecture that has revolutionized the field of computer vision. ResNet addresses a common challenge encountered in deep neural networks known as the vanishing gradient problem. As networks become deeper, the gradients can vanish, leading to difficulties in training and optimization. To address this issue, ResNet utilizes skip connections that allow the network to bypass certain layers. By doing so, ResNet enables the direct flow of information from earlier layers to subsequent layers, facilitating the learning process. ResNet has achieved remarkable success, outperforming previous models in various computer vision tasks such as image classification, object detection, and image segmentation. Its deep architecture, with variants like ResNet-50, ResNet-101, and ResNet-152, has become a standard benchmark in the field, and its impact extends beyond computer vision.
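To make the skip-connection idea concrete, the sketch below implements a basic identity residual block in Keras; it illustrates the general technique rather than the exact ResNet50 blocks used later in this paper.

```python
# Minimal sketch of a ResNet-style identity block. The block's output is
# F(x) + x: the convolutional path learns a residual, and the skip connection
# lets information and gradients flow directly past the stacked layers.
# Assumes TensorFlow/Keras; illustrative only.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                    # identity skip connection
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                 # F(x) + x
    return layers.ReLU()(y)

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = residual_block(inputs, 64)
```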
III. METHODOLOGIES

A. Merge-based Xception – Word2Vec (MXW2V)

In the MXW2V, the merged image and text features are processed using the ReLU activation function, and softmax is used as the activation function to predict the output word. The details of MXW2V are depicted in Figure 1.

B. Merge-based InceptionResnetV2 – GloVe (MIRG)

The MIRG employed in this approach builds upon the MXW2V, which utilized Xception and Word2Vec. However, it introduces notable improvements by leveraging the power of the InceptionResNetV2 [9] network and the GloVe [10] technique; the resulting architecture is illustrated in Figure 2. InceptionResNetV2 [9] is an exceptionally deep network, comprising a total of 164 layers. It combines the innovative ideas of the Inception module with residual connections, which reduce the vanishing gradient problem commonly encountered in deep networks and facilitate the training of highly complex models. The GloVe [10] technique changed the generation of word feature vectors by utilizing global word co-occurrence data. By capturing the semantic relationships between words, GloVe effectively merges the advantages of count-based Latent Semantic Analysis [11] and context-based Word2Vec [8].

C. Inject-based Xception – Word2Vec (IXW2V)

In this inject-based variant, the output of the image and text features is concatenated together and then passed through LSTMs [3]. By feeding the concatenated representation through LSTMs, the model can effectively learn the relationships between the image and text features and capture the contextual information necessary for predicting the next word in the sentence.

D. Inject-based InceptionResnetV2 – GloVe (IIRG)

Fig. 4. Architecture of “Inject InceptionResnetV2 - GloVe” method

Similar to the IXW2V approach, this architecture leverages the InceptionResnetV2 [9] and GloVe [10] methods to extract features from images and text, respectively. These features are subsequently concatenated and fed into LSTMs to predict the next word. Figure 4 illustrates the architecture of the IIRG model.
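To make this concatenate-then-decode pattern concrete, the sketch below shows a minimal Keras captioning decoder that combines a CNN image feature vector with embedded caption-prefix features and passes the result through an LSTM to predict the next word. The feature dimension, vocabulary size, and layer widths are assumed values for illustration, not the configurations of the IXW2V or IIRG models.

```python
# Sketch of a concatenate-then-LSTM caption decoder. Image features come from
# a pretrained CNN encoder (e.g. Xception or InceptionResNetV2); text features
# come from an embedding layer, which could be initialized with Word2Vec or
# GloVe vectors. Dimensions below are illustrative.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, IMG_DIM = 5000, 34, 2048   # assumed values

img_input = tf.keras.Input(shape=(IMG_DIM,), name="cnn_features")
img_feat = layers.Dense(256, activation="relu")(img_input)
img_feat = layers.RepeatVector(MAX_LEN)(img_feat)            # one copy per time step

txt_input = tf.keras.Input(shape=(MAX_LEN,), name="caption_prefix")
txt_feat = layers.Embedding(VOCAB_SIZE, 256)(txt_input)

merged = layers.Concatenate(axis=-1)([img_feat, txt_feat])   # image + text per step
hidden = layers.LSTM(256)(merged)
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

model = tf.keras.Model([img_input, txt_input], next_word)
```

At inference time, such a decoder is typically called repeatedly, feeding each predicted word back into the caption prefix until an end token is produced.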
E. VGG16 – GRU – Bahdanau attention (VGBA)

Fig. 5. Architecture of “VGG16 - GRU - Bahdanau attention” method

Figure 5 illustrates the VGBA architecture, in which the Bahdanau attention mechanism allows the input and output sequences to concentrate on the most crucial elements, resulting in improved performance.

F. Xception – TT-LSTMs with Bi-LSTMs (XBi-LSTM)

Fig. 6. Architecture of “Xception - TT-LSTMs with Bi-LSTMs” method

The encoder-decoder structure in Figure 6 is used by the approach known as TT-LSTM [14], which is built using a combination of the merge and inject models. For both text and image, TT-LSTM suggests creating two sub-encoder models, which are then merged. The Xception network is utilized in the image encoder model. Bi-LSTMs are employed for the decoder, while LSTMs are applied to the language encoder.
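The sketch below illustrates one way such a two-encoder ("two-tier") arrangement can be wired in Keras: an image sub-encoder over Xception features, a language sub-encoder built from an embedding and an LSTM, their outputs merged, and a bidirectional LSTM decoding the next word. The layer sizes and exact wiring are assumptions for illustration, not the TT-LSTM [14] configuration.

```python
# Sketch of a merged two-encoder captioner with a Bi-LSTM decoder.
# Image sub-encoder: dense projection of Xception features, repeated per step.
# Language sub-encoder: embedding followed by an LSTM over the caption prefix.
# The two streams are merged and a bidirectional LSTM predicts the next word.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN, IMG_DIM = 5000, 34, 2048   # assumed values

img_in = tf.keras.Input(shape=(IMG_DIM,))
img_enc = layers.RepeatVector(MAX_LEN)(layers.Dense(256, activation="relu")(img_in))

txt_in = tf.keras.Input(shape=(MAX_LEN,))
txt_enc = layers.LSTM(256, return_sequences=True)(
    layers.Embedding(VOCAB_SIZE, 256)(txt_in))

merged = layers.Concatenate()([img_enc, txt_enc])            # merge the two encoders
decoded = layers.Bidirectional(layers.LSTM(256))(merged)     # Bi-LSTM decoder
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(decoded)

model = tf.keras.Model([img_in, txt_in], next_word)
```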
G. ResNet50 – BERT – Bahdanau attention (RBBA)

Fig. 7. Architecture of “ResNet50 - BERT - Bahdanau attention” method

This architecture utilizes the ResNet50 [2] and BERT [15] models, following the same approach as in the VGBA, to extract features from images and text, respectively. In this architecture, the Bahdanau attention mechanism is combined with LSTMs [3] to generate the final output. The detailed architecture is illustrated in Figure 7.
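For reference, the sketch below shows the standard formulation of Bahdanau (additive) attention [13] as a Keras layer: the decoder's hidden state is scored against each image feature vector with a small feed-forward network, and the softmax-weighted sum of the features becomes the context vector for the next decoding step. This is the textbook mechanism, not the paper's exact implementation.

```python
# Bahdanau (additive) attention: score(h_i, s) = v^T tanh(W1 h_i + W2 s),
# attention weights = softmax over scores, context = weighted sum of features.
import tensorflow as tf
from tensorflow.keras import layers

class BahdanauAttention(layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = layers.Dense(units)   # projects the image feature vectors
        self.W2 = layers.Dense(units)   # projects the decoder hidden state
        self.V = layers.Dense(1)        # collapses to one scalar score per location

    def call(self, features, hidden):
        # features: (batch, locations, feat_dim); hidden: (batch, hidden_dim)
        hidden_t = tf.expand_dims(hidden, 1)                  # (batch, 1, hidden_dim)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_t)))
        weights = tf.nn.softmax(scores, axis=1)               # (batch, locations, 1)
        context = tf.reduce_sum(weights * features, axis=1)   # (batch, feat_dim)
        return context, weights
```

In a typical attention decoder, the returned context vector is concatenated with the current word embedding before being fed to the LSTM or GRU cell at each step.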
IV. PERFORMANCE EVALUATION

A. Experiment setup

In this research, we utilized the Flickr8K dataset, which is a widely used and diverse dataset for image captioning tasks. The dataset consists of 8,000 high-quality images sourced from the popular online platform Flickr. These images cover a wide range of life themes, including captivating scenes of animals such as dogs and cats, and people engaging in various activities. The dataset also includes images depicting fun and entertainment activities, sports events, and daily life routines. The diversity of themes and subjects within the Flickr8K dataset makes it a suitable choice for training and testing image caption generator models. By incorporating such a diverse collection of images, we aimed to enhance the model’s ability to generate accurate and meaningful captions for a wide range of visual content.

To assess the performance of our model, we employed the BLEU metric [16], which is a widely used and established evaluation measure in the field of natural language processing. BLEU is commonly utilized to evaluate the quality of machine-generated text by comparing it to one or more reference translations or human-generated texts. It calculates a score ranging from 0 to 1, with a higher score indicating a better match between the generated text and the reference text. The BLEU metric takes into account factors such as n-gram precision and a brevity penalty. It considers both the presence and the ordering of words, thereby capturing the fluency and correctness of the generated captions. By employing the BLEU metric, we aimed to quantitatively evaluate the performance of our model and compare it with other methods in the field. The use of this standard metric allows for a fair and objective assessment of the quality of the captions generated by our model.
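As an illustration, corpus-level BLEU-1 through BLEU-4 scores of the kind reported in Table I can be computed with NLTK's corpus_bleu by adjusting the n-gram weights; this is a common evaluation recipe, not necessarily the exact script used in this work, and the captions below are made up.

```python
# Computing corpus-level BLEU-1..BLEU-4 for generated captions with NLTK.
# Each hypothesis is compared against all reference captions of its image.
from nltk.translate.bleu_score import corpus_bleu

references = [[["a", "dog", "runs", "through", "the", "grass"],
               ["a", "dog", "is", "running", "on", "the", "grass"]]]  # per-image refs
hypotheses = [["a", "dog", "is", "running", "in", "the", "grass"]]    # model output

bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0))
bleu2 = corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0))
bleu3 = corpus_bleu(references, hypotheses, weights=(1/3, 1/3, 1/3, 0))
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(bleu1, bleu2, bleu3, bleu4)
```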
We carefully selected specific parameter configurations to optimize our research outcomes. These include utilizing the Adam optimizer with a learning rate of 0.001 to guide the training process. For the cost function, we employed the cross-entropy loss, a commonly used measure for multi-class classification tasks. To initialize the LSTMs/GRUs, we applied the Glorot uniform initializer, which helps ensure effective information flow within the model. During training, we used a batch size of 32, which determines the number of samples processed in each iteration. The model was trained for a total of 10 epochs. To prevent overfitting and enhance generalization, we incorporated a dropout rate of 0.5, which randomly deactivates a portion of the neural network units during training. This regularization technique encourages the model to learn more robust and generalized features.
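These hyperparameters map directly onto a standard Keras training setup; a minimal, self-contained sketch is shown below. The stand-in model and random arrays are placeholders for illustration only, not the paper's training pipeline.

```python
# Training setup matching the stated hyperparameters: Adam (lr = 0.001),
# cross-entropy loss, Glorot uniform initialization and 0.5 dropout on the
# recurrent layer, batch size 32, and 10 epochs.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAX_LEN = 5000, 34                           # assumed values

# Minimal stand-in for a captioning decoder (see the earlier sketches for the
# full wiring); the LSTM uses the Glorot uniform initializer and 0.5 dropout.
model = tf.keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 256),
    layers.LSTM(256, kernel_initializer="glorot_uniform", dropout=0.5),
    layers.Dense(VOCAB_SIZE, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy")

# Dummy arrays standing in for the preprocessed training data.
X = np.random.randint(0, VOCAB_SIZE, size=(64, MAX_LEN))
y = tf.keras.utils.to_categorical(np.random.randint(0, VOCAB_SIZE, 64), VOCAB_SIZE)

model.fit(X, y, batch_size=32, epochs=10)                # batch size 32, 10 epochs
```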
B. Experiment results

Tables I and II present the experimental results of the various architectures for image captioning. Among these architectures, the VGBA model demonstrates exceptional performance in terms of both speed and accuracy. With a relatively short training time of 826.8 s per epoch, this architecture achieves the best scores in terms of BLEU-2 and BLEU-3 (Table I). The inclusion of Bahdanau attention yields a significant enhancement in the output, as it allows the model to comprehend the image context more accurately and consistently.

Table II illustrates these methods in action for captioning a sample image. Among them, the RBBA model stands out, as it describes the image with the highest level of detail and accuracy compared to the other methods. It effectively leverages the ResNet50 and BERT models, along with the inclusion of Bahdanau attention, resulting in more precise and comprehensive image captions.

Figure 8 shows the performance of the various methods in generating image captions, as measured by the BLEU scores. The bar chart clearly illustrates the distinct differences in performance among these methods. Notably, the Bahdanau attention mechanism again stands out as a particularly effective approach.
TABLE I
PERFORMANCE COMPARISON ON FLICKR8K DATASET

Methods                                Parameters    Training time (per epoch)   BLEU-1     BLEU-2     BLEU-3     BLEU-4
Inject Xception – Word2Vec             5,002,649     807.6 s                     0.364886   0.190213   0.121723   0.046814
Merge Xception – Word2Vec              5,002,649     732.8 s                     0.375879   0.196083   0.118743   0.042923
Inject InceptionResnetV2 – GloVe       5,512,165     3,387.1 s                   0.360748   0.201883   0.116986   0.062958
Merge InceptionResnetV2 – GloVe        5,315,557     2,096.8 s                   0.381128   0.221911   0.155406   0.069800
VGG16 – GRU – Bahdanau attention       4,872,345     826.8 s                     0.461538   0.339683   0.254815   0.039086
Xception – TT-LSTM with Bi-LSTM        8,517,273     3,139.8 s                   0.426407   0.265075   0.191381   0.089676
ResNet50 – BERT – Bahdanau attention   119,818,810   7,488.0 s                   0.532143   0.227003   0.175572   0.126316
TABLE II
CAPTION COMPARISON RESULTS OF METHODS FOR IMAGES
The Bahdanau attention mechanism has proven to be highly successful in generating accurate and contextually relevant image captions, as it allows the model to focus on different regions of the image while generating the corresponding captions.

Fig. 8. BLEU-score comparison on the Flickr8K dataset

V. CONCLUSIONS

In conclusion, the comparison of these methods sheds light on the advantages and trade-offs inherent in different combinations of models and attention mechanisms. The RBBA approach excels at generating accurate and descriptive captions, as demonstrated by its leading BLEU-1 and BLEU-4 scores and by the meaningfulness of the captions produced for the sample images. On the other hand, although the VGBA approach falls short of the RBBA approach in terms of BLEU-1 and BLEU-4 scores, it outperforms the leading model on BLEU-2 and BLEU-3. In terms of speed, the VGBA model is significantly faster than the RBBA model: the VGBA model trains one epoch in only 826.8 seconds, whereas the RBBA model takes up to 7,488 seconds.

REFERENCES

[1] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[3] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[4] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[6] M. Tanti, A. Gatt, and K. P. Camilleri, “Where to put the image in an image caption generator,” Natural Language Engineering, vol. 24, no. 3, pp. 467–489, 2018.
[7] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258.
[8] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[9] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, Inception-ResNet and the impact of residual connections on learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017.
[10] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[11] S. T. Dumais, “Latent semantic analysis,” Annual Review of Information Science and Technology (ARIST), vol. 38, pp. 189–230, 2004.
[12] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning. PMLR, 2015, pp. 2048–2057.
[13] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[14] P. P. Khaing et al., “Two-tier LSTM model for image caption generation,” International Journal of Intelligent Engineering & Systems, vol. 14, no. 4, 2021.
[15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[16] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.