List of Figures 3
List of Tables 5
1.4.2.4 Spatial and Channel-wise Attention..............................................19
1.4.2.5 Adaptive attention..........................................................................19
1.4.2.6 Self-attention..................................................................................20
1.4.2.7 Comparison between different attention mechanisms....................21
1.5 Related works.............................................................................................................21
1.5.1 Deep neural network-based image captioning...............................................22
1.5.2 Transformer-based image captioning............................................................23
1.5.3 Attention-based image captioning.................................................................25
1.6 Semantic in image captioning....................................................................................27
1.6.1 WordNet in Natural language processing......................................................27
1.6.1.1 WordNet..........................................................................28
1.6.1.2 Structure of WordNet....................................................................28
1.6.1.3 Semantic relations in WordNet.........................................................29
1.6.1.4 Semantic similarity based on WordNet..........................................29
1.6.2 BERT................................................................................................30
1.7 Conclusion..................................................................................................................31
2 Modeling Chapter 33
2.1 Introduction................................................................................................................33
2.2 Model architecture......................................................................................................33
2.2.1 CNN encoder.................................................................................................34
2.2.2 Transformer decoder......................................................................................35
2.3 Delve into WordNet Enhanced Multihead-attention..................................................37
2.3.1 WordNet Similarity Matrix............................................................................37
2.3.2 Algorithm to Calculate WordNet Similarity Matrix (WSM).........................38
2.4 Multi-Head Attention.................................................................................................38
2.5 WordNet Enhanced Multi-Head Attention.................................................................40
2.5.1 Query, Key, and Value Matrices (Q, K, V)...................................................40
2.6 Conclusion..................................................................................................................42
Bibliography 43
List of Figures
2.6 WordNet Enhanced Multi-Head Attention.......................................................40
2.7 Query, Key, and Value Matrices......................................................................................42
List of Tables
1.1 A comparative table of CNN, R-CNNs, Fast R-CNN, Faster-RCNN, VGG, and ResNet. 12
1.2 Comparative Table of RNN LSTM GRU and Transformer Wang et al. (2019) Jurafsky
and Martin (nd) Shiri et al. (2023)...................................................................................16
1.3 Comparative Table of different attention mechanisms....................................................21
1.4 Comparative table of image captioning methods.............................................................23
1.5 Summary of Transformer-Based Methods for Image Captioning...................................25
1.6 Summary of Attention-Based Methods for Image Captioning........................................27
1.7 Comparison of Different Semantic Similarity Measures.................................................30
Chapter 1
1.1 Introduction
The image captioning task involves automatically generating natural language descriptions for
images. However, producing captions that accurately capture semantic visual details poses
significant challenges. This crucial task aims to develop models capable of generating relevant
textual descriptions from visual content, bridging the gap between the visual and linguistic
worlds. This chapter provides a comprehensive overview of state-of-the-art attention-based
models for tackling the image captioning task. We begin by introducing the general context
of image captioning. We then survey key encoder-decoder neural network architectures that
have become dominant for this task. Next, we broadly introduce attention mechanisms and their
different types. Furthermore, we delve into the utilization of semantic resources, unraveling
their potential to enrich the semantic understanding of textual descriptions by examining
semantic types and measures of similarity. Finally, we introduce our image captioning
system, which integrates a CNN as the encoder with a Transformer as the decoder, further
enhanced by semantic knowledge.
1.2.1 Definition
Image captioning (IC) is a key research problem in multi-modal learning, which joins repre-
sentations across vision and language, where the model can automatically generate a textual
informative description of a given input image Mokady et al. (2021). This task requires recognizing
the relevant objects, their attributes, and their relationships in an input image. The model must
produce a correct sentence with proper syntactic and semantic structure Hossain et al. (2019).
As shown in Figure 1.1, the main goal of this process is to mimic human-level scene understanding
and generate corresponding textual captions that capture objects, relationships, and context.
Developing models with strong generalization abilities is a major challenge in computer vision.
Since its origin, image-to-text synthesis has been used in many fields, including machine vision,
remote security, healthcare, and remote sensing Sharma and Padha (2023).
Figure 1.2: Image Captioning models.
1.2.3.1 CNN
A trained Convolutional Neural Network detects objects and their bounding boxes inside images.
If a simple CNN is used for object detection, one method for detecting object-bounding boxes
is to place a grid above the image and process its individual cells. Images contain items
of varied shapes and sizes that can be located anywhere, so a grid with fixed cell sizes will
not produce desirable results. To address this issue, grids with variable cell sizes must be
utilized to detect objects in varying settings, which might be computationally intensive. To
solve this issue, Girshick et al. (2014) introduced Region-based CNNs.
1.2.3.2 R-CNNs
Region-based CNN networks use selective search to retrieve only 2000 regions from an
image. The selective search method creates numerous candidate regions for segmenting the input
image. These region proposals are merged recursively into larger regions, which are then
chosen as the final region suggestions. Since no actual learning is performed in the selective search
algorithm, it may generate inaccurate region proposals.
1.2.3.3 Fast R-CNN
Fast R-CNN is a related network introduced by Girshick (2015); R-CNN and Fast R-CNN
operate in remarkably similar ways. To create a convolutional feature map, the whole input image
is given to the CNN rather than the region proposals. The selective search technique is then applied
to this feature map to create region proposals. In contrast to R-CNN, which fed 2000 region
proposals to a CNN, Fast R-CNN is faster since it convolves the image only once and extracts a
feature map. However, it still employs the selective search algorithm, which affects network speed
and takes time.
1.2.3.4 Faster R-CNN
A novel approach was put forward by Ren et al. (2015), in which the network learns the region
proposals rather than using the selective search algorithm. A CNN receives an image as input and
extracts a convolutional feature map, much like Fast R-CNN does. To predict region proposals,
an additional network is utilized in place of the selective search algorithm. Object detection is
thus simplified with Faster R-CNN, which is faster than both R-CNN and Fast R-CNN.
1.2.3.5 VGG
The Visual Geometry Group (VGG) network is an efficient and classical CNN design proposed
by Simonyan and Zisserman (2014) at the University of Oxford. VGG has two models, VGG16
and VGG19, with 16 and 19 weight layers, respectively. It is characterized by the use of small
3x3 convolutional filters and a deep architecture stacked in increasing depth; reducing volume
size is handled by max pooling. Two fully-connected layers, each with 4,096 nodes, are then
followed by a softmax classifier. These small filters achieve a receptive field comparable to
larger filters while decreasing the number of parameters, thereby reducing computational
complexity. VGG's main shortcoming is its excessive computational cost, due to its roughly
140 million parameters. Figure 1.3 shows the structure of the network.
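As a concrete illustration, the sketch below extracts VGG16 features with torchvision by dropping the final classification layer; the weight choice, the image file name, and the decision to keep the 4,096-dimensional fully-connected output are assumptions made only for this example, not the exact setup used later in this work.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Load a pretrained VGG16 and drop the final classifier layer so the
# 4,096-dimensional fully-connected activations serve as the image representation.
vgg16 = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg16.classifier = torch.nn.Sequential(*list(vgg16.classifier.children())[:-1])
vgg16.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# "example.jpg" is a placeholder image path.
image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    features = vgg16(image)
print(features.shape)   # torch.Size([1, 4096])
```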
1.2.4 ResNet-101
ResNet-101 is a deep residual neural network architecture introduced by He et al. (2016),
which uses residual connections to mitigate vanishing gradients in deep CNNs. It is a variant
of the Residual Network (ResNet) architecture, characterized by the use of residual connections,
which are network shortcuts that skip one or more layers. With 101 layers, the ResNet-101
architecture was trained on the 14 million images and 1000 distinct classes of the ImageNet
dataset. It attained state-of-the-art results on the ImageNet classification task in 2016 and has
since been extensively used as a backbone model for several computer vision applications,
including object detection and semantic segmentation. The figure shows a building block of
ResNet-101.
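Similarly, a ResNet-101 backbone can be turned into an encoder that outputs a grid of region features for a captioning decoder to attend over; the sketch below uses torchvision, and the weight choice and input size are illustrative assumptions.

```python
import torch
from torchvision import models

# Keep everything up to the last convolutional block of ResNet-101 and drop the
# average pooling and classification head, so the encoder returns a spatial grid
# of features (2048 x 7 x 7 for a 224 x 224 input) rather than class scores.
resnet = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V2)
encoder = torch.nn.Sequential(*list(resnet.children())[:-2])
encoder.eval()

dummy_image = torch.randn(1, 3, 224, 224)     # placeholder batch of one image
with torch.no_grad():
    feature_map = encoder(dummy_image)        # shape: (1, 2048, 7, 7)

# Flatten the grid into a sequence of 49 region vectors for an attention decoder.
regions = feature_map.flatten(2).permute(0, 2, 1)   # (1, 49, 2048)
print(regions.shape)
```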
1.2.4.1 Comparison
This section presents a comparative analysis of six feature extraction methods used in image
captioning: traditional CNN, R-CNN, Fast R-CNN, Faster R-CNN, VGG, and ResNet. We
explore the advantages and limitations of each model.
Models | Advantages | Limitations
CNN | Fast and efficient feature extraction; effective for simple images | May not capture detailed features in complex images
Fast R-CNN | Faster than traditional CNNs due to shared convolutional features | May not capture detailed features; limited in handling complex images
Faster R-CNN | Faster detection than Fast R-CNN | May not capture detailed features; limited in handling complex images
ResNet-101 | Effective for complex and large images; mitigates the vanishing gradient problem; high accuracy and state-of-the-art performance; robust to overfitting and noise | High computational cost and memory usage; may not perform well on simple images

Table 1.1: A comparative table of CNN, R-CNNs, Fast R-CNN, Faster R-CNN, VGG, and ResNet.
Most image captioning systems use an encoder-decoder architecture originally designed for machine
translation Sutskever et al. (2014), Cho et al. (2014a), and some researchers adopted it for image
captioning. The idea is that an input image is encoded into a feature vector Ji et al. (2021), typically
with a CNN architecture as shown in Figure 1.5, and a decoder borrowed from NLP language models,
for example an RNN architecture, then translates that vector into words Sharma and Padha (2023).
Figure 1.5: General structure of image captioning methods based on the encoder-decoder
Gómez Martínez (2019).
1.3.1 Recurrent Neural Networks
A Recurrent Neural Network (RNN) is a type of artificial neural network Simon (2009) with
internal memory for remembering past inputs, making it useful for tasks such as image captioning
by converting a static input into a sequence. Unlike feed-forward neural networks, an RNN uses
the current input and the learned output from the previous step to process input sequences. In
other words, RNNs have a more complicated structure in which neurons within a layer are
interconnected and allow for feedback, resulting in information flowing in cycles. Figure 1.4
illustrates the RNN process. However, despite their utility in applications characterized by
interrelated inputs, RNNs are beset by inherent limitations. One prevalent issue is the vanishing
gradient problem. Moreover, RNNs are unable to process long sequences and are difficult to
train. This limitation can lead to the loss of crucial information from the beginning of the
sequence.
Long short-term memory (LSTM) Hochreiter and Schmidhuber (1997) is an improved version of
RNN networks designed to resolve long-term dependency problems. An LSTM is mostly composed
of four components: input gate, forget gate, cell, and output gate; its specific structure is shown in
Figure 1.5. These gates can learn which data in a sequence is essential or must be ignored, and in
turn they store important information throughout the sequence. LSTMs have been applied to many
difficult problems, such as recognition, language modeling, translation, speech synthesis, and video
analysis, because they efficiently capture long-term temporal dependencies without such optimization
issues. They are widely used in image captioning to generate textual data and extract features. Even
though LSTMs have shown encouraging results across a range of tasks, they might have trouble
understanding input structures that are more intricate than a sequential format. Also, because of
their memory cells, they require a lot of storage space.
Figure 1.7: The structure of the LSTM network Tao et al. (2021).
The Gated Recurrent Unit (GRU) is another RNN variant, introduced by Cho et al. (2014b), which
addresses the short-term memory issue and offers a simpler structure than LSTM Abbaspour et al.
(2020), with two gates, update and reset, to control the flow of information into and out of memory,
improving computational efficiency and facilitating training. Figure 1.6 illustrates the overview of
a GRU. GRU networks are an effective tool for modeling sequential data, particularly in cases
where computational resources are limited or a simpler architecture is required. However, they
have limitations such as a slow convergence rate and low learning efficiency on complex time
series data, and they require careful consideration when selecting the right model for a given task.
Figure 1.8: The structure of the GRU network Zhiwei et al. (2022).
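To illustrate the difference between the two recurrent variants in practice, the following minimal sketch runs the same toy sequence through PyTorch's LSTM and GRU modules; the sizes are arbitrary and chosen only for the example.

```python
import torch
import torch.nn as nn

# Toy sequence: batch of 2 sequences, 5 time steps, 16-dimensional inputs.
x = torch.randn(2, 5, 16)

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
gru = nn.GRU(input_size=16, hidden_size=32, batch_first=True)

lstm_out, (h_n, c_n) = lstm(x)   # LSTM keeps a hidden state and a cell state
gru_out, h_gru = gru(x)          # GRU keeps only a hidden state

# The GRU's two gates (update, reset) give it roughly 3/4 of the LSTM's
# parameters for the same hidden size, which is why it is often cheaper.
count = lambda m: sum(p.numel() for p in m.parameters())
print(lstm_out.shape, gru_out.shape)                 # both (2, 5, 32)
print("LSTM params:", count(lstm), "GRU params:", count(gru))
```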
1.3.2 Transformer
The transformer is a crucial deep learning model architecture commonly used in NLP tasks,
introduced in the paper "Attention Is All You Need" by Vaswani et al. (2017). This architecture
completely changed the field by including attention methods that enable the model to
simultaneously focus on pertinent segments of the input sequence. The transformer consists of
an encoder and a decoder Vaswani et al. (2017), both containing multiple layers of self-attention
and feed-forward neural networks, which enables the transformer to process input sequences
much more efficiently. The encoder comprises layers responsible for processing the input
iteratively, one layer after another, while the decoder layers receive the encoder output to
generate the decoded output. As shown in Figure 1.7, the input and output sequence embeddings
are added with positional encodings before being fed into the encoder and the decoder, which
stack modules based on self-attention. The transformer encoder is a stack of six identical layers,
with two sub-layers in each layer: a multi-head self-attention mechanism, which enables the
encoder to attend to different parts of the input sequence simultaneously, and a position-wise
fully connected feed-forward network, which processes the output of the self-attention layer to
extract higher-level features from the input. Around each sub-layer, a residual connection is
used, followed by layer normalization Vaswani et al. (2017). The decoder is also composed of a
stack of six identical layers with three sub-layers: a self-attention mechanism, a feed-forward
neural network similar to the encoder, and an additional third sub-layer that performs attention
over the output of the encoder Hernández and Amigó (2021).
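As an illustration of this stacked structure, the sketch below builds a six-layer encoder and decoder with PyTorch's built-in Transformer layers; token embeddings and positional encodings are omitted, and the dimensions and sequence lengths are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, n_heads, n_layers = 512, 8, 6

# Six-layer encoder and decoder; each layer contains multi-head attention,
# a position-wise feed-forward network, residual connections and layer norm.
encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=2048)
decoder_layer = nn.TransformerDecoderLayer(d_model, n_heads, dim_feedforward=2048)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=n_layers)

src = torch.randn(10, 2, d_model)   # (source length, batch, d_model)
tgt = torch.randn(7, 2, d_model)    # (target length, batch, d_model)

# Causal mask so each target position only attends to earlier positions.
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(0))

memory = encoder(src)                       # encoder output fed to the decoder
out = decoder(tgt, memory, tgt_mask=tgt_mask)
print(out.shape)                            # torch.Size([7, 2, 512])
```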
1.3.3 Comparison of models
This section delves into a comparative analysis of four prominent approaches: Recurrent
Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, The Gated
Recurrent Unit (GRU), and Transformers. We will explore the advantages and limitations of
each model.
Table 1.2: Comparative Table of RNN LSTM GRU and Transformer Wang et al. (2019) Jurafsky
and Martin (nd) Shiri et al. (2023).
Another way to classify attention mechanisms is to distinguish between "soft" and "hard"
attention. This distinction was made in the paper Show, Attend and Tell Xu et al. (2015),
depending on whether the attention can view the complete image or only a patch.
In soft attention, the alignment weights are learned and placed gently over every patch in
the source image, allowing the model to attend to multiple elements simultaneously. Soft
attention is smooth and differentiable, which makes it easier to optimize using gradient-based
techniques AI (2023), but expensive when the source input is large.
In hard attention, a subset of input elements is selected to focus on while the rest is ignored
AI (2023). It works by learning to discern which input pieces are most pertinent to a particular
task. Hard attention is cheaper at inference time but non-differentiable, which impedes
gradient-based optimization techniques such as back-propagation. Researchers frequently
employ methods like surrogate gradients and reinforcement learning to overcome this Shen et
al. (2018). Figure 1.10 illustrates soft and hard attention with the visual example from the
original paper.
Figure 1.10: Comparison of soft (top) vs. hard attention (bottom) during caption generation
Xu et al. (2015).
1.4.2.2 Global and Local attention
Luong et al. (2015) coined the local and global attention distinction. Global attention is similar
to soft attention, while local attention is an interesting blend of soft and hard attention. Global
attention mechanisms focus on all elements of the input sequence when computing attention
weights. The model calculates a context vector by taking a weighted sum of all input elements,
where the weights are determined by the attention scores AI (2023). These mechanisms facilitate
the understanding and generation of complex structures by allowing the model to take into
account long-range dependencies and overall context. Local attention mechanisms focus on a
smaller, localized region of the input sequence when computing attention weights Shen et al.
(2018). In contrast to processing all input elements simultaneously, the model selects a specific
region or window surrounding the current position. Within this localized context, attention
weights are computed, focusing solely on the pertinent information contained within that region.
This approach proves advantageous, particularly in scenarios where relevant data is confined to
specific, nearby contexts, such as in the analysis of time series data or text with pronounced local
patterns. Moreover, local attention offers computational benefits over global attention by
minimizing the number of attention computations required. This computational efficiency
renders local attention well-suited for tasks involving extensive input sequences or constrained
computational resources.
Figure 1.11: Comparative of global (left) and local (right) attention in seq2seq model Luong
et al. (2015).
The Bottom-up and Top-down attention mechanisms have been extensively used in image
captioning to enable deeper image understanding by extracting visual features over all regions to
generate captions.
The top-down approach Bahdanau et al. (2014) Xu et al. (2015) Karpathy and Fei-Fei (2015)
Donahue et al. (2015) begins from a simple representation of an image by extracting visual
features first to generate a caption.
In the bottom-up approach Li et al. (2019) Fang et al. (2015) Farhadi et al. (2010), visual
concepts are extracted first and combined later to generate a caption.
These approaches have limitations, such as difficulty dealing with fine details of images. To
address this limitation, Anderson et al. (2017) proposed combining both the Bottom-Up and the
Top-Down attention mechanisms for better captions. The bottom-up mechanism proposes image
regions, each with an associated feature vector, while the top-down mechanism determines
feature weightings. Figure 1.10 shows an example of the generation of a
sequence representation for an input image using the top-down and bottom-up approaches
followed by a Transformer network for generating captions.
Figure 1.12: generation of a sequence representation for an input image using the top-down
and bottom-up techniques followed by a Transformer network for generating the textual
description Parameswaran and Das (2018).
1.4.2.4 Spatial and Channel-wise Attention
Spatial and channel-wise attention plays a crucial role in improving the performance and
interpretability of deep learning models, especially in tasks that involve processing visual data
and generating textual descriptions, like image captioning.
The spatial attention mechanism focuses on recognizing and identifying the relationship
between objects in the input image Nguyen et al. (2023). It helps the model identify these objects,
their actions, and the relationship between them to generate the describing sentence.
The channel-wise attention can be viewed as selecting semantic attributes based on the
demand of the sentence context. A channel-wise feature map is a detector response map for a
given filter Chen et al. (2017). It enables the model to concentrate on specific semantic
elements in the image, improving its capacity to extract important information for the task.
We can calculate attended features using two methods based on these two mechanisms: the
first is spatial-channel, where spatial attention modifies visual features for channel-wise
attention, and the second is channel-spatial, where channel-wise attention modulates visual
features for spatial attention Chen et al. (2017).
1.4.2.5 Adaptive attention
Adaptive attention integrates spatial and channel-wise attention mechanisms to form a more
comprehensive and adaptive attention model. It was introduced by Chen et al. (2017), who
proposed a unified SCA-CNN framework to effectively integrate spatial, multi-layer, and
channel-wise attention in CNN features for image captioning. The idea behind adaptive attention
is to find a way for the model to know when it should focus on visual features and when it should
focus on textual features for caption generation. In particular, this approach contributes to a
better understanding of how CNN features evolve during sentence generation.
1.4.2.6 Self-attention
Figure 1.13: The scaled dot-product attention and the multi-head attention Vaswani et al. (2017).
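To make the figure concrete, the following is a minimal sketch of scaled dot-product attention as defined by Vaswani et al. (2017); the tensor shapes are arbitrary and chosen only for the example.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)    # query-key similarities
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                   # attention distribution
    return weights @ V, weights

# Toy example: 4 query/key/value vectors of dimension 8.
Q = torch.randn(1, 4, 8)
K = torch.randn(1, 4, 8)
V = torch.randn(1, 4, 8)
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)   # (1, 4, 8) (1, 4, 4)
```

In multi-head attention, this operation is applied in parallel over several linear projections of Q, K, and V, and the results are concatenated, as depicted on the right of the figure.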
1.4.2.7 Comparison between different attention mechanisms
This section delves into a comparative analysis of several attention mechanisms, highlighting
their advantages and drawbacks. These mechanisms can be generally classified based on their
approach to processing information, including soft and hard attention, global and local
attention, bottom-up and top-down attention, spatial and channel-wise attention, adaptive
attention, self-attention, and multi-head attention.
1.5.1 Deep neural network-based image captioning
The neural network-based models are inspired by the success of deep neural networks in machine
learning tasks and are used in an encoder-decoder architecture to create image captions and
model language.
The first to use the encoder-decoder framework to generate descriptions of images were Kiros
et al. (2014). They used a CNN to represent an image and joined image-text attributes to create
a multimodal neural language model; as in language translation, the image caption can be
constructed word by word. Their model can generate image captions without using any default
template or structure, making it more flexible. However, it was unable to pick up latent
representations of the interactions between the objects in the image. Vinyals et al. (2015) propose
a similar method to Kiros et al. (2014), which trains an LSTM based on maximum likelihood
estimation. A key challenge with this framework is the vanishing gradient problem; also, the
caption tends to lose relevance when generating long sequences. Motivated by these achievements,
some researchers proposed methods to augment the framework by including high-level semantic
concepts. Jia et al. (2015) introduced an extension of LSTM (gLSTM): they extract semantic
information from images, and this information is included as an extra input to each gate and cell
state of the LSTM network. Wu et al. (2017) incorporated high-level concepts into the encoder-
decoder, trained a CNN classifier for each attribute, fed the decoder LSTM with the prediction
vector, and integrated external knowledge to improve the overall performance of vision-to-language
tasks. Wang et al. (2018) designed an end-to-end bidirectional LSTM model (Bi-LSTM). This
model used two LSTMs to exploit both long-term backward and forward contexts. By combining a
deep CNN and two separate LSTM networks, the proposed model can learn long-term visual-
language interactions by using history and future context information in a high-level semantic
space. Gao et al. (2019) proposed the hierarchical LSTM with adaptive attention (hLSTMat)
approach, which improves image and video captioning performance by leveraging both low-level
visual information and high-level language context information. Chu et al. (2020) designed an
Automatic Image Captioning model based on ResNet50 and LSTM with Soft Attention, where
ResNet50 encodes images into graphical features and an LSTM decoder, integrated with a soft
attention model, generates descriptive sentences. Additionally Mallick and Naik (2021)
proposed a GRU-based model with attention mechanisms, which improves image captioning
precision and training speed over LSTM decoders. Ahmad et al. (2022) presented an image
captioning algorithm utilizing a CNN-GRU encoder-decoder framework, enhancing accuracy
and reducing time complexity by considering the semantic context. Zhang et al. (2022)
proposed a method using Bi-directional LSTM (Bi-LSTM) with a subsidiary attention
mechanism (S-Att), which captures both past and subsequent information to extract semantic
details and predict image content based on context clues. These advancements collectively
push the boundaries of image captioning through the integration of deep learning techniques
and attention mechanisms.
Author(s) | Encoder | Decoder | Attention Type | Technique & Solution | Dataset | Metrics
Kiros et al. (2014) | VGG-16 | LSTM | - | Combined image-text attributes in a multimodal neural language model | Flickr8k, Flickr30k | Recall
Vinyals et al. (2015) | GoogLeNet | LSTM | - | Encoder-decoder framework with LSTM trained by maximum likelihood estimation | PASCAL 1K, Flickr30k, MS-COCO | METEOR, CIDEr
Jia et al. (2015) | CNN | gLSTM | - | gLSTM including semantic information as extra input to LSTM gates and cell states | MSCOCO, Flickr8k, Flickr30k | BLEU, METEOR
Wu et al. (2017) | VGGNet | LSTM | - | Incorporated high-level concepts into the encoder-decoder, trained a CNN classifier for each attribute, decoder LSTM with prediction vector and incorporation of external knowledge | MS COCO, Flickr30k | BLEU, METEOR, CIDEr, sentence perplexity
Wang et al. (2018) | AlexNet, VGG-16 | Bi-LSTM | - | Used two LSTMs to exploit both long-term backward and forward contexts for better visual-language interactions | MSCOCO, Flickr8K, Flickr30k, Pascal1K | BLEU, METEOR, CIDEr
Gao et al. (2019) | ResNet-101 | hLSTM | Adaptive Attention | Hierarchical LSTM with adaptive attention, leveraging low-level visual and high-level language context information | MSCOCO, Flickr30k | BLEU, METEOR, ROUGE, CIDEr, SPICE
Chu et al. (2020) | ResNet50 | LSTM | Soft Attention | Automatic image captioning with a ResNet50 encoder and an LSTM decoder with soft attention for generating descriptive sentences | MSCOCO | BLEU, METEOR, CIDEr
Mallick and Naik (2021) | Inception V3 | GRU | Soft Attention | GRU-based model with Bahdanau attention for improved image captioning precision and training speed over LSTM decoders | MSCOCO | BLEU
Ahmad et al. (2022) | CNN | GRU | - | CNN-GRU encoder-decoder framework, enhancing accuracy and reducing time complexity by considering the semantic context | MSCOCO | BLEU, ROUGE, METEOR, CIDEr
Zhang et al. (2022) | ResNet-101 | Bi-LSTM | Subsidiary Attention (S-ATT) | Bi-directional LSTM with subsidiary attention mechanism, capturing both forward and backward information for context-aware image captioning | MSCOCO | BLEU, ROUGE, METEOR, CIDEr

Table 1.4: Comparative table of image captioning methods.
1.5.2 Transformer-based image captioning
Cornia et al. (2020) proposed a meshed-memory transformer architecture: at the decoding level,
the model employs a mesh design to account for both high-level and low-level semantics. The
complex grid system accurately generates captions by deciphering the visual context. Additionally,
Guo et al. (2020) developed a two-factored decoder architecture and focused on the self-attention
notion and a geometry-aware self-attention network to make the transformer model more flexible.
Recent research continues to push the boundaries of image captioning with transformers. Liu et al.
(2021) proposed a Full Transformer Network for Image Captioning, which takes the sequentialized
raw images as the input to the Transformer; the model is convolution-free and can model global
context at every encoder layer from the beginning. Additionally, Wang et al. (2022) proposed an
End-to-End Transformer Based Model for Image Captioning, a pure Transformer-based model.
Firstly, they adopted a SwinTransformer as a backbone encoder to extract grid-level features from
given images. Then, following the Transformer, the refining encoder refines the grid features by
capturing the intra-relationship among them, and the decoder decodes the refined features into
captions word by word. Zheng and Pun (2022) proposed a Hybrid-Spatial Transformer for Image
Captioning: they combined the global and local information of the image, extracted by VGG16
and Faster R-CNN respectively, as input to the encoder. This combination of features enhances
the performance of the model in image captioning tasks. Lu et al. (2023) presented the Full-Memory
Transformer, which includes an encoder with Full Layer Normalization (Full-LN) and a decoder
with a Memory-Augmented Network (MAN), aiming to reduce losses in image feature transfer
and generate more accurate captions.
Author Encoder Decoder Attention Method Dataset Metric
Type
Yu et al. Faster Transformer Self- Used a multimodal trans- MS BLEU,
(2019) RCNN Attention former as a decoder for im- COCO METEOR,
age captioning CIDEr,
Rouge
Li et al. CNN Transformer ETA Entangled Transformer to MS BLEU,
(2019) bridge visual and textual COCO METEOR,
se- mantic gap CIDEr
Cornia et al. Faster Transformer Self- Meshed memory-based ar- MS BLEU,
(2020) R-CNN Attention chitecture to account for COCO METEOR,
ResNet-101 high-level and low-level se- CIDEr,
mantics Rouge,
Spice
Guo et al. Faster Transformer G-ASA Two-factored decoder archi- MS BLEU,
(2020) RCNN tecture with geometrically COCO METEOR,
aware self-attention CIDEr,
Rouge,
Spice
Liu et al. Transformer Transformer Self- Full Transformer Network MS BLEU,
(2021) Attention that is convolution-free COCO METEOR,
(CPTR) CIDEr,
Rouge
Wang et al. Swin- Transformer Self- Pure Transformer model MS BLEU,
(2022) Transformer Attention with SwinTransformer back- COCO METEOR,
bone and grid-level feature CIDEr
extraction
Zheng and VGG16, Transformer Spatial Hybrid-Spatial Transformer MS BLEU,
Pun (2022) Faster combining global and local COCO METEOR,
RCNN information as encoder in- CIDEr
put
Lu et al. Transformer Transformer MAN Full-Memory Trans- MS BLEU,
(2023) former with Full Layer COCO CIDEr
Normalization and Memory-
Augmented Network for
decoding
Figure 1.14: Structure of image captioning method based on the encoder-decoder architecture
with Attention mechanism Gómez Martínez (2019).
1.5.3 Attention-based image captioning
Xu et al. (2015) were the first to introduce attention to the problem of automatic image
captioning. They propose an attentive encoder-decoder model able to dynamically attend to
salient image regions during the process of generating image descriptions. Different attention
mechanisms can be used with various functional forms; Xu et al. (2015) tried two different
methods to simulate attention: stochastic hard attention and deterministic soft attention. These
methods can work on fixed spatial locations in the image, yet a drawback stands out: they
utilize image features from a lower CNN layer, which may fail to capture high-level information.
Pedersoli et al. (2017) proposed area-based attention mechanisms, whose models allow a direct
association between the generated caption words and the image regions.
Anderson et al. (2018) proposed to combine both the Bottom-Up and the Top-Down attention
mechanisms for caption generation. In the Bottom-up they proposed a set of image regions
represented by pooled convolutional feature vectors and in the Top-Down they calculated
attention weights over these regions during generation. Attention-based methods have shown
good efficiency in image captioning and other computer vision tasks; however, the introduction
of transformers marked another leap forward. Huang et al. (2019) presented an attention-on-
attention (AoA) module to model the relevance between the attention result and the attention
query. Based on the attention result and the query, it first generates an information vector and
an attention gate, then applies another attention operation to them to obtain the expected
knowledge. Pan et al. (2020) proposed a bilinear attention module, which uses spatial and
channel-wise bilinear attention distributions to capture the interactive information between
object features, incorporating relative position data through the multi-head attention mechanism.
Fei (2022) presented an attention-aligned transformer for captioning images that directs attention
learning without the need for annotations, in a perturbation-based, self-supervised manner. More
specifically, they used a learnable network to apply a mask operation on picture regions to
estimate the true function during the final description generation process. They hypothesized
that more attention should be paid to the essential image region features, where even a slight
disturbance degrades performance.
A Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning
was proposed by Dubey et al. (2023). In this work, they investigated two unexplored ideas
for image captioning using transformers: First, they demonstrated the enforcement of using
objects' relevance in the surrounding environment. Second, they learned an explicit association
between labels and language constructs. The proposed technique acquires a proposal of
geometrically coherent objects using a deep neural network (DNN) and generates captions by
investigating their relationships using a Label Attention Module (LAM), which in turn
associates the extracted object classes to the available dictionary using self-attention layers.
After that, a Geometric Coherence Proposal (GCP) Module captures the localized ratio of objects
by their size and position to acquire geometrically coherent objects. This helps the model
better understand the spatial relationships between objects.
Author | Encoder | Decoder | Attention | Method | Dataset | Metrics
Xu et al. (2015) | CNN | LSTM | Stochastic Hard, Deterministic Soft | Attentive encoder-decoder model to dynamically attend to salient image regions during caption generation | MS COCO | BLEU, METEOR, CIDEr
Pedersoli et al. (2017) | VGG16 | GRU | Area-Attention | Direct association between generated words and image regions | MS COCO | BLEU, METEOR, CIDEr
Anderson et al. (2018) | Faster R-CNN | LSTM | Bottom-Up and Top-Down | Combines Bottom-Up attention (image regions as pooled convolutional features) and Top-Down attention (weights over these regions during generation) | MS COCO | BLEU, CIDEr, SPICE
Huang et al. (2019) | R-CNN | LSTM | Attention on Attention (AoA) | Models the relevance between attention result and attention query with an information vector and attention gate, followed by another attention operation | MS COCO | BLEU, METEOR, CIDEr
Pan et al. (2020) | Faster R-CNN | LSTM | X-linear | Uses spatial and channel-wise attention to capture interactive information between object features, incorporates relative position data | MS COCO | CIDEr
Fei (2022) | Faster R-CNN with ResNet-101 | Transformer | Attention Alignment | Attention-aligned transformer that directs attention learning in a self-supervised manner using perturbation-based mask operations on image regions | MS COCO | BLEU, METEOR, CIDEr, ROUGE, SPICE
Dubey et al. (2023) | Faster R-CNN with ResNet-101 | Transformer | Self-Attention, Label-Attention and Memory-Augmented Attention | Label-Attention Transformer that enforces objects' relevance and learns explicit associations between labels and language constructs; uses a GCP Module to capture spatial relationships | MS COCO | BLEU, METEOR, CIDEr, ROUGE

Table 1.6: Summary of Attention-Based Methods for Image Captioning.
1.6.1.1 WordNet
WordNet is a semantic lexicon database for the English language, created and maintained
since 1986 at the Cognitive Science Laboratory of Princeton University Fellbaum (1998).
WordNet provides brief, generic descriptions, records the different semantic linkages between
words, and classifies English words into sets of synonyms called synsets. WordNet consists of
words, and classifies English words into sets of synonyms called synsets. WordNet consists of
words classified into four syntactic categories: nouns, verbs, adjectives, and adverbs. These
words are grouped into synsets, which represent a word’s meaning through a set of synonyms.
For example, a synset for the word ’car’ can be defined by the list of synonyms: auto, railcar,
gondola, elevator car, cable car as shown in Figure 1.12.
Figure 1.15: Illustration of information provided by WordNet for the term "Car"
The structure of WordNet is made up of words and synsets linked together by conceptual-
semantic links; it consists of words, senses, and synsets. For example, the nouns president,
chairman, chair, and chairperson form a synset. Figure 1.13 below best describes the structure of
WordNet.
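As a concrete illustration, the synsets of a word can be queried with NLTK's WordNet interface; the word "car" below mirrors the example above, and the download call is only needed once.

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)   # fetch the WordNet data once

# All noun synsets containing the lemma "car"; each synset groups synonymous
# lemmas and carries a short gloss (definition).
for synset in wn.synsets("car", pos=wn.NOUN):
    print(synset.name(), "->", synset.lemma_names())
    print("  gloss:", synset.definition())
```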
1.6.1.3 Semantic relations in WordNet
WordNet is structured as a semantic net where lexical relations interlink words and synsets
are connected by semantic relations Miller (1995). Synonymy and antonymy are the major
lexical relations in WordNet. Semantic relations serve to build knowledge structures; depending
on the syntactic category, different semantic relations are available to build those structures.
• Meronymy: indicates a part relation and is semantically diverse, present for some nouns.
Its inverse, holonymy (whole-name), is also a complex semantic relation. WordNet
distinguishes parts, substantive parts, and member parts. For example, "hand" is a
meronym of "body" and "body" is a holonym of "hand".
• Troponymy: is a manner relation for verbs; however, the resulting hierarchies are much
shallower.
1.6.1.4 Semantic similarity based on WordNet
WordNet is a widely used lexical database that organizes English words into synsets, providing a
structured framework for capturing semantic relationships between these synsets. This hierarchical
organization makes WordNet a valuable resource for measuring semantic similarity between words
and concepts. The main goal of calculating the similarity degree between words is to facilitate the
construction of proximity equations with linguistic criteria Julian-Iranzo and Sáenz-Pérez (2021).
Similarity measures are limited to noun and verb pairs because WordNet organizes nouns and verbs
into hyponymy/hypernymy-based hierarchies of concepts. Some hybrid techniques have also been
proposed to measure the similarity between sentences. All the measures can be grouped into four
classes: path length-based measures, information content-based measures, feature-based measures,
and hybrid measures. The following table provides a summary of these similarity measures.
Category | Measure | Description | Advantages | Limitations
Path-based Measures | Shortest Path | Calculates similarity based on the length of the shortest path between two concepts. | Simple to compute; intuitive interpretation | Does not consider concept positions in the taxonomy; treats all edges equally
Path-based Measures | Wu and Palmer (1994) | Considers the position of concepts relative to the least common subsumer (LCS). | Incorporates depth of concepts; performs better than the shortest path | Requires computing the LCS
Path-based Measures | Leacock and Chodorow (1998) | Considers the maximum depth of the taxonomy. | Simple formula; considers the maximum depth of the taxonomy | Treats all edges equally
Path-based Measures | Li et al. (2003) | Combines the shortest path and the depth of concepts in a non-linear function. | Captures both path length and depth; non-linear function provides more flexibility | Requires manual tuning of parameters alpha and beta
Information Content-based Measures | Resnik (1995) | Based on the information content (IC) of the least common subsumer (LCS). | Simple to compute; considers the information content of the LCS | Does not consider the information content of the individual concepts
Information Content-based Measures | Lin et al. (1998) | Combines the IC of the LCS with the IC of the individual concepts. | Considers the information content of both the LCS and the individual concepts | Requires computing the IC of individual concepts
Information Content-based Measures | Jiang and Conrath (1997) | Calculates semantic distance to obtain semantic similarity. | Considers the information content of both the LCS and the individual concepts; provides a distance-based measure | Requires computing the IC of individual concepts
Feature-based Measures | Tversky (1977) | Feature-based measure considering common and non-common features of concepts. | Independent of the taxonomy structure; flexible and applicable | Needs manual tuning of the parameter k; cannot work well without a complete feature set; computational complexity
Hybrid Measures | Zhou et al. (2008) | Combines path-based and information content-based approaches. | Combines the strengths of path-based and IC-based measures; provides a flexible tradeoff between the two approaches | Requires computing the IC of individual concepts; needs manual tuning of the parameter k

Table 1.7: Comparison of Different Semantic Similarity Measures.
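For illustration, most of the measures summarized above are available through NLTK's WordNet interface; the synset choices (dog.n.01, cat.n.01) and the Brown-corpus information content file used below are assumptions made only for this example.

```python
import nltk
from nltk.corpus import wordnet as wn, wordnet_ic

nltk.download("wordnet", quiet=True)
nltk.download("wordnet_ic", quiet=True)

dog, cat = wn.synset("dog.n.01"), wn.synset("cat.n.01")
brown_ic = wordnet_ic.ic("ic-brown.dat")   # information content from the Brown corpus

print("path:", dog.path_similarity(cat))            # shortest-path based
print("wup: ", dog.wup_similarity(cat))             # Wu and Palmer (1994)
print("lch: ", dog.lch_similarity(cat))             # Leacock and Chodorow (1998)
print("res: ", dog.res_similarity(cat, brown_ic))   # Resnik (1995)
print("lin: ", dog.lin_similarity(cat, brown_ic))   # Lin et al. (1998)
print("jcn: ", dog.jcn_similarity(cat, brown_ic))   # Jiang and Conrath (1997)
```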
1.6.2 BERT
Bidirectional Encoder Representations from Transformers (BERT) Devlin et al. (2018), from
Google AI, is a language representation model that uses pre-training and fine-tuning (transfer
learning) to create state-of-the-art (SOTA) models for a wide range of tasks. BERT is initially
trained on a large corpus of text in an unsupervised manner, enabling it to generate high-quality
word representations, accurately predict words within a context, and infer relationships between
sentences. The idea is to teach the model to predict missing words in a sentence: part of the words
in the sentence are replaced with a special token (MASK), and the task of the model is to predict
those words from their context. BERT uses a multi-layer bidirectional Transformer encoder; its
self-attention layer performs self-attention in both directions. Figure 1.17 shows an overview of
BERT. Google has introduced two primary versions of the BERT model:
• BERT Base: Comprising 12 layers of Transformers and a total of 110 million parameters.
• BERT Large: Featuring a more complex architecture with 24 layers of Transformers and a
total of 340 million parameters.
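A minimal sketch of using a pretrained BERT Base model through the Hugging Face transformers library is shown below; the model name and the example sentence are illustrative assumptions.

```python
import torch
from transformers import BertTokenizer, BertModel

# Load the pretrained 12-layer BERT Base model and its WordPiece tokenizer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentence = "A dog is running on the beach."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional contextual vector per WordPiece token.
print(outputs.last_hidden_state.shape)   # (1, number_of_tokens, 768)
```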
1.7 Conclusion
In this chapter, we explored the complex landscape of image captioning methodologies,
dissecting various approaches and their underlying architectures. We examined traditional
encoder-decoder structures and attention mechanisms, surveying related works and the utilization
of semantic resources in image captioning. This chapter aimed to provide a solid understanding
of attention-based image captioning and how it can be used to mimic human-like perception in
producing rich and accurate visual descriptions.
In the next chapter, we will delve into the architectural details of our proposed model for
image captioning and explore the background of these techniques.
Chapter 2
Modeling Chapter
2.1 Introduction
In this chapter, we delve into the development of our image captioning system, which integrates a
Convolutional Neural Network (CNN) as the encoder with a Transformer decoder, further
enhanced by semantic knowledge from WordNet. We introduce a streamlined and powerful
technique to incorporate word similarity knowledge from WordNet into the decoder by
directly influencing the model’s attention mechanism.
Figure 2.1: Model architecture.
Figure 2.2: CNN encoder.
1. The first sub-layer uses an external knowledge base, WordNet, to guide the attention
scores and uses a masked multi-head self-attention mechanism. This layer prevents the
model from accessing future information by masking it, ensuring that the model
generates the current word based only on previous words. As shown in Figure 2.1, the
inputs to this layer are identical.
2. The second sub-layer is a multi-head attention mechanism that does not employ
masking. It applies multi-head attention over the output of the first layer, effectively
correlating text information with image information through attention mechanisms.
3. Feed-forward Network: as in the standard Transformer decoder described in Section 1.3.2,
a position-wise fully connected feed-forward network processes the output of the attention
sub-layers (see the sketch below).
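The following is a minimal PyTorch sketch of a decoder block with these three sub-layers in their standard form; the module names, dimensions, and tensor shapes are assumptions made for illustration, and the WordNet-guided scoring applied in the first sub-layer (detailed in Section 2.5) is deliberately omitted here.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Sketch of the three decoder sub-layers (standard Transformer form)."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, words, image_features, causal_mask):
        # 1) Masked self-attention over the words generated so far.
        x, _ = self.self_attn(words, words, words, attn_mask=causal_mask)
        words = self.norm1(words + x)
        # 2) Cross-attention correlating text with image features.
        x, _ = self.cross_attn(words, image_features, image_features)
        words = self.norm2(words + x)
        # 3) Position-wise feed-forward network.
        return self.norm3(words + self.ffn(words))

block = DecoderBlock()
words = torch.randn(1, 7, 512)            # partial caption embeddings
image_features = torch.randn(1, 49, 512)  # projected CNN region features
mask = torch.triu(torch.ones(7, 7, dtype=torch.bool), diagonal=1)
print(block(words, image_features, mask).shape)   # torch.Size([1, 7, 512])
```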
2.3 Delve into WordNet Enhanced Multihead-attention
The Wu-Palmer similarity between two synsets s1 and s2 is computed from the depth of their
least common subsumer (lcs) in the WordNet taxonomy:

score = 2 × depth(lcs) / (depth(s1) + depth(s2)).
Using this approach, we construct a WordNet Similarity Matrix (WSM) of size SeqLen ×
SeqLen. The purpose of this matrix is to enhance the multi-head attention mechanism by
emphasizing semantically similar word pairs. The value of each cell in the similarity matrix
WSM is computed based on semantic relations in WordNet. For a lexical pair (wa, wb):
• If wa and wb are not synonyms, WSMab is set to a value between 0 and 1, as determined
by the Wu-Palmer similarity, which measures their topological distance in WordNet.
• If one or both words in a pair cannot be found in WordNet or do not have a valid Wu-Palmer
similarity value (e.g., stop words like "into"), the similarity value is set to 0.
• For proper names of people and places, if the words in the pair are the same, the value
is set to 1; otherwise, it is 0.
This similarity matrix is then used to adjust the attention scores in the Transformer
decoder, thereby guiding the model to focus more on semantically related words.
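As an illustration, here is a minimal sketch of how such a matrix could be computed with NLTK's WordNet interface. It simplifies the rules above: it picks the first listed synset of each word as its sense and does not reproduce the special handling of synonyms, stop words, or proper names, so it should be read as a sketch rather than the exact WSM construction.

```python
import nltk
import numpy as np
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def wordnet_similarity_matrix(tokens):
    """Sketch of a SeqLen x SeqLen similarity matrix: Wu-Palmer similarity for
    word pairs found in WordNet, 0 for pairs without a valid score."""
    n = len(tokens)
    wsm = np.zeros((n, n))
    for i, wa in enumerate(tokens):
        for j, wb in enumerate(tokens):
            sa, sb = wn.synsets(wa), wn.synsets(wb)
            if sa and sb:
                # First listed synsets as a simple sense choice (an assumption;
                # the sense-selection strategy is not fixed by this sketch).
                score = sa[0].wup_similarity(sb[0])
                wsm[i, j] = score if score is not None else 0.0
    return wsm

print(wordnet_similarity_matrix(["a", "dog", "runs", "on", "the", "beach"]).round(2))
```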
2.3.2 Algorithm to Calculate WordNet Similarity Matrix (WSM)
2.4 Multi-Head Attention
In multi-head attention, the queries, keys, and values are linearly projected h times, an attention
function is applied to each projection in parallel, and the h resulting heads are concatenated and
projected again:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, with head_i = Attention(QW_i^Q, KW_i^K, VW_i^V),

where W_i^Q ∈ R^(dmodel×dk), W_i^K ∈ R^(dmodel×dk), W_i^V ∈ R^(dmodel×dv), and W^O ∈ R^(h·dv×dmodel).
In the masked scaled dot-product attention used by the decoder, the scores and the attention
output are computed as

Scores = QK^T / √dk + MASK,    (2.3)

Attention(Q, K, V) = softmax(Scores) V.    (2.4)
Here, the MASK matrix is used to prevent the model from attending to future tokens, ensuring
that the predictions are based only on previously seen words. By splitting the queries, keys,
and values into multiple heads, the model can focus on different parts of the input sequence,
capturing more nuanced relationships within the data. This structure is pivotal in achieving the
high performance of Transformer models across various tasks, including image captioning in our
context.
2.5 WordNet Enhanced Multi-Head Attention
To incorporate prior knowledge into our model, we modify the multi-head attention phase,
specifically the scaled dot-product attention. We achieve this by calculating the element-wise
product of the attention scores with the WordNet Similarity Matrix (WSM). This adjustment
ensures the model gives more attention to word pairs with higher semantic similarities in the input
caption. The integration process is illustrated in Figure 2, where the WordNet Similarity
Matrix (WSM) is used to enhance the multi-head attention mechanism of the Transformer
model. This enhancement leverages semantic similarities between words to improve the
model’s attention process.
2.5.1 Query, Key, and Value Matrices (Q, K, V)
The query, key, and value matrices (Q, K, V) are linearly projected h times, and the scaled
dot-product attention is then computed following these steps:
1. Normalized WSM:
2. Attention Score Calculation: compute the raw attention scores between the query and key
matrices,

A = QK^T / √dk + Mask,    (2.6)

where dk is the dimension of the key vectors.
3. Normalized Scores:

A′ = Softmax(A).    (2.7)

4. Element-Wise Multiplication with WSM′: integrate the WordNet Similarity Matrix to adjust
the attention scores,

α = A′ ⊙ WSM′,    (2.8)

where ⊙ denotes element-wise multiplication. The output is then

O = αV.    (2.9)
Figure 2.7: Attention calculation process.
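A minimal sketch of Equations (2.6)-(2.9) is given below; the tensor shapes, the additive causal mask, and the random placeholder standing in for the normalized WSM are assumptions made only for illustration.

```python
import math
import torch
import torch.nn.functional as F

def wordnet_enhanced_attention(Q, K, V, wsm_norm, causal_mask):
    """Sketch of Equations (2.6)-(2.9): the softmax attention weights are
    re-weighted element-wise by the normalized WordNet Similarity Matrix."""
    d_k = Q.size(-1)
    A = Q @ K.transpose(-2, -1) / math.sqrt(d_k) + causal_mask   # (2.6)
    A_prime = F.softmax(A, dim=-1)                               # (2.7)
    alpha = A_prime * wsm_norm                                   # (2.8) element-wise product
    return alpha @ V                                             # (2.9)

seq_len, d_k = 6, 64
Q = K = V = torch.randn(1, seq_len, d_k)
wsm_norm = torch.rand(seq_len, seq_len)            # placeholder for the normalized WSM
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
print(wordnet_enhanced_attention(Q, K, V, wsm_norm, causal_mask).shape)  # (1, 6, 64)
```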
2.6 Conclusion
In this chapter, we presented our image captioning system, combining a Convolutional Neural
Network (CNN) as the encoder with a Transformer as the decoder, enhanced by semantic
knowledge from WordNet. Our key innovation is the integration of WordNet into the
decoder’s multi-head attention mechanism, allowing the model to leverage word similarity
knowledge. This enhancement improves the model’s ability to generate accurate and
contextually relevant captions by capturing deeper semantic relationships. The detailed
architecture and components discussed demonstrate a powerful approach to image captioning,
achieving better alignment with human judgments through the innovative use of external
semantic knowledge.
Bibliography
Abbaspour, S., Fotouhi, F., Sedaghatbaf, A., Fotouhi, H., Vahabi, M., and Linden, M. (2020).
A comparative analysis of hybrid deep learning models for human activity recognition.
Sensors, 20(19):5707.
Afzal, M. K., Shardlow, M., Tuarob, S., Zaman, F., Sarwar, R., Ali, M., Aljohani, N. R., Lytras,
M. D., Nawaz, R., and Hassan, S.-U. (2023). Generative image captioning in urdu using
deep learning. Journal of Ambient Intelligence and Humanized Computing, 14(6):7719–
7731.
Ahmad, R. A., Azhar, M., and Sattar, H. (2022). An image captioning algorithm based on the
hybrid deep learning technique (cnn+ gru). In 2022 International Conference on Frontiers
of Information Technology (FIT), pages 124–129. IEEE.
AI, A. (2023). Must-read starter guide to mastering attention mechanisms in machine learning.
Arize AI.
Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L.
(2017). Bottom-up and top-down attention for image captioning and vqa. arXiv preprint
arXiv:1707.07998, 2(4):8.
Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. (2018).
Bottom-up and top-down attention for image captioning and visual question answering. In
Proceedings of the IEEE conference on computer vision and pattern recognition, pages
6077–6086.
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to
align and translate. arXiv preprint arXiv:1409.0473.
Chen, L., Zhang, H., Xiao, J., Nie, L., Shao, J., Liu, W., and Chua, T.-S. (2017). Sca-cnn: Spatial
and channel-wise attention in convolutional networks for image captioning. In Proceedings of
the IEEE conference on computer vision and pattern recognition, pages 5659–5667.
Cho, K., Van Merriënboer, B., Bahdanau, D., and Bengio, Y. (2014a). On the properties of neural
machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and
Bengio, Y. (2014b). Learning phrase representations using rnn encoder-decoder for statistical
machine translation. arXiv preprint arXiv:1406.1078.
Chu, Y., Yue, X., Yu, L., Sergei, M., and Wang, Z. (2020). Automatic image captioning based
on resnet50 and lstm with soft attention. Wireless Communications and Mobile Computing,
2020:1–7.
Cornia, M., Stefanini, M., Baraldi, L., and Cucchiara, R. (2020). Meshed-memory transformer
for image captioning. In Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition, pages 10578–10587.
Devi, M. S. and Mittal, H. (2016). Machine learning techniques with ontology for subjective
answer evaluation. arXiv preprint arXiv:1605.02442.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep
bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko,
K., and Darrell, T. (2015). Long-term recurrent convolutional networks for visual
recognition and description. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 2625–2634.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani,
M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words:
Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Dubey, S., Olimov, F., Rafique, M. A., Kim, J., and Jeon, M. (2023). Label-attention transformer
with geometrically coherent objects for image captioning. Information Sciences, 623:812–831.
Fang, H., Gupta, S., Iandola, F., Srivastava, R. K., Deng, L., Dollár, P., Gao, J., He, X., Mitchell,
M., Platt, J. C., et al. (2015). From captions to visual concepts and back. In Proceedings of
the IEEE conference on computer vision and pattern recognition, pages 1473–1482.
Farhadi, A., Hejrati, M., Sadeghi, M. A., Young, P., Rashtchian, C., Hockenmaier, J., and Forsyth,
D. (2010). Every picture tells a story: Generating sentences from images. In Computer Vision–
ECCV 2010: 11th European Conference on Computer Vision, Heraklion, Crete, Greece,
September 5-11, 2010, Proceedings, Part IV 11, pages 15–29. Springer.
Gao, L., Li, X., Song, J., and Shen, H. T. (2019). Hierarchical lstms with adaptive attention for
visual captioning. IEEE transactions on pattern analysis and machine intelligence, 42(5):1112–
1131.
Girshick, R. (2015). Fast r-cnn. In Proceedings of the IEEE international conference on computer
vision, pages 1440–1448.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). Rich feature hierarchies for accurate
object detection and semantic segmentation. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 580–587.
Guo, L., Liu, J., Zhu, X., Yao, P., Lu, S., and Lu, H. (2020). Normalized and geometry-aware
self-attention network for image captioning. In Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, pages 10327–10336.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition, pages
770–778.
Hernández, A. and Amigó, J. M. (2021). Attention mechanisms and their applications to complex
systems. Entropy, 23(3):283.
Hossain, M. Z., Sohel, F., Shiratuddin, M. F., and Laga, H. (2019). A comprehensive survey
of deep learning for image captioning. ACM Computing Surveys (CsUR), 51(6):1–36.
Hu, J. C., Cavicchioli, R., and Capotondi, A. (2022). Expansionnet v2: Block static expansion in
fast end to end training for image captioning. arXiv preprint arXiv:2208.06551.
Huang, L., Wang, W., Chen, J., and Wei, X.-Y. (2019). Attention on attention for image
captioning. In Proceedings of the IEEE/CVF international conference on computer vision,
pages 4634–4643.
Ji, J., Luo, Y., Sun, X., Chen, F., Luo, G., Wu, Y., Gao, Y., and Ji, R. (2021). Improving
image captioning by leveraging intra-and inter-layer global representation in transformer
network. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages
1655–1663.
Jia, X., Gavves, E., Fernando, B., and Tuytelaars, T. (2015). Guiding the long-short term memory
model for image caption generation. In Proceedings of the IEEE international conference
on computer vision, pages 2407–2415.
Jiang, J. J. and Conrath, D. W. (1997). Semantic similarity based on corpus statistics and lexical
taxonomy. arXiv preprint cmp-lg/9709008.
Jurafsky, D. and Martin, J. H. (n.d.). Speech and language processing: Sequence processing with
recurrent networks. Speech and Language Processing.
Karpathy, A. and Fei-Fei, L. (2015). Deep visual-semantic alignments for generating image
descriptions. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 3128–3137.
Kaur, M. and Mohta, A. (2019). A review of deep learning with recurrent neural network. In
2019 International Conference on Smart Systems and Inventive Technology (ICSSIT), pages
460–465. IEEE.
Kiros, R., Salakhutdinov, R., and Zemel, R. S. (2014). Unifying visual-semantic embeddings
with multimodal neural language models. arXiv preprint arXiv:1411.2539.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep
convolutional neural networks. Advances in neural information processing systems, 25.
Leacock, C. and Chodorow, M. (1998). Combining local context and wordnet similarity for word sense identification.
WordNet: A Lexical Reference System and its Application, pages 265–283.
LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.
Li, G., Zhu, L., Liu, P., and Yang, Y. (2019). Entangled transformer for image captioning. In
Proceedings of the IEEE/CVF international conference on computer vision, pages 8928–8937.
Li, Y., Bandar, Z. A., and McLean, D. (2003). An approach for measuring semantic similarity
between words using multiple information sources. IEEE Transactions on knowledge and
data engineering, 15(4):871–882.
Lin, D. (1998). An information-theoretic definition of similarity. In ICML, volume 98, pages
296–304.
Liu, W., Chen, S., Guo, L., Zhu, X., and Liu, J. (2021). Cptr: Full transformer network for image
captioning. arXiv preprint arXiv:2101.10804.
Lu, T., Wang, J., and Min, F. (2023). Full-memory transformer for image captioning. Symmetry,
15(1):190.
Luong, M.-T., Pham, H., and Manning, C. D. (2015). Effective approaches to attention-based
neural machine translation. arXiv preprint arXiv:1508.04025.
Mallick, V. R. and Naik, D. (2021). Describing image with attention based gru. In 2021 6th
International Conference for Convergence in Technology (I2CT), pages 1–6. IEEE.
Medsker, L. and Jain, L. C. (1999). Recurrent neural networks: design and applications. CRC
press.
Miller, G. A. (1995). Wordnet: a lexical database for english. Communications of the ACM,
38(11):39–41.
Mokady, R., Hertz, A., and Bermano, A. H. (2021). Clipcap: Clip prefix for image captioning.
arXiv preprint arXiv:2111.09734.
Nguyen, B. T., Nguyen, S. T., and Vo, A. H. (2023). Channel and spatial attention mechanism
for fashion image captioning. International Journal of Electrical & Computer Engineering
13(5).
Nirenburg, S. (1995). Bar Hillel and machine translation: then and now. In Proceedings of the
Fourth Bar Ilan Symposium on Foundations of Artificial Intelligence.
Pan, Y., Yao, T., Li, Y., and Mei, T. (2020). X-linear attention networks for image captioning. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition,
pages 10971–10980.
Parameswaran, S. N. and Das, S. (2018). A bottom-up and top-down approach for image
captioning using transformer. In Proceedings of the 11th Indian Conference on Computer
Vision, Graphics and Image Processing, pages 1–9.
Pedersoli, M., Lucas, T., Schmid, C., and Verbeek, J. (2017). Areas of attention for image
captioning. In Proceedings of the IEEE international conference on computer vision, pages
1242–1250.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster r-cnn: Towards real-time object detection
with region proposal networks. Advances in neural information processing systems, 28.
Sharma, H. and Padha, D. (2023). From templates to transformers: A survey of multimodal
image captioning decoders. In 2023 International Conference on Computer, Electronics &
Electrical Engineering & their Applications (IC2E3), pages 1–6. IEEE.
Shen, T., Zhou, T., Long, G., Jiang, J., Wang, S., and Zhang, C. (2018). Reinforced self-
attention network: a hybrid of hard and soft attention for sequence modeling. arXiv preprint
arXiv:1801.10296.
Shiri, F. M., Perumal, T., Mustapha, N., and Mohamed, R. (2023). A comprehensive overview
and comparative analysis on deep learning models: Cnn, rnn, lstm, gru. arXiv preprint
arXiv:2305.17473.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural
networks. Advances in neural information processing systems, 27.
Tao, C., Lu, J., Lang, J., Peng, X., Cheng, K., and Duan, S. (2021). Short-term forecasting of
photovoltaic power generation based on feature selection and bias compensation–lstm network.
Energies, 14(11):3086.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and
Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing
systems, 30.
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015). Show and tell: A neural image caption
generator. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 3156–3164.
Wang, C., Yang, H., and Meinel, C. (2018). Image captioning with deep bidirectional lstms
and multi-task learning. ACM Transactions on Multimedia Computing, Communications, and
Applications (TOMM), 14(2s):1–20.
Wang, X., Xu, J., Shi, W., and Liu, J. (2019). Ogru: An optimized gated recurrent unit
neural network. In Journal of Physics: Conference Series, volume 1325, page 012089. IOP
Publishing.
Wang, Y., Xu, J., and Sun, Y. (2022). End-to-end transformer based model for image captioning.
In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2585–2594.
Wu, Q., Shen, C., Wang, P., Dick, A., and Van Den Hengel, A. (2017). Image captioning and
visual question answering based on attributes and external knowledge. IEEE transactions
on pattern analysis and machine intelligence, 40(6):1367–1381.
Wu, Z. and Palmer, M. (1994). Verb semantics and lexical selection. In Proceedings of the 32nd
Annual Meeting of the Association for Computational Linguistics, pages 133–138. Association
for Computational Linguistics.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y.
(2015). Show, attend and tell: Neural image caption generation with visual attention. In
International conference on machine learning, pages 2048–2057. PMLR.
Yao, B. Z., Yang, X., Lin, L., Lee, M. W., and Zhu, S.-C. (2010). I2t: Image parsing to text
description. Proceedings of the IEEE, 98(8):1485–1508.
Yu, J., Li, J., Yu, Z., and Huang, Q. (2019). Multimodal transformer with multi-view visual
representation for image captioning. IEEE transactions on circuits and systems for video
technology, 30(12):4467–4480.
Zhang, H., Ma, C., Jiang, Z., and Lian, J. (2022). Image caption generation using contextual
information fusion with bi-lstm-s. IEEE Access, 11:134–143.
Zheng, J. and Pun, C.-M. (2022). Hybrid-spatial transformer for image captioning. In
Proceedings of the 5th International Conference on Control and Computer Vision, pages
22–28.
Zhiwei, L. et al. (2022). Bearing fault diagnosis of end-to-end model design based on 1dcnn-
gru network. Discrete Dynamics in Nature and Society, 2022.
Zhou, Z., Wang, Y., and Gu, J. (2008). New model of semantic similarity measuring in wordnet.
In 2008 3rd International Conference on Intelligent System and Knowledge Engineering,
volume 1, pages 256–261. IEEE.