
RBBA: ResNet - BERT - Bahdanau Attention for Image Caption Generator

2023 14th International Conference on Information and Communication Technology Convergence (ICTC) | 979-8-3503-1327-7/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICTC58733.2023.10392496

Duc-Hieu Hoang, Faculty of Electrical & Electronics Eng., Ton Duc Thang University, Ho Chi Minh City, Vietnam ([email protected])
Duc Ngoc Minh Dang*, Dept. of Computing Fundamental, FPT University, Ho Chi Minh City, Vietnam ([email protected])
Hanh Dang-Ngoc, School of Electrical and Data Engineering, University of Technology Sydney, Sydney, NSW, Australia ([email protected])
Anh-Khoa Tran, Faculty of Electrical and Electronics, Ton Duc Thang University, Ho Chi Minh City, Vietnam ([email protected])
Phuong-Nam Tran, Dept. of Computing Fundamental, FPT University, Ho Chi Minh City, Vietnam ([email protected])
Cuong Tuan Nguyen, Dept. of Information Technology Specialization, FPT University, Ho Chi Minh City, Vietnam ([email protected])

*Corresponding author: Duc Ngoc Minh Dang ([email protected])

Abstract—In recent years, the topic of image caption generators has gained significant attention. Several successful projects have emerged in this field, showcasing notable advancements. Image caption generators automatically generate descriptive captions for images through encoder and decoder mechanisms. The encoder leverages computer vision models, while the decoder utilizes natural language processing models. In this study, we assess a comprehensive set of seven distinct methodologies, including six existing methods from prior research and one newly proposed. These methods are trained and evaluated with the bilingual evaluation understudy (BLEU) metric on the Flickr8K dataset. In our experiments, the proposed ResNet50 – BERT – Bahdanau Attention model outperforms the other models with a BLEU-1 score of 0.532143 and a BLEU-4 score of 0.126316.

Index Terms—Deep learning, Natural language processing, Encoder-Decoder, Flickr8K, BLEU, Image Caption.

I. INTRODUCTION

Generating textual descriptions or captions for images is one of the most challenging tasks for artificial intelligence (AI). Despite the difficulty, image caption generators have a wide variety of uses, from providing automatic image descriptions for the blind to enhancing image search outcomes and producing more interesting social media posts. The development of more precise and sophisticated image caption generators has advanced significantly in recent years, and they are now widely used across a variety of industries.

Image caption generators analyze the contents of an image using deep learning techniques and generate a description of what is happening in the image. Typically, image caption generators employ a combination of computer vision techniques to process the images and natural language processing (NLP) algorithms to generate the descriptions. These algorithms can be trained on enormous datasets composed of photos and their captions, teaching the system how to correlate various visual cues with relevant descriptions.

II. RELATED WORKS

A. Convolutional neural networks

Convolutional neural networks (CNNs) play a crucial role in computer vision. The emergence of architectures such as VGGNet [1] and ResNet [2] has significantly enhanced the efficacy of computer vision models across various tasks.

VGGNet [1] is a well-known convolutional neural network architecture consisting of multiple layers with small convolutional filters. It gained popularity for its simplicity and effectiveness in image classification tasks. The network typically follows a consistent pattern of stacking convolutional layers with 3x3 filters, followed by max-pooling layers to reduce the spatial dimensions. The VGGNet architecture offers various configurations, commonly known as VGG16 and VGG19, depending on the depth of the network. It has demonstrated impressive performance on various image classification benchmarks, establishing itself as a reliable and effective choice for deep-learning tasks.

ResNet [2] is a groundbreaking convolutional neural network architecture that has revolutionized the field of computer vision. ResNet addresses a common challenge encountered in deep neural networks known as the vanishing gradient problem. As networks become deeper, the gradients can vanish, leading to difficulties in training and optimization. To address this issue, ResNet utilizes skip connections that allow the network to bypass certain layers. By doing so, ResNet enables the direct flow of information from earlier layers to subsequent layers, facilitating the learning process. ResNet has achieved remarkable success, outperforming previous models in various computer vision tasks, such as image classification, object detection, and image segmentation. Its deep architecture, with variants like ResNet-50, ResNet-101, and ResNet-152, has become a standard benchmark in the field. Furthermore, ResNet's impact extends beyond computer vision: the concept of residual learning has inspired advancements in other domains, including natural language processing and audio signal processing.

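A minimal Keras sketch may help make the skip-connection idea concrete. The block below is an illustrative identity residual block, not the exact bottleneck block used inside ResNet50; the layer sizes are assumptions.

```python
# Minimal sketch of a residual block with an identity skip connection,
# illustrating the idea behind ResNet [2]. Filter sizes are illustrative,
# not the exact configuration used inside ResNet50.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters: int):
    """Two 3x3 convolutions whose output is added back to the input."""
    shortcut = x                                    # identity path (skip connection)
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])                 # gradients can flow through the shortcut
    return layers.ReLU()(y)

inputs = layers.Input(shape=(224, 224, 64))
outputs = residual_block(inputs, filters=64)
model = tf.keras.Model(inputs, outputs)
model.summary()
```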
B. Natural language processing

The field of NLP includes various activities, such as text generation, sentiment analysis, language translation, speech recognition, and natural language comprehension. Thanks to advancements in machine learning and deep learning techniques, NLP has experienced remarkable progress in recent years. With the appearance of Long Short-Term Memory networks (LSTMs) [3], Gated Recurrent Unit networks (GRUs) [4], and especially the Transformer [5], the field of NLP witnessed significant advancements. These breakthroughs in sequence modeling and language understanding have revolutionized various NLP tasks, including machine translation, text generation, sentiment analysis, and question-answering.

LSTMs and GRUs are designed to address the vanishing gradient problem in traditional recurrent neural networks (RNNs). They employ gating mechanisms that enable the networks to selectively update and forget information over time, allowing them to capture long-range dependencies in sequential data. LSTMs have been widely used in various NLP applications, and GRUs, a simplified variant of LSTMs, have gained popularity due to their computational efficiency.

However, it was the introduction of the Transformer model in 2017 that truly revolutionized the field of NLP. The Transformer [5] introduced a novel architecture based solely on self-attention mechanisms, doing away with recurrent connections entirely. This architecture enabled parallel processing of input sequences, making it highly scalable and efficient. The self-attention mechanism in Transformers allows the model to weigh the importance of different words or tokens within a sequence when processing each word. This attention mechanism provides a global context for each word, enabling the model to capture dependencies between words regardless of their position in the sequence. The use of self-attention also reduces the vanishing gradient problem, as information can flow directly from any word to any other word in the sequence.
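To make the self-attention computation concrete, the following NumPy sketch applies scaled dot-product attention to a toy sequence. It only illustrates the mechanism described in [5]; the array sizes and random projections are arbitrary assumptions, not anything used in this paper.

```python
# Toy scaled dot-product self-attention, as described in the Transformer paper [5].
# Sizes are arbitrary; this only shows how every token attends to every other token.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model = 5, 8                      # 5 tokens, 8-dimensional embeddings
tokens = np.random.randn(seq_len, d_model)   # stand-in for word embeddings

# Learned projections in a real model; random here for illustration.
W_q, W_k, W_v = (np.random.randn(d_model, d_model) for _ in range(3))
Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v

scores = Q @ K.T / np.sqrt(d_model)          # pairwise relevance of every token to every other
weights = softmax(scores, axis=-1)           # each row sums to 1
context = weights @ V                        # context-aware representation of each token

print(weights.shape, context.shape)          # (5, 5) (5, 8)
```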
C. Image caption generators

Image caption generators are essential tools that help improve accessibility for individuals with visual impairments. These systems automatically create descriptive captions for images, allowing visually impaired users to better understand visual content shared online, including posts on social media, articles in the news, and web pages. In image caption generators, two common approaches are used to generate descriptive captions for images: merged models and injected models [6]. Merged models combine image features with NLP techniques to generate captions: they process the image through CNNs to extract visual features, which are then merged with textual features in a subsequent neural network to generate the caption. On the other hand, injected models incorporate the image features directly into the RNNs that generate the caption. Instead of merging the features at a later stage, the image features are injected into the RNNs during the caption generation process, influencing the output at each time step. Both approaches have their advantages and trade-offs, and their performance can vary depending on the dataset and specific requirements of the image captioning task.

In merged models, the recurrent neural networks (RNNs) never directly interact with the image feature vectors or any derived vector from the image. Instead, the image is added to the language model after the RNNs have encoded the full prefix. This architecture is known as late binding, where the image representation remains constant throughout the decoding process and is not changed at each time step.

In injected models, the image feature vector or a derived vector from the image serves as input to the RNNs in parallel with the word feature vectors of the caption prefix, such that either the RNNs take two separate inputs, or the word feature vector is combined with the image feature vector into a single input before being passed to the RNNs. The image feature vector does not have to be identical for every word, nor does it need to be associated with each word. This mixed binding architecture allows for some flexibility in the image representation. However, if the same image is repeatedly provided to the RNNs at each time step, modifying the image representation becomes more challenging, as the RNNs' hidden state is refreshed with the image during each iteration.

III. IMAGE CAPTION GENERATOR

A. Merge-based Xception – Word2Vec (MXW2V)

Fig. 1. Architecture of the "Merge Xception - Word2Vec" method

In the merge-based model, the RNNs are not exposed to the image feature vector at any point. Instead, the image is processed by CNNs and introduced into the language model after the prefix has been encoded by the RNNs in its entirety. This is a late binding architecture, and it does not modify the image representation at every time step. The MXW2V system architecture is made up of three sub-models: the feature extraction model (FE-MODEL), the caption encoding model (CAP-ENC-MODEL), and finally the merged information decoding model. The Xception architecture [7] is employed in the FE-MODEL, while the Word2Vec [8] technique is employed in the CAP-ENC-MODEL. The merged information decoding model simply concatenates the outputs of the feature extraction and caption encoding models and forwards the result to a dense layer using the ReLU activation function. Softmax is used as the activation function to predict the output word. The details of MXW2V are depicted in Figure 1.
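To make the three sub-models concrete, the following hedged Keras sketch wires up a merge-based decoder in the spirit of MXW2V: image features and the encoded caption prefix are combined only at the end, then passed through a ReLU dense layer and a softmax. The vocabulary size, caption length, feature dimension, and layer widths are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of a merge-based captioning decoder in the spirit of MXW2V:
# image features and an encoded caption prefix are concatenated only at the end,
# then passed through a ReLU dense layer and a softmax over the vocabulary.
# Dimensions below are illustrative assumptions, not the paper's exact settings.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 8000      # assumed vocabulary size
MAX_LEN = 34           # assumed maximum caption length
FEAT_DIM = 2048        # Xception's global-average-pooled feature size

# FE-MODEL: precomputed CNN image features enter through a dense projection.
img_in = layers.Input(shape=(FEAT_DIM,), name="image_features")
img_vec = layers.Dense(256, activation="relu")(layers.Dropout(0.5)(img_in))

# CAP-ENC-MODEL: the caption prefix is embedded (Word2Vec-style vectors could be
# loaded into this Embedding layer) and summarized by an LSTM.
cap_in = layers.Input(shape=(MAX_LEN,), name="caption_prefix")
cap_emb = layers.Embedding(VOCAB_SIZE, 256, mask_zero=True)(cap_in)
cap_vec = layers.LSTM(256)(layers.Dropout(0.5)(cap_emb))

# Merged information decoding model: late binding of the two representations.
merged = layers.Concatenate()([img_vec, cap_vec])
hidden = layers.Dense(256, activation="relu")(merged)
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

model = Model(inputs=[img_in, cap_in], outputs=next_word)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```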
B. Merge-based InceptionResnetV2 – GloVe (MIRG)

Fig. 2. Architecture of the "Merge InceptionResnetV2 - GloVe" method

The MIRG approach builds upon MXW2V, which utilized Xception and Word2Vec. However, it introduces notable improvements by leveraging the power of InceptionResNetV2 [9] and the GloVe [10] technique; its architecture is illustrated in Figure 2. InceptionResNetV2 [9] is an exceptionally deep network, comprising a total of 164 layers. It combines the innovative ideas of the Inception module with residual connections. These residual connections reduce the vanishing gradient problem commonly encountered in deep networks and facilitate the training of highly complex models. The GloVe [10] technique changed the generation of word feature vectors by utilizing global word co-occurrence data. By capturing the semantic relationships between words, GloVe effectively merges the advantages of count-based Latent Semantic Analysis [11] and context-based Word2Vec [8].

C. Inject-based Xception – Word2Vec (IXW2V)

Fig. 3. Architecture of the "Inject Xception - Word2Vec" method

The inject architecture, similar to the merge architecture, incorporates the combination of image characteristics and caption words into RNNs. In this approach, each caption word is processed alongside the image features, creating a new representation of the image for different parts of the phrase as it is generated. The IXW2V architecture specifically utilizes the inject architecture and is constructed based on it. Figure 3 illustrates the architecture of the IXW2V model. In IXW2V, the image and text features are concatenated together and then passed through LSTMs [3]. By feeding the concatenated representation through LSTMs, the model can effectively learn the relationships between the image and text features and capture the contextual information necessary for predicting the next word in the sentence.

D. Inject-based InceptionResnetV2 – GloVe (IIRG)

Fig. 4. Architecture of the "Inject InceptionResnetV2 - GloVe" method

Similar to the IXW2V approach, this architecture leverages the InceptionResnetV2 [9] and GloVe [10] methods to extract features from images and text, respectively. These features are subsequently concatenated and fed into LSTMs to predict the next word. Figure 4 illustrates the architecture of the IIRG model.
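For contrast with the merge-based sketch above, the following hedged Keras sketch shows an inject-style decoder in the spirit of IXW2V and IIRG, in which the image feature vector accompanies every word of the caption prefix into the LSTM. All dimensions and layer sizes are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of an inject-style captioning decoder (cf. IXW2V / IIRG):
# the image feature vector is repeated across time steps and concatenated with
# the word embeddings, so the LSTM sees the image at every step.
# All sizes are illustrative assumptions, not the paper's exact configuration.
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 8000      # assumed vocabulary size
MAX_LEN = 34           # assumed maximum caption length
FEAT_DIM = 2048        # pooled CNN feature size (Xception / InceptionResNetV2)

img_in = layers.Input(shape=(FEAT_DIM,), name="image_features")
img_vec = layers.Dense(256, activation="relu")(img_in)
img_seq = layers.RepeatVector(MAX_LEN)(img_vec)            # same image vector at each time step

cap_in = layers.Input(shape=(MAX_LEN,), name="caption_prefix")
cap_emb = layers.Embedding(VOCAB_SIZE, 256)(cap_in)        # GloVe/Word2Vec weights could be loaded here

# Inject: word embedding and image vector are combined into one input per step.
step_inputs = layers.Concatenate(axis=-1)([cap_emb, img_seq])
hidden = layers.LSTM(256)(step_inputs)
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

model = Model(inputs=[img_in, cap_in], outputs=next_word)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```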
E. VGG16 – GRU – Bahdanau attention (VGBA)

Fig. 5. Architecture of the "VGG16 - GRU - Bahdanau attention" method

VGBA is based on the pioneering work of [12], which leverages an advanced encoding and decoding framework. Figure 5 illustrates the details of the VGBA architecture. The encoder incorporates VGG16 [1] for visual feature extraction and a tokenized encoding method for caption processing. By including these components, the encoder effectively processes the input data. In the decoder section, both the GRU [4] network and the Bahdanau attention mechanism [13] are employed to enhance the output. The Bahdanau attention mechanism has demonstrated significant performance improvements. Its core concept involves assigning attention weights to prioritize specific feature vectors within the input sequence. These attention weights inform the decoder about the level of attention each input word should receive at different stages of decoding. By utilizing a set of attention weights, the decoder can focus on the most relevant portion of the image, guided by the alignment scores computed by a feed-forward neural network. This attention mechanism enables the input and output sequences to concentrate on the most crucial elements, resulting in improved performance.
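The attention computation described above can be sketched as a small Keras layer. This is a hedged illustration of additive (Bahdanau) attention [13], where a feed-forward network scores each image feature vector against the decoder state; variable names and sizes are illustrative, not taken from the paper.

```python
# Hedged sketch of Bahdanau (additive) attention [13] as a Keras layer.
# score = V^T * tanh(W1 * features + W2 * decoder_state); weights = softmax(score).
# Layer sizes and variable names are illustrative, not taken from the paper.
import tensorflow as tf
from tensorflow.keras import layers

class BahdanauAttention(layers.Layer):
    def __init__(self, units: int):
        super().__init__()
        self.W1 = layers.Dense(units)   # projects the image feature vectors
        self.W2 = layers.Dense(units)   # projects the decoder hidden state
        self.V = layers.Dense(1)        # produces one alignment score per feature vector

    def call(self, features, hidden):
        # features: (batch, num_regions, feat_dim); hidden: (batch, hidden_dim)
        hidden_with_time = tf.expand_dims(hidden, 1)                      # (batch, 1, hidden_dim)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
        weights = tf.nn.softmax(scores, axis=1)                           # attention over regions
        context = tf.reduce_sum(weights * features, axis=1)               # weighted sum of features
        return context, weights

# Toy usage: 49 image regions with 512-dim features, one 256-dim decoder state.
attn = BahdanauAttention(units=256)
context, weights = attn(tf.random.normal((1, 49, 512)), tf.random.normal((1, 256)))
print(context.shape, weights.shape)   # (1, 512) (1, 49, 1)
```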
F. Xception – TT-LSTMs with Bi-LSTMs (XBi-LSTM)

Fig. 6. Architecture of the "Xception - TT-LSTMs with Bi-LSTMs" method

The encoder-decoder structure in Figure 6 is used by the approach known as TT-LSTM [14], which is built using a combination of the merge and inject models. TT-LSTM suggests creating two sub-encoder models, one for the text and one for the image, whose outputs are then merged. The Xception network is utilized in the image encoder model. Bi-LSTM is employed for the decoder, while LSTMs are applied to the language encoder.

G. ResNet50 – BERT – Bahdanau attention (RBBA)

Fig. 7. Architecture of the "ResNet50 - BERT - Bahdanau attention" method

This architecture utilizes the ResNet50 [2] and BERT [15] methods, following the same overall approach as VGBA, to extract features from images and text, respectively. In this architecture, the Bahdanau attention mechanism is combined with LSTMs [3] to generate the final output. The detailed architecture is illustrated in Figure 7.
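To illustrate how the two RBBA encoders could be wired together, the following hedged sketch extracts spatial features with a pretrained ResNet50, encodes a caption prefix with a pretrained BERT, and applies the same additive-attention computation inline, followed by a single LSTM decoding step. The checkpoint names, dimensions, and the one-step decoding shown here are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the RBBA building blocks: ResNet50 image features, BERT text
# features, and additive attention feeding an LSTM step. Checkpoint names,
# sizes, and the single decoding step are illustrative assumptions only.
import tensorflow as tf
from tensorflow.keras import layers
from transformers import BertTokenizer, TFBertModel

# --- Image encoder: ResNet50 feature map (7x7x2048 for a 224x224 input) ---
cnn = tf.keras.applications.ResNet50(include_top=False, weights="imagenet")
image = tf.random.uniform((1, 224, 224, 3))                       # stand-in for a real image
feat_map = cnn(tf.keras.applications.resnet50.preprocess_input(image * 255.0))
regions = tf.reshape(feat_map, (1, -1, 2048))                     # (1, 49, 2048) image regions

# --- Text encoder: BERT embeddings of the caption prefix ---
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")
enc = tokenizer("a little girl playing in a", return_tensors="tf")
text_states = bert(**enc).last_hidden_state                       # (1, seq_len, 768)
query = text_states[:, -1, :]                                     # last-token state as the query

# --- Bahdanau-style additive attention over image regions ---
W1, W2, V = layers.Dense(256), layers.Dense(256), layers.Dense(1)
scores = V(tf.nn.tanh(W1(regions) + W2(query)[:, None, :]))       # (1, 49, 1)
weights = tf.nn.softmax(scores, axis=1)
context = tf.reduce_sum(weights * regions, axis=1)                # (1, 2048)

# --- One LSTM decoding step over [context ; query], then a vocabulary softmax ---
VOCAB_SIZE = 8000                                                 # assumed vocabulary size
step_in = tf.concat([context, query], axis=-1)[:, None, :]        # (1, 1, 2816)
lstm_out = layers.LSTM(256)(step_in)
next_word_probs = layers.Dense(VOCAB_SIZE, activation="softmax")(lstm_out)
print(next_word_probs.shape)                                      # (1, 8000)
```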

IV. PERFORMANCE EVALUATION

A. Experiment setup

In this research, we utilized the Flickr8K dataset, which is a widely used and diverse dataset for image captioning tasks. The dataset consists of 8,000 high-quality images sourced from the popular online platform Flickr. These images cover a wide range of everyday themes, including scenes of animals such as dogs and cats, and people engaging in various activities. The dataset also includes images depicting fun and entertainment activities, sports events, and daily life routines. The diversity of themes and subjects within the Flickr8K dataset makes it a suitable choice for training and testing the image caption generator models. By incorporating such a diverse collection of images, we aimed to enhance the model's ability to generate accurate and meaningful captions for a wide range of visual content.

To assess the performance of our models, we employed the BLEU metric [16], which is a widely used and established evaluation measure in the field of natural language processing. BLEU is commonly utilized to evaluate the quality of machine-generated text by comparing it to one or more reference translations or human-generated texts. It calculates a score ranging from 0 to 1, with a higher score indicating a better match between the generated text and the reference text. The BLEU metric takes into account factors such as n-gram precision and a brevity penalty. It considers both the presence and the ordering of words, thereby capturing the fluency and correctness of the generated captions. By employing the BLEU metric, we aimed to quantitatively evaluate the performance of our model and compare it with other methods in the field. The use of this standard metric allows for a fair and objective assessment of the quality of the captions generated by our model.
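As a concrete, hedged example of how such scores can be computed, the snippet below evaluates BLEU-1 through BLEU-4 with NLTK's corpus_bleu. The paper does not state which BLEU implementation was used, so this illustrates the metric rather than reproducing the authors' evaluation code; the captions are toy data.

```python
# Illustration of computing BLEU-1..BLEU-4 with NLTK. The paper does not state
# which BLEU implementation was used; captions below are toy data.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each hypothesis has a list of reference captions (Flickr8K provides five per image).
references = [
    [["young", "girl", "is", "playing", "in", "fountain", "of", "water"],
     ["girl", "with", "a", "black", "swimsuit", "plays", "in", "the", "sprinkler"]],
]
hypotheses = [
    ["the", "little", "girl", "is", "playing", "in", "the", "water", "fountain"],
]

smooth = SmoothingFunction().method1   # avoids zero scores when an n-gram order has no match
for n in range(1, 5):
    w = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)   # e.g. BLEU-2 -> (0.5, 0.5, 0, 0)
    score = corpus_bleu(references, hypotheses, weights=w, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.4f}")
```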
We carefully selected specific parameter configurations to optimize our research outcomes. These include utilizing the Adam optimizer with a learning rate of 0.001 to guide the training process. For the cost function, we employed the cross-entropy loss, a commonly used measure for multi-class classification tasks. To initialize the LSTMs/GRUs, we applied the Glorot uniform initializer, which helps ensure effective information flow within the model. During training, we used a batch size of 32, which determines the number of samples processed in each iteration. The model was trained for a total of 10 epochs. To prevent overfitting and enhance generalization, we incorporated a dropout rate of 0.5, which randomly deactivates a portion of the neural network units during training. This regularization technique encourages the model to learn more robust and generalized features.
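The training configuration listed above could be expressed in Keras roughly as follows. Only the hyperparameters (Adam with a learning rate of 0.001, cross-entropy loss, Glorot uniform initialization, dropout of 0.5, batch size 32, 10 epochs) come from the paper; the model body, vocabulary size, and data arrays are hypothetical placeholders.

```python
# Hedged sketch of the training setup described in Section IV-A. The model body
# and the training arrays are placeholders; only the hyperparameters
# (Adam, lr=0.001, cross-entropy, glorot_uniform, dropout=0.5, batch=32, 10 epochs)
# come from the paper.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE, MAX_LEN, FEAT_DIM = 8000, 34, 2048   # assumed sizes

img_in = layers.Input(shape=(FEAT_DIM,))
cap_in = layers.Input(shape=(MAX_LEN,))
x = layers.Concatenate()([
    layers.Dropout(0.5)(layers.Dense(256, activation="relu")(img_in)),
    layers.LSTM(256, kernel_initializer="glorot_uniform")(
        layers.Embedding(VOCAB_SIZE, 256)(cap_in)),
])
out = layers.Dense(VOCAB_SIZE, activation="softmax")(layers.Dropout(0.5)(x))
model = Model([img_in, cap_in], out)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",
)

# Placeholder training data standing in for precomputed image features,
# tokenized caption prefixes, and next-word targets.
n = 256
X_img = np.random.rand(n, FEAT_DIM).astype("float32")
X_cap = np.random.randint(1, VOCAB_SIZE, size=(n, MAX_LEN))
y_next = np.random.randint(1, VOCAB_SIZE, size=(n,))

model.fit([X_img, X_cap], y_next, batch_size=32, epochs=10)
```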
B. Experiment results

Tables I and II present the experimental results of the various architectures for image captioning. Among these architectures, VGBA demonstrates exceptional performance in terms of both speed and accuracy. With a relatively short training time of 826.8 s per epoch, this architecture achieves the best scores in terms of BLEU-2 and BLEU-3 (Table I). The inclusion of Bahdanau attention yields a significant enhancement in the output, as it allows the model to comprehend the image context more accurately and consistently.

Table II illustrates these methods in action on sample images. Among them, the RBBA model stands out, as it describes the images with the highest level of detail and accuracy compared to the other methods. It effectively leverages the ResNet50 and BERT models, along with the inclusion of Bahdanau attention, resulting in more precise and comprehensive image captions.

Figure 8 shows the performance of the various methods in generating image captions, as measured by the BLEU scores. The bar chart clearly illustrates the distinct differences in performance among these methods. Notably, the Bahdanau attention mechanism stands out as a particularly effective approach: it has proven highly successful in generating accurate and contextually relevant captions, allowing the model to focus on different regions of the image while generating the corresponding words.

Fig. 8. BLEU-score comparison on the Flickr8K dataset

TABLE I
PERFORMANCE COMPARISON ON FLICKR8K DATASET

Method | Parameters | Training time (per epoch) | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4
Inject Xception – Word2Vec | 5,002,649 | 807.6 s | 0.364886 | 0.190213 | 0.121723 | 0.046814
Merge Xception – Word2Vec | 5,002,649 | 732.8 s | 0.375879 | 0.196083 | 0.118743 | 0.042923
Inject InceptionResnetV2 – GloVe | 5,512,165 | 3,387.1 s | 0.360748 | 0.201883 | 0.116986 | 0.062958
Merge InceptionResnetV2 – GloVe | 5,315,557 | 2,096.8 s | 0.381128 | 0.221911 | 0.155406 | 0.069800
VGG16 – GRU – Bahdanau attention | 4,872,345 | 826.8 s | 0.461538 | 0.339683 | 0.254815 | 0.039086
Xception – TT LSTM with Bi-LSTM | 8,517,273 | 3,139.8 s | 0.426407 | 0.265075 | 0.191381 | 0.089676
ResNet50 – BERT – Bahdanau attention | 119,818,810 | 7,488.0 s | 0.532143 | 0.227003 | 0.175572 | 0.126316

TABLE II
CAPTION COMPARISON RESULTS OF METHODS FOR IMAGES

Test image 1 (girl playing in a water fountain)
Reference captions:
1. girl with a black swimsuit plays in the sprinkler
2. young girl is playing in fountain of water
3. young girl plays in fountain water
4. little girl crouches to splash fountain water
5. young girl in a bathing suit playing with water shooting out of the ground
Predicted captions:
Xception – Word2Vec (Inject): young boy is playing in the water
Xception – Word2Vec (Merge): young girl is playing in the water
InceptionResnetV2 – GloVe (Inject): young girl in pink shirt is playing into the water
InceptionResnetV2 – GloVe (Merge): girl in bathtub spits water from water fountain
VGG16 – GRU – Bahdanau attention: young girl playing in fountain water
Xception – TT LSTM with Bi-LSTM: boy in blue shirt is playing in the water
ResNet50 – BERT – Bahdanau attention: the little girl wearing the swimsuit is playing into the water fountain

Test image 2 (black dog catching a ball)
Reference captions:
1. black dog jumps up to catch white ball
2. dog catches ball in the air
3. dog catches toy outside of brick house
4. dog leaps to catch ball
5. the big black dog is jumping up in the air to catch ball
Predicted captions:
Xception – Word2Vec (Inject): dog is running through the grass
Xception – Word2Vec (Merge): dog is running through the grass
InceptionResnetV2 – GloVe (Inject): dog is jumping over hurdler
InceptionResnetV2 – GloVe (Merge): black dog is running through the grass
VGG16 – GRU – Bahdanau attention: black dog jumps up to catch the ball
Xception – TT LSTM with Bi-LSTM: black dog jumps to catch ball
ResNet50 – BERT – Bahdanau attention: dog catch ball outside

Test image 3 (boy playing basketball)
Reference captions:
1. boy plays basketball
2. boy wearing blue shorts is bouncing basketball in front of the net
3. little boy plays with basketball and toy basketball hoop
4. little boy playing basketball in the grass
5. the child in the blue shorts dribbled the basketball
Predicted captions:
Xception – Word2Vec (Inject): two men are playing in the grass
Xception – Word2Vec (Merge): two men are playing in the grass
InceptionResnetV2 – GloVe (Inject): the man in the red shirt is playing the basketball
InceptionResnetV2 – GloVe (Merge): the basketball player in the red strip is trying to get the ball
VGG16 – GRU – Bahdanau attention: boy is holding basketball
Xception – TT LSTM with Bi-LSTM: the boy is playing basketball
ResNet50 – BERT – Bahdanau attention: little boy wearing blue short is playing basketball in the grass

V. CONCLUSIONS

In conclusion, the comparison of these methods sheds light on the advantages and trade-offs inherent in different combinations of models and attention mechanisms. The RBBA approach excels in generating accurate and descriptive captions, which is demonstrated by its impressive BLEU-1 and BLEU-4 scores, as well as the meaningfulness of the captions produced for sample images. On the other hand, although the VGBA approach falls short of the RBBA approach in terms of BLEU-1 and BLEU-4 scores, it outshines the leading model when it comes to BLEU-2 and BLEU-3 scores. In terms of speed, the VGBA model is also significantly faster than the RBBA model: VGBA trains an epoch in only 826.8 seconds, whereas the RBBA model takes up to 7,488 seconds.

REFERENCES

[1] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.

[3] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[4] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[6] M. Tanti, A. Gatt, and K. P. Camilleri, "Where to put the image in an image caption generator," Natural Language Engineering, vol. 24, no. 3, pp. 467–489, 2018.
[7] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258.
[8] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[9] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017.
[10] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[11] S. T. Dumais, "Latent semantic analysis," Annual Review of Information Science and Technology (ARIST), vol. 38, pp. 189–230, 2004.
[12] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in International Conference on Machine Learning. PMLR, 2015, pp. 2048–2057.
[13] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[14] P. P. Khaing et al., "Two-tier LSTM model for image caption generation," International Journal of Intelligent Engineering & Systems, vol. 14, no. 4, 2021.
[15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[16] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.

