A Study On Attention-Based Deep Learning Architecture Model For Image Captioning
Dhomas Hatta Fudholi, Umar Abdul Aziz Al-Faruq, Royan Abida N. Nayoan, Annisa Zahra
Department of Informatics, Universitas Islam Indonesia, Yogyakarta, Indonesia
Corresponding Author:
Dhomas Hatta Fudholi
Department of Informatics, Universitas Islam Indonesia
Kaliurang St. Km. 14.5 Yogyakarta 55584, Indonesia
Email: [email protected]
1. INTRODUCTION
Image captioning is the task of describing the content of an image in the form of sentences [1]. This
task requires methods from two fields of artificial intelligence: computer vision to understand the
content of a given image, and natural language processing (NLP) to express that content in
sentence form. Due to the advancement of deep learning models in these two fields, image captioning has
received a lot of attention in recent years. On the computer vision side, improvements in convolutional
neural network (CNN) architectures and object detection have contributed to the improvement of image
captioning systems. On the NLP side, more advanced sequential models, such as attention-based recurrent
networks, produce more accurate text generation.
Most successful image captioning models use an encoder-decoder approach inspired by the sequence-to-
sequence model for machine translation. This framework combines a CNN with a recurrent neural network (RNN):
the CNN acts as an image encoder that extracts region-based visual features from the input image, and the RNN
acts as a caption decoder that generates the sentence [2]. With the development of machine translation, a new
architecture emerged, namely attention, which has become a state-of-the-art approach in NLP and
has shaped modern image captioning models. The transformer [3] is an attention-based architecture that is widely adopted
in the field.
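To make the encoder-decoder framework concrete, the following is a minimal PyTorch sketch of a CNN encoder feeding an LSTM decoder. It is an illustrative toy of our own (the layer sizes, the toy convolutional stack, and the vocabulary size are assumptions), not the implementation of any surveyed paper; in practice the encoder is usually a pretrained CNN such as ResNet.

```python
# Minimal PyTorch sketch of the CNN encoder + RNN decoder captioning framework.
# Illustrative only: the toy CNN and all sizes are assumptions, not a cited model.
import torch
import torch.nn as nn


class CNNEncoder(nn.Module):
    """Encodes an image into a single feature vector (toy stand-in for a pretrained CNN)."""
    def __init__(self, feature_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feature_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # global average pooling -> (B, feature_dim, 1, 1)
        )

    def forward(self, images):                  # images: (B, 3, H, W)
        return self.conv(images).flatten(1)     # (B, feature_dim)


class RNNDecoder(nn.Module):
    """Generates a caption conditioned on the image feature via an LSTM."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feature_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.init_h = nn.Linear(feature_dim, hidden_dim)
        self.init_c = nn.Linear(feature_dim, hidden_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, captions):      # captions: (B, T) token ids
        h0 = self.init_h(features).unsqueeze(0) # image feature initializes the LSTM state
        c0 = self.init_c(features).unsqueeze(0)
        hidden, _ = self.lstm(self.embed(captions), (h0, c0))
        return self.out(hidden)                 # (B, T, vocab_size) next-token logits


# Example forward pass with random data.
encoder, decoder = CNNEncoder(), RNNDecoder(vocab_size=10000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10000, (2, 12))
print(decoder(encoder(images), captions).shape)  # torch.Size([2, 12, 10000])
```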
This study aims to identify state-of-the-art methods in image captioning, especially attention-based
architectures, including the transformer. We also take Indonesian, one of the low-resource languages, as a case
study to give a picture of how mature the development of image captioning models is in such a language. We analyze
the models and architectures used, as well as the results reported in each study. Additionally, we develop
a general concept that indicates what needs to be done and what techniques and data are typically used in image
captioning research. The results of this study are expected to serve as a reference for image captioning methods
in future research and to provide recommendations for the best methods for image captioning.
2. METHOD
2.1. Research flow
The systematic literature review presented in this paper has two main steps in its methodology. The
first step is to formulate and conduct a search for related literature. The second step is to formulate, conduct,
and discuss the literature analysis from several points of view. Details of these methodological steps are
explained in the next section.
3.1.2. Transformer
The transformer is an encoder-decoder architecture built on an attention mechanism. The
captioning transformer (CT) study [18] was developed to address a limitation of image captioning models that are
often built with an LSTM decoder: although the LSTM is good at modeling sequences, its step-by-step
recurrence prevents parallel computation over time. CT only has attention modules without a time
dependency, so the model not only captures sequence dependencies but can also be trained in parallel.
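The parallelism argument can be seen in a minimal sketch: a transformer decoder attends over all caption positions at once under a causal mask, so training does not require stepping through time as an LSTM does. The PyTorch sketch below uses assumed sizes and the generic nn.TransformerDecoder; it is an illustration of the idea, not the CT model from [18].

```python
# Minimal PyTorch sketch of a transformer caption decoder over image region features.
# Sizes and the use of nn.TransformerDecoder are our own assumptions for illustration;
# the point is that all caption positions are processed in one parallel pass under a
# causal mask, unlike the step-by-step recurrence of an LSTM decoder.
import torch
import torch.nn as nn


class TransformerCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, region_feats, captions):
        # region_feats: (B, R, d_model) visual features from an image encoder
        # captions:     (B, T) token ids of the shifted target caption
        T = captions.size(1)
        causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.decoder(self.embed(captions), region_feats, tgt_mask=causal_mask)
        return self.out(hidden)  # (B, T, vocab_size): every position predicted in parallel


model = TransformerCaptioner(vocab_size=10000)
regions = torch.randn(2, 36, 256)                  # e.g. 36 detected regions per image
captions = torch.randint(0, 10000, (2, 12))
print(model(regions, captions).shape)              # torch.Size([2, 12, 10000])
```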
Research [1] uses a spatial graph encoding transformer layer, with a modified encoding transformer
arrangement, and an implicit decoding transformer layer that contains a decoder layer and an LSTM layer, to
capture the structure among the semantic units of the image and among the words of a sentence. Research [4]
improves image encoding and text prediction by using a meshed transformer with memory to obtain low- and high-
level features, so that it can describe images that do not even appear in the training data. The multimodal transformer [2] is
composed of an image encoder and a text decoder that simultaneously capture intra- and inter-modal interactions,
such as word-to-word and object-to-word relationships, in attention blocks, producing accurate captions.
The boosted transformer [9] utilizes semantic concepts (CGA) and visual features (VGA) to enhance the
generated image descriptions. Personality-captions [13] uses TransResNet and a dataset that supports
personality differentiation to produce image descriptions that are closer to human ones. The Conceptual Captions dataset [14]
was also developed, using Inception-ResNetv2 as a feature extractor and a transformer to perform image
captioning. Another study with a transformer encoder and decoder, the object spatial relationship model [10],
explicitly incorporates spatial relationship information between detected input objects through geometric
attention. The entangled transformer [11] was developed to exploit both semantic and spatial information from
images.
Since various transformer-based models have achieved promising success on the image captioning
task [31], recent research still widely uses them. The dual-level collaborative transformer [26] was proposed to
make region and grid features complement each other for image captioning by applying intra-level fusion via comprehensive
relation attention (CRA) and dual-way self-attention (DWSA). The global enhanced transformer (GET) [27] makes
it possible to obtain a more comprehensive global representation, which guides the decoder in creating a high-
quality caption. The caption transformer (CPTR) [28], as a full transformer model, is capable of modeling global
context information in every encoder layer. A transformer-based semi-autoregressive model for
image captioning, which keeps the autoregressive property globally and the non-autoregressive property locally,
tackles the heavy inference latency caused by adopting autoregressive decoders [29]. The spatial-
and scale-aware transformer (S2 transformer) [30] explores both low-level and high-level encoded features
simultaneously in a scale-wise reinforcement module and learns pseudo regions by learning clusters in a
spatial-aware pseudo-supervised module. The relational transformer (ReFormer) [31] was proposed to improve
the quality of image captions by generating features with relation information embedded, as well as by
explicitly expressing pair-wise relationships between images and their objects. Research [32] used a
transformer-based architecture called the attention-reinforced transformer to overcome the problem of cross-
entropy training limiting diversity in image captioning.
For research with Indonesian datasets, we only found two papers that use a transformer in their study,
[36] and [37]. The results of research [37] showed that the implementation of the transformer architecture
significantly exceeded the results of existing Indonesian image captioning research. In addition, using an
EfficientNet model obtained better results than InceptionV3. Research [36] takes a different approach, using the
ResNet family as the base for visual feature extraction.
3.2.2. MS COCO
MS COCO [38] is a large dataset from Microsoft that covers object detection, segmentation, and
captioning. It contains 328,000 images with a total of 2,500,000 labeled instances across 91 object
categories, 80 of which are annotated with instance labels. The dataset is divided into training, validation, and testing data, as
in Table 3, using Karpathy's splits [39].
The image caption annotations in MS COCO consist of two collections. The first, MS
COCO c5, has five reference captions for each image in the MS COCO training, validation, and testing sets.
The second, MS COCO c40, has 40 reference captions for 5,000 images randomly selected from the MS
COCO testing set. MS COCO c40 was created because many automatic evaluation metrics achieve higher
correlation with human judgments when given more references [40].
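As a practical note, Karpathy's splits [39] are usually distributed as a single JSON file, and the sketch below shows how the training, validation, and testing subsets could be grouped from it. The file name and the field names ("images", "split", "sentences", "raw", "filename") are assumptions about that commonly shared file and may need adjusting to the copy actually used.

```python
# Sketch of grouping MS COCO captions by Karpathy's splits. It assumes the widely shared
# dataset_coco.json file from [39]; the field names used here are assumptions about that
# file, so adjust them to match the file you actually have.
import json
from collections import defaultdict

with open("dataset_coco.json") as f:            # hypothetical local path
    data = json.load(f)

splits = defaultdict(list)
for img in data["images"]:
    captions = [s["raw"] for s in img["sentences"]]        # usually 5 captions per image
    # Karpathy's "restval" portion is commonly merged into the training split.
    split = "train" if img["split"] == "restval" else img["split"]
    splits[split].append({"file": img["filename"], "captions": captions})

for name in ("train", "val", "test"):
    print(name, len(splits[name]), "images")
```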
b(C, S) = \begin{cases} 1, & \text{if } l_C > l_S \\ e^{1 - l_S / l_C}, & \text{if } l_C \le l_S \end{cases}   (2)
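Interpreting (2) as the corpus-level brevity penalty used by BLEU-n, where l_C is the total candidate length and l_S the effective reference length, it translates to a one-line rule; the sketch below is a direct Python rendering of the formula, with example lengths chosen only for illustration.

```python
import math

def brevity_penalty(l_c: int, l_s: int) -> float:
    """Brevity penalty b(C, S) of (2): no penalty when the candidate corpus is longer
    than the effective reference length, otherwise an exponential length penalty."""
    return 1.0 if l_c > l_s else math.exp(1.0 - l_s / l_c)

print(brevity_penalty(12, 10))  # 1.0
print(brevity_penalty(8, 10))   # ~0.78, short candidates are penalized
```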
Metric for Evaluation of Translation with Explicit Ordering (METEOR) evaluates the candidate text
based on overlapping unigrams between the candidate and reference texts, where unigrams are matched on their
exact, stemmed, and meaning (synonym) forms [44]. When calculating the alignment between the words in the
candidate and reference sentences, the number of contiguous and identically ordered chunks of tokens in the
sentence pair (ch) is minimized. The evaluation is conducted using the default parameters γ, α, and θ. Based
on the resulting set of alignments (m), the METEOR score is derived from the harmonic mean of the precision (P_m) and recall
(R_m) between the candidate and the reference with the best score [40], see (4)-(8).
\text{Pen} = \gamma \left( \frac{ch}{m} \right)^{\theta}   (4)

F_{mean} = \frac{P_m R_m}{\alpha P_m + (1 - \alpha) R_m}   (5)

P_m = \frac{|m|}{\sum_k h_k(c_i)}   (6)

R_m = \frac{|m|}{\sum_k h_k(s_{ij})}   (7)

R_l = \max_j \frac{l(c_i, s_{ij})}{|s_{ij}|}   (9)

P_l = \max_j \frac{l(c_i, s_{ij})}{|c_i|}   (10)

g_k(s_{ij}) = \frac{h_k(s_{ij})}{\sum_{\omega_l \in \Omega} h_l(s_{ij})} \log \left( \frac{|I|}{\sum_{I_p \in I} \min\left(1, \sum_q h_k(s_{pq})\right)} \right)   (12)

\text{CIDEr}(c_i, S_i) = \sum_{n=1}^{N} w_n \, \text{CIDEr}_n(c_i, S_i)   (14)
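For illustration, the sketch below translates the METEOR components in (4)-(7) and the ROUGE-L recall and precision in (9)-(10) directly into Python. The numeric values passed for γ, α, and θ, and the example sentences, are placeholders rather than the defaults used in the surveyed evaluation servers.

```python
# Direct translation of the METEOR components in (4)-(7) and the ROUGE-L recall and
# precision in (9)-(10). Parameter values and example sentences are placeholders.

def meteor_components(matches, candidate_len, reference_len, chunks,
                      alpha=0.9, gamma=0.5, theta=3.0):
    p_m = matches / candidate_len                              # (6) unigram precision
    r_m = matches / reference_len                              # (7) unigram recall
    f_mean = (p_m * r_m) / (alpha * p_m + (1 - alpha) * r_m)   # (5) weighted harmonic mean
    pen = gamma * (chunks / matches) ** theta                  # (4) fragmentation penalty
    return pen, f_mean


def lcs_length(a, b):
    """Length of the longest common subsequence, l(c_i, s_ij) in (9)-(10)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l_pr(candidate, references):
    """R_l and P_l of (9)-(10): the best LCS ratio over the reference set."""
    r_l = max(lcs_length(candidate, s) / len(s) for s in references)
    p_l = max(lcs_length(candidate, s) / len(candidate) for s in references)
    return r_l, p_l


cand = "a dog runs on the beach".split()
refs = ["a dog is running on the beach".split(),
        "a brown dog runs along the sand".split()]
print(rouge_l_pr(cand, refs))
print(meteor_components(matches=5, candidate_len=6, reference_len=7, chunks=2))
```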
Semantic propositional image caption evaluation (SPICE) estimates text quality by converting
candidate and reference texts into semantic representations called “scene graphs” that encode the objects,
attributes, and relationships found in the text [44]. In this evaluation, the first subtask is parsing the captions
into scene graphs. A caption c is parsed into a scene graph, given a set of attribute types A, a set of object
classes C, and a set of relation types R, as in (15) [46]. O(c) ⊆ C is the set of objects mentioned in c, E(c) ⊆
O(c) × R × O(c) is a hyper-edge set that represents the relationships between objects, and K(c) ⊆ O(c) × A
is the set of attributes associated with the objects. The second step is to calculate the F-score. The semantic
relationships in a scene graph are viewed as a conjunction of logical propositions, or tuples, so that two
scene graphs can be compared by how closely they resemble one another. A function T reads a scene graph and returns its logical
tuples, as in (16). Each tuple consists of one, two, or three elements that, respectively, represent objects,
attributes, and relations. The binary matching operator ⊗ is defined as the function that returns the matching tuples
of two scene graphs, treating the semantic propositions in each scene graph as a set of tuples. Finally, we
define the precision P, recall R, and SPICE as in (17)-(19).
P(c, S) = \frac{|T(G(c)) \otimes T(G(S))|}{|T(G(c))|}   (17)

R(c, S) = \frac{|T(G(c)) \otimes T(G(S))|}{|T(G(S))|}   (18)

\text{SPICE}(c, S) = F_1(c, S) = \frac{2 \cdot P(c, S) \cdot R(c, S)}{P(c, S) + R(c, S)}   (19)
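Once the scene graphs have been flattened into tuples, the SPICE score in (17)-(19) reduces to an F1 over matching tuples. The sketch below illustrates that final step with invented example tuples and exact matching only; the real SPICE metric also matches WordNet synonyms, which is omitted here.

```python
# Toy illustration of the SPICE scoring step in (17)-(19): given candidate and reference
# scene graphs already flattened into proposition tuples, the score is an F1 over the
# matching tuples. Example tuples are invented; only exact matching is implemented.

def spice_f1(candidate_tuples, reference_tuples):
    cand, ref = set(candidate_tuples), set(reference_tuples)
    matched = cand & ref                               # binary matching operator (exact only)
    p = len(matched) / len(cand) if cand else 0.0      # (17) precision
    r = len(matched) / len(ref) if ref else 0.0        # (18) recall
    return 2 * p * r / (p + r) if p + r else 0.0       # (19) SPICE = F1


candidate = [("dog",), ("beach",), ("dog", "brown"), ("dog", "run-on", "beach")]
reference = [("dog",), ("beach",), ("dog", "young"), ("dog", "run-on", "beach")]
print(round(spice_f1(candidate, reference), 3))  # 0.75
```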
Of all the metrics used, namely BLEU-n, METEOR, ROUGE-L, CIDEr, and SPICE, we selected four
evaluation metrics that are used in almost all of the literature: BLEU-4, ROUGE-L, CIDEr, and METEOR. Table 4
shows the average scores obtained on these evaluation metrics by the transformer and attention architectures used in
the literature discussed. However, the calculations in Table 4 only include literature that reports c5 and c40 scores.
The table shows that the transformer models obtain a higher score on every evaluation metric than the
vanilla attention mechanism models.
4. CONCLUSION
From this study on attention-based deep learning architecture models for image captioning, 36
works were found. All of the models use attention-based architectures, and half of the works listed in this study
use transformers. Five types of evaluation metrics were used across the works: BLEU, CIDEr, METEOR,
ROUGE-L, and SPICE. BLEU remains the most used evaluation metric for image captioning. From the
analysis, it is known that, on average, the transformer models obtain higher evaluation metric scores on
BLEU-4, CIDEr, ROUGE-L, and METEOR than the works with vanilla attention mechanism models. In the
Indonesian language domain, as one example of a low-resource language, only a few works were found and most
of them still rely on the common MS COCO dataset as the base. However, there is an effort to create a novel
dataset that captures local culture in the captions. Finally, this study provides a foundation for the
future development of image captioning models and presents a general understanding of the process of
developing image captioning systems, including a picture of that process in low-resource languages.
ACKNOWLEDGEMENT
The study is sponsored by the SAME (Scheme for Academic Mobility and Exchange) 2022 program
(Decree No. 3253/E4/DT.04.03/2022) from the Directorate of Resources, Directorate General of Higher Education,
Research and Technology, Ministry of Education, Culture, Research and Technology, Republic of Indonesia.
APPENDIX
REFERENCES
[1] S. He, W. Liao, H. R. Tavakoli, M. Yang, B. Rosenhahn, and N. Pugeault, “Image captioning through image transformer,” Lecture
Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 12625
LNCS, pp. 153–169, 2021, doi: 10.1007/978-3-030-69538-5_10.
[2] J. Yu, J. Li, Z. Yu, and Q. Huang, “Multimodal transformer with multi-view visual representation for image captioning,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 30, no. 12, pp. 4467–4480, 2020, doi: 10.1109/TCSVT.2019.2947482.
[3] A. Vaswani et al., “Attention is all you need,” in Proceedings of the 31st Conference on Neural Information Processing Systems, Dec.
2017, pp. 5998–6008, doi: 10.48550/arXiv.1706.03762.
[4] M. Cornia, M. Stefanini, L. Baraldi, and R. Cucchiara, “Meshed-memory transformer for image captioning,” Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, pp. 10575–10584, 2020, doi: 10.1109/CVPR42600.2020.01059.
[5] Y. Li, T. Yao, Y. Pan, H. Chao, and T. Mei, “Pointing novel objects in image captioning,” Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, vol. 2019-June, pp. 12489–12498, 2019, doi: 10.1109/CVPR.2019.01278.
[6] T. Yao, Y. Pan, Y. Li, and T. Mei, “Hierarchy parsing for image captioning,” Proceedings of the IEEE International Conference on
Computer Vision, vol. 2019-Oct., pp. 2621–2629, 2019, doi: 10.1109/ICCV.2019.00271.
[7] W. Wang, Z. Chen, and H. Hu, “Hierarchical attention network for image captioning,” 33rd AAAI Conference on Artificial Intelligence,
AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019 and the 9th AAAI Symposium on Educational
Advances in Artificial Intelligence, EAAI 2019, pp. 8957–8964, 2019, doi: 10.1609/aaai.v33i01.33018957.
[8] L. Huang, W. Wang, J. Chen, and X. Y. Wei, “Attention on attention for image captioning,” Proceedings of the IEEE International
Conference on Computer Vision, vol. 2019-Oct., pp. 4633–4642, 2019, doi: 10.1109/ICCV.2019.00473.
[9] J. Li, P. Yao, L. Guo, and W. Zhang, “Boosted transformer for image captioning,” Applied Sciences (Switzerland), vol. 9, no. 16,
pp. 1–15, 2019, doi: 10.3390/app9163260.
[10] S. Herdade, A. Kappeler, K. Boakye, and J. Soares, “Image captioning: Transforming objects into words,” Advances in Neural
Information Processing Systems, vol. 32, no. NeurIPS, pp. 1–11, 2019.
[11] G. Li, L. Zhu, P. Liu, and Y. Yang, “Entangled transformer for image captioning,” Proceedings of the IEEE International Conference
on Computer Vision, vol. 2019-Oct., no. C, pp. 8927–8936, 2019, doi: 10.1109/ICCV.2019.00902.
[12] X. Yang, K. Tang, H. Zhang, and J. Cai, “Auto-encoding scene graphs for image captioning,” Proceedings of the IEEE Computer
Society Conference on Computer Vision and Pattern Recognition, vol. 2019-June, pp. 10677–10686, 2019, doi: 10.1109/CVPR.2019.01094.
[13] K. Shuster, S. Humeau, H. Hu, A. Bordes, and J. Weston, “Engaging image captioning via personality,” Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2019-June, pp. 12508–12518, 2019, doi: 10.1109/CVPR.2019.01280.
[14] P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic
image captioning,” ACL 2018-56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference
(Long Papers), vol. 1, pp. 2556–2565, 2018, doi: 10.18653/v1/p18-1238.
[15] W. Jiang, L. Ma, Y. G. Jiang, W. Liu, and T. Zhang, “Recurrent fusion network for image captioning,” Lecture Notes in Computer
Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11206 LNCS,
pp. 510–526, 2018, doi: 10.1007/978-3-030-01216-8_31.
[16] Y. Feng, L. Ma, W. Liu, and J. Luo, “Unsupervised image captioning,” Proceedings of the IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, vol. 2019-June, pp. 4120–4129, 2019, doi: 10.1109/CVPR.2019.00425.
[17] T. Yao, Y. Pan, Y. Li, and T. Mei, “Exploring visual relationship for image captioning,” Lecture Notes in Computer Science (including
subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 11218 LNCS, pp. 711–727, 2018,
doi: 10.1007/978-3-030-01264-9_42.
[18] X. Zhu, L. Li, J. Liu, H. Peng, and X. Niu, “Captioning transformer with stacked attention modules,” Applied Sciences (Switzerland),
vol. 8, no. 5, 2018, doi: 10.3390/app8050739.
[19] P. Anderson et al., “Bottom-up and top-down attention for image captioning and visual question answering,” Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, pp. 6077–6086, 2018, doi: 10.1109/CVPR.2018.00636.
[20] J. Lu, C. Xiong, D. Parikh, and R. Socher, “Knowing when to look: Adaptive attention via a visual sentinel for image captioning,”
Proceedings-30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, vol. 2017-Jan., pp. 3242–3250, 2017,
doi: 10.1109/CVPR.2017.345.
[21] L. Chen et al., “SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning,” Proceedings-30th
IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, vol. 2017-Jan., pp. 6298–6306, 2017, doi: 10.1109/CVPR.2017.667.
[22] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved image captioning via policy gradient optimization of SPIDEr,”
Proceedings of the IEEE International Conference on Computer Vision, vol. 2017-Oct., pp. 873–881, 2017, doi: 10.1109/ICCV.2017.100.
[23] J. Aneja, A. Deshpande, and A. G. Schwing, “Convolutional image captioning,” Proceedings of the IEEE Computer Society Conference
on Computer Vision and Pattern Recognition, pp. 5561–5570, 2018, doi: 10.1109/CVPR.2018.00583.
[24] C. Liu, J. Mao, F. Sha, and A. Yuille, “Attention correctness in neural image captioning,” 31st AAAI Conference on Artificial
Intelligence, AAAI 2017, pp. 4176–4182, 2017, doi: 10.1609/aaai.v31i1.11197.
[25] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei, “Boosting image captioning with attributes,” Proceedings of the IEEE International
Conference on Computer Vision, vol. 2017-Octob, pp. 4904–4912, 2017, doi: 10.1109/ICCV.2017.524.
[26] Y. Luo et al., “Dual-level collaborative transformer for image captioning,” 35th AAAI Conference on Artificial Intelligence, AAAI 2021,
vol. 3B, no. 3, pp. 2286–2293, 2021, doi: 10.1609/aaai.v35i3.16328.
[27] J. Ji et al., “Improving image captioning by leveraging intra- and inter-layer global representation in transformer network,” 35th AAAI
Conference on Artificial Intelligence, AAAI 2021, vol. 2B, pp. 1655–1663, 2021, doi: 10.1609/aaai.v35i2.16258.
[28] W. Liu, S. Chen, L. Guo, X. Zhu, and J. Liu, “CPTR: Full transformer network for image captioning,” arXiv preprint, pp. 1–5, 2021,
doi: 10.48550/arXiv.2101.10804.
[29] Y. Zhou, Y. Zhang, Z. Hu, and M. Wang, “Semi-autoregressive transformer for image captioning,” Proceedings of the IEEE
International Conference on Computer Vision, vol. 2021-Oct., pp. 3132–3136, 2021, doi: 10.1109/ICCVW54120.2021.00350.
[30] P. Zeng, H. Zhang, J. Song, and L. Gao, “S2 transformer for image captioning,” IJCAI International Joint Conference on Artificial
Intelligence, pp. 1608–1614, 2022, doi: 10.24963/ijcai.2022/224.
[31] X. Yang, Y. Liu, and X. Wang, “ReFormer: the relational transformer for image captioning,” in Proceedings of the 30th ACM
International Conference on Multimedia, Oct. 2022, pp. 5398–5406, doi: 10.1145/3503161.3548409.
[32] Z. Wang, S. Shi, Z. Zhai, Y. Wu, and R. Yang, “ArCo: Attention-reinforced transformer with contrastive learning for image captioning,”
Image and Vision Computing, vol. 128, p. 104570, Dec. 2022, doi: 10.1016/j.imavis.2022.104570.
[33] F. Liu, X. Ren, X. Wu, W. Fan, Y. Zou, and X. Sun, “Prophet attention: Predicting attention with future attention for image captioning,”
arXiv preprint, 2022, doi: 10.48550/arXiv.2210.10914.
[34] M. R. S. Mahadi, A. Arifianto, and K. N. Ramadhani, “Adaptive attention generation for Indonesian image captioning,” in 2020 8th
International Conference on Information and Communication Technology (ICoICT), Jun. 2020, pp. 1–6, doi: 10.1109/ICoICT49345.2020.9166244.
[35] D. H. Fudholi et al., “Image captioning with attention for smart local tourism using efficientNet,” IOP Conference Series: Materials
Science and Engineering, vol. 1077, no. 1, p. 012038, 2021, doi: 10.1088/1757-899x/1077/1/012038.
[36] R. Mulyawan, A. Sunyoto, and A. H. Muhammad, “Automatic Indonesian image captioning using CNN and transformer-based model
approach,” ICOIACT 2022-5th International Conference on Information and Communications Technology: A New Way to Make AI
Useful for Everyone in the New Normal Era, Proceeding, pp. 355–360, 2022, doi: 10.1109/ICOIACT55506.2022.9971855.
[37] U. A. A. Al-Faruq and D. H. Fudholi, “EfficientNet-Transformer for image captioning in Bahasa,” Vii International Conference “Safety
Problems of Civil Engineering Critical Infrastructures” (Spceci2021), vol. 2701, p. 020037, 2023, doi: 10.1063/5.0118155.
[38] T.-Y. Lin et al., “Microsoft COCO: common objects in context,” in European Conference on Computer Vision, May 2014,
pp. 740–755, doi: 10.1007/978-3-319-10602-1_48.
[39] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 39, no. 4, pp. 664–676, 2017, doi: 10.1109/TPAMI.2016.2598339.
[40] X. Chen et al., “Microsoft COCO captions: Data collection and evaluation server,” arXiv preprint, 2015, doi: 10.48550/arXiv.1504.00325.
[41] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic
inference over event descriptions,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, Dec. 2014,
doi: 10.1162/tacl_a_00166.
[42] S. He and Y. Lu, “A modularized architecture of multi-branch convolutional neural network for image captioning,” Electronics, vol. 8,
no. 12, p. 1417, Nov. 2019, doi: 10.3390/electronics8121417.
[43] Y. Cui, G. Yang, A. Veit, X. Huang, and S. Belongie, “Learning to evaluate image captioning,” in 2018 IEEE/CVF Conference on
Computer Vision and Pattern Recognition, Jun. 2018, pp. 5804–5812, doi: 10.1109/CVPR.2018.00608.
[44] N. Sharif, L. White, M. Bennamoun, and S. A. A. Shah, “NNEval: Neural network based evaluation metric for image captioning,” in
ECCV 2018: Computer Vision – ECCV 2018, 2018, pp. 39–55, doi: 10.1007/978-3-030-01237-3_3.
[45] A. Cohan and N. Goharian, “Revisiting summarization evaluation for scientific articles,” Proceedings of the 10th International
Conference on Language Resources and Evaluation, LREC 2016, pp. 806–813, 2016, doi: 10.48550/arXiv.1604.00400.
[46] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: Semantic propositional image caption evaluation,” Lecture Notes in
Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9909 LNCS,
pp. 382–398, 2016, doi: 10.1007/978-3-319-46454-1_24.
[47] Karpathy, “GitHub-karpathy/neuraltalk: NeuralTalk is a Python+numpy project for learning multimodal recurrent neural networks that
describe images with sentences,” GitHub, 2015. https://github.com/karpathy/neuraltalk (accessed May 27, 2023).
[48] R. Staniute and D. Šešok, “A systematic literature review on image captioning,” Applied Sciences (Switzerland), vol. 9, no. 10,
pp. 1–20, 2019, doi: 10.3390/app9102024.
[49] J. Yi et al., “MICER: a pre-trained encoder-decoder architecture for molecular image captioning,” Bioinformatics (Oxford, England),
vol. 38, no. 19, pp. 4562–4572, 2022, doi: 10.1093/bioinformatics/btac545.
[50] Tylin, “GitHub-tylin/coco-caption,” GitHub, 2018. https://github.com/tylin/coco-caption (accessed May 27, 2023).
[51] N. Aafaq, A. Mian, W. Liu, S. Z. Gilani, and M. Shah, “Video description: A survey of methods, datasets, and evaluation metrics,”
ACM Computing Surveys, vol. 52, no. 6, pp. 1–37, 2019, doi: 10.1145/3355390.
[52] Y. Graham, “Re-evaluating automatic summarization with BLEU and 192 shades of ROUGE,” Conference Proceedings-EMNLP 2015:
Conference on Empirical Methods in Natural Language Processing, pp. 128–137, 2015, doi: 10.18653/v1/d15-1013.
BIOGRAPHIES OF AUTHORS
Royan Abida Nur Nayoan is a Research and Development Engineer in the field
of Artificial Intelligence, specializing in computer vision and natural language processing.
With a Master's degree in Informatics from Universitas Islam Indonesia, she is currently
focusing on developing new AI applications to enhance company products. She has
contributed to the field of AI through research publications and presentations. With a passion
for AI, she is dedicated to staying up-to-date with the latest advancements in the field. She
can be contacted at email: [email protected].