

Generating Caption From Images Using Flickr Image Dataset

2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT) | 979-8-3503-7024-9/24/$31.00 ©2024 IEEE | DOI: 10.1109/ICCCNT61001.2024.10724963

Piyush Sharma, Department of CSE, Chandigarh University, Mohali, India ([email protected])
Kiranpreet Kaur, Department of CSE, Chandigarh University, Mohali, India ([email protected])
Aaryan Gandotra, Department of CSE, Chandigarh University, Mohali, India ([email protected])
Pranay Patial, Department of CSE, Chandigarh University, Mohali, India ([email protected])
Radhika, Department of CSE, Chandigarh University, Mohali, India ([email protected])
Sehaj Gaba, Department of CSE, Chandigarh University, Mohali, India ([email protected])

Abstract— In an age where the Internet is dominated by visual content, the automatic generation of image captions has become a necessity, and it has long been an interesting area of study for researchers in artificial intelligence. Enabling a machine to describe images with accuracy approaching that of a human has important applications in fields such as robotic vision, manufacturing, and beyond. This project integrates recurrent neural networks for language generation with convolutional feature extraction from images, combining Natural Language Processing and Computer Vision, and provides insights into this interdisciplinary topic. Additionally, annotations for the sample images are created, and a comparative analysis of different feature-extraction and encoder configurations is performed to determine which model provides the highest accuracy and delivers the desired results.

Keywords—Caption from an image, caption of an image, encoder-decoder, text output from an image

I. INTRODUCTION

Understanding and describing visual content is an essential human skill. Nonetheless, machines still have a difficult time deriving meaning from visuals. This gap restricts the potential of several applications and makes visual content harder to access for people who are visually impaired. One potential answer is image captioning: the automatic creation of natural language descriptions for photographs.

Providing sufficient information about images in an age of ever-increasing digital data generation has emerged as a central challenge, stimulating a great deal of research interest in computer vision and artificial intelligence. Think of the system as a "picture-theme generator": the name itself indicates the aim of building systems that generate logically and syntactically realistic captions. Researchers have actively pursued effective methods to improve prediction quality, steadily widening the range of workable options. The impact of visual content on the Internet is particularly evident in social networking and e-commerce, which has spurred growing demand for, and interest in, automated image description. This study explores the intersection of natural language processing and computer vision and offers a method for generating text from images through deep learning, using deep neural networks and machine learning to build the model. Describing images in text goes beyond a technical exercise; it reflects the human capacity to interpret, clarify, and contextualize visual objects, and expresses the desire to imbue machines with a sense of meaning and narrative.

This goes beyond the usual limits of machine learning and requires a seamless integration of visual perception, linguistics, and subtle representational features. The process consists of two steps: feature extraction from images using convolutional neural networks (CNNs), followed by natural language sentence generation using recurrent neural networks (RNNs). Rather than merely identifying objects in an image, the approach performs richer feature extraction, capturing even subtle differences between similar images.

For this purpose, VGG-16 (Visual Geometry Group), a model with 16 convolutional layers designed for object recognition, is used. In the second step, the extracted features are trained together with the captions provided in the dataset.

The prediction incorporates four essential stages: (1) object detection and recognition, identifying and categorizing items within the given context; (2) attribute prediction, anticipating specific attributes associated with the detected items; (3) scene classification, classifying the overall scene or surroundings based on the identified objects; and (4) description generation, creating a descriptive narrative of the scene by combining the information from the preceding stages.

This study explores state-of-the-art deep learning-based image captioning techniques. It analyzes the prevailing encoder-decoder architecture, in which Convolutional Neural Networks (CNNs) extract image features and Long Short-Term Memory (LSTM) networks convert those features into natural language captions. It then turns to the development of multimodal feature fusion and attention mechanisms, which are advancing the field toward more precise and insightful image descriptions.
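To make the first step concrete, the short sketch below shows one common way to extract fixed VGG-16 features with Keras. It is a minimal illustration under the assumption of a TensorFlow/Keras workflow; the paper does not publish its code, and the image path here is a placeholder.

import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

# Load VGG-16 pretrained on ImageNet and keep the 4096-dimensional
# activations of the "fc2" layer as the image representation.
base = VGG16(weights="imagenet")
feature_extractor = tf.keras.Model(inputs=base.input,
                                   outputs=base.get_layer("fc2").output)

def extract_features(image_path):
    # Resize to the 224x224 input expected by VGG-16 and apply its
    # standard preprocessing (channel-wise mean subtraction).
    img = tf.keras.utils.load_img(image_path, target_size=(224, 224))
    x = tf.keras.utils.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return feature_extractor.predict(x, verbose=0)[0]   # shape: (4096,)

features = extract_features("example.jpg")   # placeholder image path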
II. Literature Review

Earlier efforts to address this problem employed template-based approaches, which used image classification techniques to categorize objects into predefined classes; the detected objects were then inserted into a standard template sentence. Contemporary work, however, has shifted towards Recurrent Neural Networks (RNNs) as the primary focus of research. RNNs have gained significant popularity in various Natural Language Processing (NLP) tasks, notably machine translation, where they excel at generating sequences of words. Extending this capability, image caption generators leverage RNNs to produce descriptions for images by generating words sequentially, thereby associating textual descriptions with visual content.

In recent years, there has also been growing interest in text-to-image production, namely the creation of images from text annotations. A number of methodological and conceptual approaches, including generative adversarial networks (GANs), have been developed, and researchers have examined controllability and semantic linkages in text-image production. Li et al., 2019 [3] created an image-referencing method that focuses on the controllability of generated images, while Yin et al., 2019 [1] concentrated on semantic splitting to improve text-to-image generation by separating different semantic aspects in the input for precise image integration.

When performing activities like labeling images and answering visual questions, both bottom-up and top-down attention strategies are crucial. According to Sussman et al., 2016 [15], top-down responses are driven by cognitive strategies, whereas bottom-up responses are linked to salient visual objects that automatically capture attention; the two involve different processes within the visual cortex. A model for image captioning and visual question answering that combines top-down and bottom-up attention, using Faster R-CNN-like mechanisms for bottom-up attention, has also been proposed (Anderson et al., 2018 [4]; Wang et al., 2020 [5]). Using several reasoning phases and fine-grained analysis, this approach enables a more thorough study of images (Anderson et al., 2018 [4]).

Eye-tracking research has shown that this method can be used to measure both top-down and bottom-up attention, offering insights into how people respond to visual stimuli (Boardman et al., 2021 [14]). Furthermore, research on how thinking styles and visual attention processes affect aesthetic choices has shown the importance of bottom-up visual attention, especially when images are involved (Chen et al., 2023 [17]). It has also been proposed that integrating top-down and bottom-up attention can improve tasks such as visual question answering, in which attention is focused on pertinent image regions according to language characteristics associated with the query (Yang et al., 2021 [6]).


Image captioning has evolved tremendously since Transformer-based architectures came into use. Building on the success of Transformers in natural language processing, these systems have shown higher performance on image captioning tasks (Qiu et al., 2021 [19]). To adapt the classic Transformer structure to image captioning, researchers have made a number of changes to it, including layer normalization, embedding layers, and the removal of residual connections (Yang et al., 2020 [7]). Because of its ability to handle long-term dependencies, the Transformer architecture has become the most widely used framework for image captioning (Li et al., 2021 [12]).

According to Zhang et al., 2023 [16], recent research has focused on improving image captioning models by integrating Transformer models with attention mechanisms. Researchers have also created image captioning models that can recognize particular items within photos by fusing Transformers with other methods such as facial recognition (Wang et al., 2022 [20]). The effectiveness of Transformer models in producing captions for images is demonstrated by the now-standard integration of Transformers with encoder-decoder architectures in image captioning (Wang et al., 2021 [8]).

A number of methods, including self-locating mechanisms and group sparse embedding, have also improved image captioning. To improve collaborative captioning, Chen et al., 2020 [13] presented a method called GroupCap that focuses on structural relevance and variety among group photos, jointly modeling these elements to maximize captioning efficiency. Moreover, Xie et al., 2019 [9] presented a framework for image compressive sensing recovery that uses group sparse representation modeling to jointly enforce image sparsity and self-similarity; by taking local sparsity and self-similarity into account in an adaptive group domain, this method guarantees reliable image reconstruction.

Another line of research builds adaptive attention models that employ visual sentinels to decide where and when to attend during the captioning process. An adaptive attention model with a visual sentinel that selects which area of a picture to focus on in order to extract significant features for sequential caption generation was presented by Lu et al., 2017 [10]. Using networks referred to as visual sentinels, this method picks out specific areas of the image, and in the captioning process non-visual words can be aligned using the adaptive attention model with a visual sentinel (Zhang et al., 2021 [11]).

The generation of textual descriptions from scene graphs is explored by Choi et al., 2022 [18]. The work focuses on turning the visual information represented in scene graphs into language that is both coherent and descriptive. It advances the area by investigating the creation of textual descriptions that faithfully capture the relationships and content shown in scene graphs: translating structured visual data from scene graphs into natural language text enables machines to explain complicated visual situations in a way that is understandable to humans. By exploiting the information contained in scene graphs, such a model can produce comprehensive textual descriptions that convey the key components and interrelationships present in the depicted scenes.


Table 1 below summarizes these and related studies.

Li et al. [3], 2019. Dataset: medical images paired with corresponding reports. Objectives: a Knowledge-driven Encode, Retrieve, Paraphrase (KERP) strategy for medical imaging report production, aimed at improving report-generation accuracy and efficiency. Outcomes: a Knowledge-driven Encode, Retrieve, Paraphrase process that combined different methods for producing medical image reports.

Yang et al. [6], 2021. Dataset: medical images with properly captioned reports. Objectives: apply an adaptive multimodal attention method to improve the precision and productivity of ultrasound image report generation. Outcomes: improvements in the use of multimodal attention processes to generate reports based on ultrasound images.

Zhang et al. [11], 2021. Dataset: images with improper captions attached. Objectives: emphasize the problem of caption hallucinations caused by erroneous or biased visual-textual correlations acquired from datasets. Outcomes: clarifies the vital role of dataset quality and how it affects the process, especially for grounded image captioning.

Wang et al. [20], 2022. Dataset: images, each with human-generated captions. Objectives: create an end-to-end Transformer-based approach called PureT for image captioning. Outcomes: PureT facilitates end-to-end training and eliminates the requirement for pretraining the object detection component, both of which boost efficiency.

Chen et al. [17], 2023. Dataset: observed eye movements and fixations of the subjects. Objectives: examine, using eye-tracking methodology, the impact of visual attention processes and cognitive styles on environmental aesthetic preferences. Outcomes: findings suggest that individual differences in visual attention processes and thinking styles influence preferences for environmental aesthetics.

Table 1: Literature Review
III. Methodology

The steps involved in developing an image caption generator are: obtaining a dataset of captioned images; preprocessing the images and extracting features with CNNs; tokenizing the captions; constructing an architecture for the CNN and RNN model; training the model on image-caption pairs using methods such as teacher forcing and a suitable loss function; evaluating the model's performance on a validation set using metrics such as BLEU, METEOR, or CIDEr; post-processing the generated captions to improve their readability; deploying the model in a setting conducive to user interaction; and monitoring and updating the model's performance over time. The process of creating captions for images thus combines computer vision techniques, to comprehend the image's content, with natural language processing techniques, to produce meaningful and coherent descriptions.

Fig. 1. Flow chart of the overall process.

The flow chart for the entire process is shown in Fig. 1. It starts with data collection and preparation, which includes the source photographs and their written captions. Next comes preprocessing, in which unnecessary data is removed from the collection and the remainder is pruned. Once all preprocessing and data mining are complete, a model is selected for training on the dataset.
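As an illustration of the data collection step, the sketch below pairs each image with its reference captions, assuming a caption file with one "image_name.jpg<TAB>caption" entry per line (the common Flickr token-file layout). The file name and format are assumptions for illustration; the paper does not describe its loading code.

from collections import defaultdict

def load_captions(caption_file="captions.txt"):
    """Parse lines of the form 'image_name.jpg<TAB>caption text' (assumed
    format) into a dict mapping each image to its list of captions."""
    captions = defaultdict(list)
    with open(caption_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_id, caption = line.split("\t", 1)
            # Drop a trailing '#0'..'#4' caption index if present.
            image_id = image_id.split("#")[0]
            captions[image_id].append(caption.lower().strip())
    return captions

captions = load_captions()
print(len(captions), "images,",
      sum(len(c) for c in captions.values()), "captions")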


The model is then trained on the dataset repeatedly until its accuracy stabilizes. Finally, an evaluation grid checks and fine-tunes the consistency of the trained model.

Data preparation is a critical first step in training an image captioning model: it guarantees that both the textual and the visual input can be processed and understood by the model effectively. The two crucial preprocessing procedures are examined in more detail below:
a. Image preprocessing
b. Caption preprocessing
Image preprocessing ensures that all of the collection's images are consistent. It involves tasks such as resizing images to a common size (e.g., 224x224 pixels) to increase processing efficiency, and standardizing pixel values to a predefined range (between 0 and 1) to aid training.
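A minimal sketch of these two operations, assuming the Pillow and NumPy libraries (the paper does not specify its preprocessing code). Note that when a pretrained VGG-16 encoder is used, its own preprocess_input routine (channel-mean subtraction) is often applied instead of plain [0, 1] scaling.

import numpy as np
from PIL import Image

def preprocess_image(path, size=(224, 224)):
    # Resize every image to a common size and scale pixel values to [0, 1].
    img = Image.open(path).convert("RGB").resize(size)
    return np.asarray(img, dtype=np.float32) / 255.0    # shape: (224, 224, 3)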
Text captions require preparation as well. Captions are split into smaller pieces, such as words or sub-phrases (tokens), which allows the model to process them sequentially.
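A sketch of caption preprocessing with the Keras Tokenizer. The start/end markers, out-of-vocabulary token, and padding scheme are illustrative conventions assumed here, not details taken from the paper.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

captions = [
    "a dog runs across the grass",
    "two children play football in a park",
]
# Wrap each caption with start/end markers so the decoder can learn
# where a sentence begins and ends.
captions = ["startseq " + c + " endseq" for c in captions]

tokenizer = Tokenizer(oov_token="<unk>")
tokenizer.fit_on_texts(captions)
vocab_size = len(tokenizer.word_index) + 1

sequences = tokenizer.texts_to_sequences(captions)       # lists of word ids
max_len = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_len, padding="post")
print(vocab_size, padded.shape)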
Fig. 2. Encoder-decoder architecture of the image captioning model.

The encoder-decoder architecture is the foundation of the image captioning model (Fig. 2). A Convolutional Neural Network (CNN) serves as the "encoder," or the "visionary," of the model. Using convolutional layers, it examines the preprocessed image and extracts high-level visual information such as objects, shapes, and their relationships; a compressed vector representation of this information is then created.

The "decoder," which is usually an LSTM network, plays the role of the "storyteller." After receiving the encoded image features, it initializes an internal state that helps it keep track of the words that have already been produced. Building the description word by word, the LSTM uses this state at each step to predict the word most likely to appear next in the caption. In other words, the CNN essentially "sees" the image and captures its essence, whereas the LSTM analyzes those features and turns them into a natural language narrative.
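A compact sketch of one widely used way to wire such an encoder and decoder together (a "merge"-style model), assuming precomputed 4096-dimensional VGG-16 features and next-word prediction as the training objective. Layer sizes are illustrative, and the paper's exact architecture, including its attention mechanism, is not reproduced here.

import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size = 8000    # illustrative vocabulary size
max_len = 34         # illustrative maximum caption length

# Encoder branch: project the precomputed VGG-16 feature vector.
img_input = layers.Input(shape=(4096,), name="image_features")
img_dense = layers.Dense(256, activation="relu")(layers.Dropout(0.5)(img_input))

# Decoder branch: embed the partial caption and run it through an LSTM.
seq_input = layers.Input(shape=(max_len,), name="caption_prefix")
seq_embed = layers.Embedding(vocab_size, 256, mask_zero=True)(seq_input)
seq_lstm = layers.LSTM(256)(layers.Dropout(0.5)(seq_embed))

# Merge the visual and language representations and predict the next word.
merged = layers.add([img_dense, seq_lstm])
hidden = layers.Dense(256, activation="relu")(merged)
output = layers.Dense(vocab_size, activation="softmax")(hidden)

model = Model(inputs=[img_input, seq_input], outputs=output)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.summary()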
The training loop is the key component of learning. Here, the dataset's image-caption pairs are repeatedly presented to the model. Using the encoded image features, it makes word-by-word caption predictions, and a loss function measures how much these predictions depart from the actual captions. Through backpropagation and an optimizer, this loss directs the model to adjust its internal parameters (weights and biases). The objective is to reduce the loss as much as possible, bringing the captions generated by the model closer to the ones written by humans. By iterating repeatedly over the full dataset, this loop enables the model to progressively learn the complex relationship between visual information and its associated natural language description.
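The sketch below illustrates how one training example is expanded under teacher forcing: a caption of length T yields T-1 pairs, each consisting of the image features plus the first k words as input and word k+1 as the target. It assumes the model and integer-encoded captions sketched above and uses toy values; it is not the authors' training code.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def make_training_pairs(feature_vec, seq, max_len):
    """Expand one encoded caption into (image, prefix) -> next-word pairs."""
    X_img, X_seq, y = [], [], []
    for i in range(1, len(seq)):
        prefix = pad_sequences([seq[:i]], maxlen=max_len, padding="post")[0]
        X_img.append(feature_vec)
        X_seq.append(prefix)
        y.append(seq[i])                  # integer id of the next word
    return np.array(X_img), np.array(X_seq), np.array(y)

# Toy usage with random "features" and a short encoded caption.
feat = np.random.rand(4096).astype("float32")
seq = [1, 5, 9, 23, 2]                    # e.g. startseq ... endseq ids
Xi, Xs, y = make_training_pairs(feat, seq, max_len=34)
# model.fit([Xi, Xs], y, epochs=1)        # with the model defined earlier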


Fig. 3. Validation loss decreasing over the course of training.

While enhancing the model's caption generation skills is the main goal of the training procedure, other phases are also essential for best results. The same model was trained on the dataset multiple times, with feedback incorporated between runs, and the resulting gains showed in its performance. Fig. 3 is an exponentially decreasing curve showing how the validation loss falls as training of the model progresses.

Evaluation measures, which frequently use metrics such as the BLEU score, help assess the quality of the output captions. A popular method for fine-tuning models is grid search, which examines several hyperparameter configurations (the settings that regulate the training process) to find the one that performs best. Lastly, once the model has been trained and tuned, it is saved for later use on new photos, enabling it to demonstrate its image captioning capabilities.
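An illustrative grid search over two hyperparameters of a merge-style model like the one sketched earlier. The grid values are arbitrary, the helper below is an assumption rather than the paper's code, and the random arrays stand in for real precomputed features and encoded captions.

import itertools
import numpy as np
from tensorflow.keras import layers, Model

def build_model(embed_dim, lstm_units, vocab_size=8000, max_len=34):
    # Merge-style encoder-decoder with tunable embedding and LSTM sizes.
    fi = layers.Input(shape=(4096,))
    si = layers.Input(shape=(max_len,))
    x1 = layers.Dense(lstm_units, activation="relu")(fi)
    x2 = layers.LSTM(lstm_units)(
        layers.Embedding(vocab_size, embed_dim, mask_zero=True)(si))
    out = layers.Dense(vocab_size, activation="softmax")(layers.add([x1, x2]))
    m = Model([fi, si], out)
    m.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
    return m

# Dummy stand-ins for precomputed features, caption prefixes and targets.
Xi = np.random.rand(64, 4096).astype("float32")
Xs = np.random.randint(1, 8000, size=(64, 34))
y = np.random.randint(1, 8000, size=(64,))

grid = {"embed_dim": [128, 256], "lstm_units": [256, 512]}
best = None
for embed_dim, lstm_units in itertools.product(grid["embed_dim"], grid["lstm_units"]):
    model = build_model(embed_dim, lstm_units)
    hist = model.fit([Xi, Xs], y, validation_split=0.25, epochs=2, verbose=0)
    val_loss = min(hist.history["val_loss"])
    if best is None or val_loss < best[0]:
        best = (val_loss, {"embed_dim": embed_dim, "lstm_units": lstm_units})
print("best configuration:", best[1])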
IV. RESULTS

The model is tested on the Flickr image dataset. This dataset contains 31,783 photos, each with five human-written captions. BLEU, CIDEr, ROUGE and METEOR are the primary evaluation measures used.

Fig. 4. Evaluation scores on the test set.

The BLEU score measures the similarity of a generated caption to the reference captions, whereas CIDEr takes into account both relevance and the usage of n-grams (word sequences). ROUGE evaluates the overlap of n-grams between the generated and reference captions and also accounts for recall, precision, and F1-score. METEOR provides a more nuanced evaluation of translation and captioning quality by incorporating flexible matching and synonyms.
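As an illustration of how such scores can be computed, the sketch below uses NLTK's corpus_bleu for BLEU on toy captions; CIDEr, ROUGE, and METEOR are usually computed with dedicated packages such as the COCO caption evaluation toolkit and are not shown. The candidate and reference sentences here are made-up examples, not the paper's outputs.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One generated caption per image, with its five reference captions.
candidates = [
    "a dog is running through the grass".split(),
]
references = [
    [
        "a dog runs across the grass".split(),
        "a brown dog is running in a field".split(),
        "the dog races through green grass".split(),
        "a dog running outside".split(),
        "a dog sprints over the lawn".split(),
    ],
]

# Corpus-level BLEU-4, smoothed so short sentences do not score zero.
score = corpus_bleu(references, candidates,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")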
On the test set, this model achieved an average BLEU score of 0.62, a CIDEr score of 1.05, a ROUGE score of 0.70 and a METEOR score of 0.46 (Fig. 4). In comparison to a baseline encoder-decoder with attention model, which scored BLEU 0.58, CIDEr 0.98, ROUGE 0.65 and METEOR 0.44, this model demonstrated a noteworthy enhancement (Fig. 4).

Additionally, an ablation study was carried out in which several parts of this model were removed in turn to examine their distinct roles. It demonstrated that, during caption generation, the attention mechanism was essential in directing focus towards particular areas of the images, resulting in more precise depictions of the objects and their interactions.

V. CONCLUSION

When the proposed image captioning model was applied to the Flickr image dataset, it performed well in terms of both BLEU and CIDEr scores. The ablation study demonstrates how well the attention mechanism focuses on relevant areas of the image, and the model does a good job of producing related captions. Future work will concentrate on improving the model's capacity to handle complex scenes and to produce a wider variety of distinctive captions; this may involve incorporating more sophisticated language generation algorithms or object relationship identification. Overall, this model offers a strong tool for automatic image description with potential for further refinement and creative growth, and it represents a significant step forward in image captioning.

VI. REFERENCES

[1] Yin C, Qian B, Wei J et al (2019) Automatic generation of medical imaging diagnostic report with hierarchical recurrent neural network. In: 2019 19th IEEE International Conference on Data Mining (ICDM 2019). DOI: 10.1007/s10462-022-10270-w
[2] Vinyals O, Toshev A, Bengio S et al (2015) Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. DOI: 10.1109/CVPR.2015.7298935


[3] Li C, Liang X, Hu Z et al (2019) Knowledge-driven encode, retrieve, paraphrase for medical image report generation. In: Thirty-Third AAAI Conference on Artificial Intelligence / Thirty-First Innovative Applications of Artificial Intelligence Conference / Ninth AAAI Symposium on Educational Advances in Artificial Intelligence. DOI: 10.1609/aaai.v33i01.33016666
[4] Anderson P, He X, Buehler C et al (2018) Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. DOI: 10.1109/CVPR.2018.00636
[5] Wang F, Liang X, Xu L, Lin L (2020) Unifying relational sentence generation and retrieval for medical image report composition. IEEE Transactions on Cybernetics. DOI: 10.1109/TCYB.2020.3026098
[6] Yang S, Niu J, Wu J et al (2021) Automatic ultrasound image report generation with adaptive multimodal attention mechanism. Neurocomputing. DOI: 10.1016/j.neucom.2020.09.084
[7] Yang S, Niu J, Wu J et al (2020) Automatic medical image report generation with multi-view and multi-modal attention mechanism. In: 20th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2020), vol. 12454. DOI: 10.1007/978-3-030-60248-2_48
[8] Wang X, Guo Z, Xu C et al (2021) ImageSem group at ImageCLEFmed Caption 2021 task: exploring the clinical significance of the textual descriptions derived from medical images. In: CLEF 2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bucharest, Romania. https://ceur-ws.org/Vol-2936/paper-118.pdf
[9] Xie X, Xiong Y, Yu P et al (2019) Attention-based abnormal-aware fusion network for radiology report generation. In: 24th International Conference on Database Systems for Advanced Applications (DASFAA 2019), vol. 11448. DOI: 10.1007/978-3-030-18590-9_64
[10] Lu J, Xiong C, Parikh D, Socher R (2017) Knowing when to look: adaptive attention via a visual sentinel for image captioning. DOI: 10.1109/CVPR.2017.345
[11] Zhang W, Shi H, Tang S, Xiao J, Yu Q, Zhuang Y (2021) Consensus graph representation learning for better grounded image captioning. DOI: 10.1609/aaai.v35i4.16452
[12] Li J, Selvaraju RR, Gotmare AD, Joty S et al (2021) Align before fuse: vision and language representation learning with momentum distillation. In: NeurIPS 2021. https://arxiv.org/pdf/2107.07651.pdf
[13] Chen Y, Li L, Yu L, Kholy AE, Ahmed F, Gan Z, Cheng Y, Liu J (2020) UNITER: universal image-text representation learning. In: ECCV 2020, vol. 12375. DOI: 10.48550/arXiv.1909.11740
[14] Boardman R, McCormick H (2021) Attention and behaviour on fashion retail websites: an eye-tracking study. Information Technology & People. DOI: 10.1108/ITP-08-2020-0580
[15] Sussman TJ, Jin J, Mohanty A (2016) Top-down and bottom-up factors in threat-related perception and attention in anxiety. DOI: 10.1016/j.biopsycho.2016.08.006
[16] Zhang H, Zeng P, Gao L, Lyu X, Song J, Shen HT (2023) SPT: Spatial Pyramid Transformer for image captioning. DOI: 10.1109/TCSVT.2023.3336371
[17] Chen W, Ruan R, Deng W, Gao J (2023) The effect of visual attention process and thinking styles on environmental aesthetic preference: an eye-tracking study. DOI: 10.3389/fpsyg.2022.1027742
[18] Choi WS, Heo Y-J, Punithan D, Zhang B-T (2022) Scene graph parsing via Abstract Meaning Representation in pre-trained language models. DOI: 10.18653/v1/2022.dlg4nlp-1.4
[19] Qiu Y, Yamamoto S, Nakashima K, Suzuki R, Iwata K, Kataoka H, Satoh Y (2021) Describing and localizing multiple changes with Transformers. DOI: 10.1109/ICCV48922.2021.00198
[20] Wang Y, Xu J, Sun Y (2022) End-to-end Transformer based model for image captioning. DOI: 10.1609/aaai.v36i3.20160

