Generating Captions From Images Using the Flickr Image Dataset
Piyush Sharma, Department of CSE, Chandigarh University, Mohali, India, [email protected]
Kiranpreet Kaur, Department of CSE, Chandigarh University, Mohali, India, [email protected]
Aaryan Gandotra, Department of CSE, Chandigarh University, Mohali, India, [email protected]
For this purpose, VGG-16 (Visual Geometry Group), a 16-layer convolutional network designed for object recognition, is used to extract image features. In the second step, the extracted features are trained together with the captions provided in the dataset.
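The paper does not include implementation code; as an illustration, a minimal sketch of this feature-extraction step, assuming a Keras/TensorFlow implementation (an assumption, not a detail stated by the authors), could look as follows:

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

# Reuse a pretrained VGG-16 and keep the 4096-dimensional "fc2" activations
# as the image representation, dropping the final classification layer.
base = VGG16(weights="imagenet")
feature_extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def extract_features(image_path):
    # Load one image, resize it to VGG-16's expected 224x224 input, and encode it.
    img = load_img(image_path, target_size=(224, 224))
    x = img_to_array(img)                       # shape (224, 224, 3)
    x = preprocess_input(x[np.newaxis, ...])    # add batch axis, apply VGG-16 preprocessing
    return feature_extractor.predict(x, verbose=0)[0]   # 4096-dimensional feature vector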
The prediction pipeline incorporates four essential stages: (1) Object Detection and Recognition: identifying and categorizing objects within the given context; (2) Attribute Prediction: anticipating specific attributes associated with the detected objects; (3) Scene Classification: classifying the overall scene or surroundings based on the identified objects; (4) Description Generation: creating a descriptive narrative of the scene by combining information from the preceding stages.
This study explores state-of-the-art deep learning-based image captioning techniques. It analyzes the prevailing encoder-decoder architecture, in which Convolutional Neural Networks (CNNs) extract image features and Long Short-Term Memory (LSTM) networks convert those features into natural language captions. It then turns to the development of attention mechanisms and multimodal feature inclusion, which are advancing the field toward more precise and insightful image descriptions.
II. Literature Review

Earlier efforts to address this problem employed template-based approaches, which used image classification techniques to categorize objects into predefined classes; these objects were then inserted into a standard template sentence. Contemporary research, however, has shifted towards Recurrent Neural Networks (RNNs) as the primary focus for tackling this problem. RNNs have gained significant popularity in various Natural Language Processing (NLP) tasks, notably machine translation, where they excel at generating sequences of words. Extending this capability, image caption generators leverage RNNs to produce descriptions for images by generating words sequentially, thereby associating textual descriptions with visual content.
In recent years, there has been an increase in interest in text-image production, namely the creation of images from text annotations. A number of methodological and conceptual approaches, including generative adversarial networks (GANs), have been developed. Researchers have also looked into controllability and semantic linkages in text-image production. Li et al., 2019 [3] created an image-referencing method that focuses on the controllability of generated images. Yin et al., 2019 [1] concentrated on semantic splitting to improve text-to-image generation by separating different semantic aspects in the input for precise image integration.

When performing tasks like labeling images and answering visual questions, both bottom-up and top-down attention strategies are crucial. According to Sussman et al., 2016 [15], top-down responses are motivated by cognitive strategies, whereas bottom-up responses are linked to salient visual objects that automatically capture attention; the two rely on different processes within the visual cortex. Building on this, a model was proposed for image captioning and visual question answering that combines top-down and bottom-up attention, using Faster R-CNN-like mechanisms for the bottom-up component (Anderson et al., 2018 [4]; Wang et al., 2020 [5]). Using several reasoning stages and fine-grained analysis, this approach enables a more thorough study of images (Anderson et al., 2018 [4]).

Eye-tracking research has shown that this method can be used to measure both top-down and bottom-up attention, offering insights into how people respond to visual stimuli (Boardman et al., 2021 [14]). Furthermore, research on how thinking styles and visual attention processes affect aesthetic choices has shown the importance of bottom-up visual attention, especially when images are involved (Chen et al., 2023 [17]). It has also been proposed that integrating top-down and bottom-up attention can improve tasks such as visual question answering, in which attention is focused on pertinent image regions according to language characteristics associated with the query (Yang et al., 2021 [6]).
Yang et al., 2021 [6]
  Data: Medical images with properly captioned reports.
  Method: Applying an adaptive multimodal attention method to improve the precision and productivity of ultrasound image report generation.
  Findings: Improvements in the use of multimodal attention processes to generate reports based on ultrasound images.

Zhang et al., 2021 [11]
  Data: Images with improper captions attached.
  Method: Emphasizing the problem of caption hallucinations caused by erroneous or biased visual-textual correlations acquired from datasets.
  Findings: Clarifies the vital role of dataset quality and how it affects the process, especially for grounded image captioning.

Wang et al., 2022 [20]
  Data: Images, each with human-generated captions.
  Method: Creating an end-to-end Transformer-based approach called PureT for picture captioning.
  Findings: PureT facilitates end-to-end training and eliminates the requirement for pretraining the object detection component, both of which boost efficiency.

Chen et al., 2023 [2]
  Data: Observed eye movements and fixations of the subjects.
  Method: Using eye-tracking methodology, the impact of visual attention processes and cognitive styles on environmental aesthetic preferences is examined.
  Findings: Findings suggest that individual differences in visual attention processes and thought styles, as well as visual attention, influence preferences for environmental aesthetics.

Table 1: Literature Review
III. Methodology

The steps involved in developing an image caption generator are: obtaining a dataset of captioned images; preprocessing the images and extracting features with CNNs; tokenizing the captions; constructing an architecture for the CNN and RNN model; training the model on image-caption pairs using methods like teacher forcing and a suitable loss function; evaluating the model's performance on a validation set using metrics like BLEU, METEOR, or CIDEr; post-processing the generated captions to improve their readability; deploying the model in a setting conducive to user interaction; and monitoring and updating the model's performance over time. The process of creating captions for images thus combines computer vision techniques, to comprehend the image's content, with natural language processing techniques, to produce meaningful and coherent descriptions.

Fig. 1

The flow chart for the entire process is shown in Fig. 1. It starts with data collection and preparation, which includes the photographs and their written captions. Next comes preprocessing, in which unnecessary data is removed from the collection and the data is pruned. Once all preprocessing and data mining have been completed, the model is selected and trained repeatedly on the dataset until its accuracy stabilizes. Finally, an evaluation grid checks and fine-tunes the consistency of the tested model.
An image captioning model's training process requires a critical first step: data preparation. It guarantees that both the textual and the visual input can be processed and understood effectively by the model. The two crucial preprocessing procedures are examined in more detail below:
a. Image preprocessing
b. Caption preprocessing

Image preprocessing ensures that all of the collection's images are consistent. It involves tasks such as resizing images to a common size (e.g., 224x224 pixels) to increase processing efficiency, and standardizing pixel values to a predefined range (typically between 0 and 1) to aid training.
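As a concrete illustration of these two operations, a minimal helper (assuming Pillow and NumPy; the function name is illustrative, not taken from the paper) might be:

import numpy as np
from PIL import Image

def preprocess_image(path, size=(224, 224)):
    # Resize to a common 224x224 resolution and scale pixel values to the range [0, 1].
    img = Image.open(path).convert("RGB").resize(size)
    return np.asarray(img, dtype=np.float32) / 255.0   # shape (224, 224, 3)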
Caption preprocessing prepares the text captions in a similar way. Captions are broken into smaller pieces, either words or sub-word tokens, which allows the model to process them sequentially.
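A common way to realize this tokenization step, assuming the Keras Tokenizer is used (an assumption; the paper does not name a specific tool), is sketched below. Start and end markers let the decoder learn where captions begin and end:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Illustrative captions wrapped in start/end markers.
captions = ["startseq a dog runs on the beach endseq",
            "startseq two children play football endseq"]

tokenizer = Tokenizer(oov_token="<unk>")
tokenizer.fit_on_texts(captions)                      # build the word-to-index vocabulary
sequences = tokenizer.texts_to_sequences(captions)    # captions as lists of integer ids
padded = pad_sequences(sequences, padding="post")     # equal-length sequences for batching
vocab_size = len(tokenizer.word_index) + 1            # +1 for the padding index 0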
The encoder, typically a CNN, acts as the model's "eyes": through its convolutional layers, it examines the preprocessed image and extracts high-level visual information such as objects, shapes, and their relationships. This information is then compressed into a vector representation.

The "decoder," which is usually an LSTM network, plays the role of the "storyteller." After receiving the encoded image features, it initializes an internal state that helps it recall the words that have already been generated. Building the description word by word, the LSTM uses this state at each step to predict the most likely next word in the caption. In other words, the CNN essentially "sees" the image and captures its essence, whereas the LSTM analyzes those features and turns them into a natural language narrative.
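A minimal sketch of such a CNN-plus-LSTM captioning model, written here in Keras as a common "merge" formulation (the layer sizes and exact wiring are illustrative assumptions, not the authors' stated configuration):

from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

vocab_size, max_len = 8000, 34   # illustrative values

# Image branch: the encoded 4096-d VGG-16 feature vector, projected to the LSTM width.
img_in = Input(shape=(4096,))
img_feat = Dense(256, activation="relu")(Dropout(0.5)(img_in))

# Text branch: the partial caption generated so far, embedded and summarized by an LSTM.
txt_in = Input(shape=(max_len,))
txt_emb = Embedding(vocab_size, 256, mask_zero=True)(txt_in)
txt_feat = LSTM(256)(Dropout(0.5)(txt_emb))

# Merge both branches and predict the next word of the caption.
merged = Dense(256, activation="relu")(add([img_feat, txt_feat]))
out = Dense(vocab_size, activation="softmax")(merged)

model = Model(inputs=[img_in, txt_in], outputs=out)
model.compile(loss="categorical_crossentropy", optimizer="adam")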
The training loop is the key component of learning. Here, the dataset's image-caption pairs are repeatedly presented to the model. Using the encoded image features, it makes word-by-word caption predictions, and a loss function measures how much these predictions depart from the actual captions. Through backpropagation and an optimization algorithm, this loss directs the model to adjust its internal parameters (weights and biases). The objective is to minimize the loss, bringing the captions generated by the model ever closer to those written by humans. By repeatedly iterating over the full dataset, this loop enables the model to progressively learn the complex relationship between visual information and its associated natural language description.
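The teacher-forcing scheme mentioned earlier can be made concrete with the following sketch, which continues the hypothetical tokenizer and model from the previous examples: every ground-truth prefix of a caption is used as input, and the next ground-truth word is the prediction target.

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def make_training_pairs(feature, seq, max_len, vocab_size):
    # Teacher forcing: each prefix of the true caption is paired with the next true word.
    X_img, X_txt, y = [], [], []
    for i in range(1, len(seq)):
        prefix = pad_sequences([seq[:i]], maxlen=max_len, padding="post")[0]
        X_img.append(feature)                                       # encoded image features
        X_txt.append(prefix)                                        # ground-truth words seen so far
        y.append(to_categorical(seq[i], num_classes=vocab_size))    # next word, one-hot encoded
    return np.array(X_img), np.array(X_txt), np.array(y)

# Training then reduces to e.g. model.fit([X_img, X_txt], y, epochs=20, batch_size=64),
# which minimizes the cross-entropy loss via backpropagation as described above.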
Fig. 2

Beyond the core training procedure, there are other phases that are essential for the best results. The same model has been trained on the dataset multiple times with feedback, and its performance improved accordingly. The validation-loss curve shown in Fig. 3 decreases exponentially, indicating that the validation loss falls as training of the model progresses.
Evaluation measures, which frequently rely on metrics such as the BLEU score, help assess the quality of the output captions. A popular method for fine-tuning models is grid search, which explores several hyperparameter configurations (the settings that regulate the training process) to find the one that performs best. Lastly, once the model has been trained and tuned, it is saved for later use on fresh photos, enabling it to demonstrate its image captioning capabilities.
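Applying the saved model to fresh photos amounts to decoding one word at a time. A greedy-decoding sketch, reusing the hypothetical model, tokenizer, and feature extractor from the earlier examples (beam search or other strategies could equally be used), is:

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, feature, max_len):
    # Greedy decoding: repeatedly append the single most likely next word.
    index_to_word = {i: w for w, i in tokenizer.word_index.items()}
    text = "startseq"
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_len, padding="post")
        probs = model.predict([feature[np.newaxis, :], seq], verbose=0)[0]
        word = index_to_word.get(int(np.argmax(probs)))
        if word is None or word == "endseq":
            break
        text += " " + word
    return text.replace("startseq", "").strip()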
IV. RESULTS

The model is tested on the Flickr image dataset. This dataset contains 31,783 photos, each accompanied by five human-written captions. BLEU, CIDEr, ROUGE and METEOR are the four primary evaluation measures used.

Fig. 4
The BLEU score compares the similarity of the generated caption to the reference captions, whereas CIDEr takes into account both relevance and the usage of n-grams (word sequences). ROUGE, by contrast, evaluates the overlap of n-grams between the generated and reference captions while also accounting for recall, precision and F1-score. METEOR provides a more nuanced evaluation of translation and captioning quality by incorporating flexible matching and synonyms.
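As an illustration of how such scores are computed (not the authors' exact evaluation script), corpus-level BLEU can be obtained with NLTK:

from nltk.translate.bleu_score import corpus_bleu

# Each image contributes several tokenized reference captions and one generated candidate.
references = [[["a", "dog", "runs", "on", "the", "beach"],
               ["the", "dog", "is", "running", "along", "the", "shore"]]]
candidates = [["a", "dog", "runs", "on", "the", "shore"]]

bleu4 = corpus_bleu(references, candidates)   # default weights give cumulative 4-gram BLEU
print(f"BLEU-4: {bleu4:.2f}")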
On the test set, this model achieved an average BLEU score of 0.62, a CIDEr score of 1.05, a ROUGE score of 0.70 and a METEOR score of 0.46 (Fig. 4). In comparison with a baseline Encoder-Decoder with Attention model, it demonstrated a noteworthy improvement over the baseline's BLEU (0.58), CIDEr (0.98), ROUGE (0.65) and METEOR (0.44) scores (Fig. 4).

Additionally, an ablation study was carried out in which several components of the model were removed in turn to examine their individual contributions. This demonstrated that, during caption generation, the attention mechanism was essential in directing the model towards particular regions of the images, resulting in more precise descriptions of the objects and their interactions.

V. CONCLUSION

When the proposed image captioning model was applied to the Flickr image dataset, it performed well in terms of both BLEU and CIDEr scores. The ablation study demonstrates how effectively the attention mechanism focuses on relevant areas of the image, and the model does a good job of producing relevant captions. Future work will concentrate on improving the model's capacity to handle complex scenes and to produce a wider variety of distinctive captions. This may involve incorporating more sophisticated language generation algorithms or object relationship identification. Overall, the model offers a strong tool for automatic image description with potential for further refinement and creative growth, and it represents a substantial advancement in image captioning.

VI. REFERENCES

[1] Yin C, Qian B, Wei J, et al. "Automatic generation of medical imaging diagnostic report with hierarchical recurrent neural network." In: 2019 19th IEEE International Conference on Data Mining (ICDM 2019). DOI: 10.1007/s10462-022-10270-w
[2] Vinyals O, Toshev A, Bengio S, et al. "Show and tell: A neural image caption generator." In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. DOI: 10.1109/CVPR.2015.7298935