Abstract:
Deep learning is a branch of machine learning based on neural networks. Interest in this field is growing rapidly, and many of its applications perform tasks at a level comparable to humans, such as self-driving cars, medical care, lip reading, and photo description. Photo description, or image captioning, is a deep learning application in which a set of images is passed to a model that processes them to generate captions or descriptions of the input images. This process needs high computational resources and a large dataset to be trained well, in order to decrease the probability of generating meaningless sentences unrelated to the images. There are many more works on this topic for English than for Arabic, because Arabic has many complicated features to deal with, such as writing from right to left, letters that are not pronounced in many other languages, and, compared to English, mostly connected word forms. In addition, as many as half a billion people speak Arabic. In this work, Arabic image captioning with Keras is the core topic. The Flickr8k dataset is used, in which each image is associated with three different Arabic captions describing the entities and events in the image. The Flickr8k Arabic dataset has 8091 images, each with three different Arabic descriptions, together with training and testing image ids; most existing works use the English Flickr8k dataset. A CNN model is used as the image model and an RNN-LSTM as the language model. The minimum loss value achieved after running 50 epochs is 1.75, with very good captions describing the image components, and the BLEU score is 42%.
Keywords: Image captions generator, Arabic text, LSTM, ResNet50, Flickr8k, BLEU,
Arabic light stemming, CNN, RNN.
1. Introduction:
An image caption generator is an application of deep learning. It generates a caption or description for an image: an image is passed to a trained model, which processes it and produces a caption or description for it. New research appears in this field continually, and it is considered a challenging topic, since descriptive text must be generated for a given image. It needs methods from computer vision to deal with and understand the image content, and a language model from natural language processing to turn the image understanding into words in the right way. The compositionality and nature of both the language and the visual input play a big role in how the model is trained. This type of model needs to be trained on the co-occurrence of objects in the dataset context and must be able to generalize. The human brain can easily understand image content and tell what it shows, but for a computer this is not as easy as it is for a human. Therefore, there is a need to build models that facilitate this process on computers. The power of deep learning methods has achieved state-of-the-art results on image caption generation, where a single model can predict a caption given a photo [1]. Text preprocessing is the most important step when dealing with text. It is needed to convert text data that humans can understand easily into a format that the machine can also understand. Each language needs a specific way of handling, and the text preprocessing steps are decided according to the task. For example, the Arabic language differs from other languages: its letters are connected, there are no capital and small letters, and it uses diacritics. Arabic text needs preprocessing steps such as removing diacritics, removing prefixes and suffixes, and removing the connective letter, which differ from English text preprocessing, which needs steps such as changing capital letters to small letters [2].
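For illustration, a minimal Python sketch of such Arabic-specific cleaning is shown below. The regular expressions and the exact set of rules are assumptions made for the example, not the precise pipeline used in this work.

```python
# A minimal sketch of Arabic text cleaning: drop diacritics, normalize alif/hamza
# variants, and remove foreign characters. The exact rules are illustrative only.
import re

ARABIC_DIACRITICS = re.compile(r'[\u064B-\u0652\u0640]')  # tashkeel marks and tatweel

def clean_arabic(text: str) -> str:
    text = ARABIC_DIACRITICS.sub('', text)            # remove diacritics and tatweel
    text = re.sub(r'[إأآ]', 'ا', text)                 # normalize alif/hamza variants to bare alif
    text = re.sub(r'[^\u0600-\u06FF\s]', ' ', text)    # drop non-Arabic characters and punctuation
    return re.sub(r'\s+', ' ', text).strip()           # collapse whitespace
```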
An image caption generator consists of a CNN to extract spatial information from the images, an RNN to generate the sequence of words, and an LSTM to remember long word sequences. It allows employing techniques from computer vision and natural language processing to extract overall textual information about the target images. Image captioning has many applications, such as information retrieval, assistance for visually impaired people, social media, and so on. Despite the lack of data sources, there are great accomplished works for English, whereas progress on Arabic image captioning is still slow. Works in Arabic remain limited because of the shortage of available datasets and the complex nature of the language. The Arabic language is very rich and needs specialists who know how to deal with it.
This work aims to build a model for image caption generation that deals with Arabic captions, using a developed dataset based on Flickr8K that has 8000 images with three different captions for each image. The dataset is divided into a training set with 6000 images, a validation set with 1000 images, and a testing set with 1000 images. The work has a preprocessing step to prepare the images and text for use. For image preparation there are many models that could be used, but in this case ResNet50, which Keras provides directly, is used. The network is pre-trained on ImageNet and allows deep neural networks with 150+ layers to be trained successfully. Briefly, image features are extracted, saved to a file, and later fed to the model. This step loads each image, prepares it for ResNet50, and collects the extracted features. The result is saved in a .pkl file, and the images are thereby also ready for the testing model. For the text part of the dataset, each image has three different captions, and this part needs cleaning, stemming, and normalization. Light stemming for Arabic text is used, and the final preprocessed text is saved in a text file.
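The light stemming rules described later in this paper (removing length-three and length-two prefixes and suffixes, removing the connective 'و', and normalizing the initial hamza) could be sketched as follows; the affix lists here are assumed for illustration and are not the exact lists used.

```python
# Illustrative Arabic light stemmer following the steps described in the text.
# The prefix and suffix lists are assumptions, not the paper's exact lists.
PREFIXES = ['وال', 'بال', 'كال', 'فال', 'ال', 'لل']                # length-3 then length-2 prefixes
SUFFIXES = ['هما', 'كما', 'تها', 'ها', 'ان', 'ات', 'ون', 'ين', 'ية']  # common suffixes

def light_stem(word: str) -> str:
    if word.startswith('وو'):                 # remove the connective waw before a word starting with waw
        word = word[1:]
    if word and word[0] in 'أإآ':              # normalize an initial hamza form to a bare alif
        word = 'ا' + word[1:]
    for p in PREFIXES:                        # strip at most one prefix, keeping at least two letters
        if word.startswith(p) and len(word) - len(p) >= 2:
            word = word[len(p):]
            break
    for s in SUFFIXES:                        # strip at most one suffix, keeping at least two letters
        if word.endswith(s) and len(word) - len(s) >= 2:
            word = word[:-len(s)]
            break
    return word
```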
The deep learning model definition is divided into three parts: 1. Photo feature extraction, for which the pre-trained ResNet50 architecture is used; the output of this step is the set of features extracted from the image dataset. 2. A sequence processor, consisting of a word embedding followed by an LSTM recurrent layer. 3. The outputs from both previous parts are combined and processed by a Dense layer to make the final prediction. At each epoch, a copy of the trained model is saved in a .h5 file; the running time depends on the hardware used, and on average each epoch takes about ten minutes on a GPU. After that, the model fitting and evaluation steps are carried out using the validation dataset, which helps to avoid overfitting, a problem caused by the model quickly learning and overfitting the training dataset. BLEU, the Bilingual Evaluation Understudy, scores are used to evaluate translated text against one or more reference translations [3]. In this case, BLEU scores are used to evaluate the generated text.
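As a sketch of how this evaluation can be computed, NLTK's corpus_bleu can score the generated captions against the reference captions; the function and its inputs below are illustrative rather than the exact evaluation code of this work.

```python
# BLEU evaluation sketch with NLTK: each image contributes its reference captions
# (as lists of tokens) and one generated caption (a list of tokens).
from nltk.translate.bleu_score import corpus_bleu

def bleu_scores(references, hypotheses):
    # references: one list of reference token lists per image; hypotheses: aligned generated token lists
    return {
        'BLEU-1': corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0)),
        'BLEU-2': corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0)),
    }
```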
The rest of the paper is organized as follows. In the next section, the literature review is presented. Then, the methods and materials used in this paper are presented, followed by the experiments and results and a discussion of the findings. Finally, the paper is concluded.
2. Literature Review:
In 2018, Xinxin Zhu et al. proposed two improvements. The first, named the triple attention (TA-LSTM) image captioning method, applies attention at the input stage of the hidden unit of the LSTM and at the output stage of each hidden unit. At the input stage, they use a text-conditional embedding method, which uses previous text information and combines it with the image information. Stacked and parallel LSTMs are also used to improve performance. The parallel LSTM catches information from the input context; each LSTM has identical parameters, so the input context can be represented with variant information. Finally, an average pooling layer is used to combine all LSTM outputs [4]. In 2020, Yan Chu et al. provided a joint model for image caption generation (AICRL) based on ResNet50, LSTM, and soft attention. The model has one encoder and one decoder. The encoder applies a ResNet50-based CNN to provide a representation of the given image dataset by embedding the images into a specific vector. The decoder adopts an LSTM with a soft attention mechanism to selectively focus on specific parts of the image to predict the text. Different metrics were used for evaluation, such as BLEU, METEOR, and CIDEr [5]. In 2018, Huda A. Al-muzaini et al. aimed to develop a model for image caption generation that describes images in the Arabic language. The model consists of two subnetworks, an RNN for text and a CNN for images, which interact with each other to generate correct and meaningful descriptions for the given images. The COCO and Flickr datasets with English descriptions were used. At first, they translated the English descriptions into Arabic using Google Translate. Then, 3427 images with a vocabulary size of 9854 were fed to the model. For the encoder, a 16-layer Visual Geometry Group (VGG) CNN is adopted [6]. In 2017, Philip Kinghorn et al. used a model based on deep learning for image description generation. Two RNNs, an attribute prediction network and an encoder-decoder, were embedded together to give descriptions for the given images. The proposed system focused on the image regions containing people and objects, and the IAPR TC-12 dataset was used. The R-CNN is a CNN with extra outputs that predict bounding box coordinates, and it was trained using the ILSVRC13 dataset [7]. In 2018, Ying Hua Tan et al. used the phrase-based Long Short-Term Memory (phi-LSTM) architecture to produce image descriptions. Its role is to decode captions from phrase to sentence: the phrase decoder decodes noun phrases of variable length, together with an abbreviated sentence decoder. The image caption is generated by aggregating the phrases with the sentence during the inference step. The Flickr8k, Flickr30k, and MS-COCO datasets were used [8]. In 2017, Pranay Mathur et al. presented a simple encoder-decoder-based implementation with some modifications and improvements to function in a real-time environment, which makes it possible to run the models on low-end hardware and personal devices. TensorFlow and an Android application were used, and the MSCOCO dataset was applied in this work [9].
In 2018, Aghasi Poghosyan et al. provided a model that can generate image captions automatically using an RNN with a modified LSTM cell that has an additional gate for image features. These adjustments help to produce more accurate image captions. The MSCOCO dataset was used in this work [10]. In 2019, Mirza Muhammad Ali Baig et al. addressed a problem in image captioning tasks, namely that there are not enough data sources to build highly accurate image captioning models. Their work offers a solution in the form of novel object injection: the model uses a pre-trained caption producer and works on its output to insert objects that are not present in the dataset into the caption. The BLEU, CIDEr, and ROUGE-L normalized metrics were used for evaluating the work [11]. Another work uses a Regional Object Detector (RODe) for detection, recognition, and caption generation for the given image dataset. Object detection is done using an R-CNN, feature extraction from images is done using a CNN, an attribute-creation step defines the string attributes, and the fourth step is an encoder-decoder for string labels using an RNN [12]. In 2019, Neeraj Gupta et al. observed that most models rely on the images to describe the context more than on the text or caption embedded with the image, so feature extraction mostly depends on the image itself; their work aims to use images together with text captions to describe the content of the image [13]. Another work proposed an analysis of three components: the CNN, the RNN, and sentence generation. The authors found that the VGG network works better, as shown in the BLEU score results. Also, a new recurrent layer was provided as a simplified version (GRU), implemented using C++ and MATLAB. The result is close to that obtained using an LSTM, but it has fewer parameters, which saves memory and makes the learning process faster [14].
Another work provides a model for image caption generation that can generate natural sentences describing a given image, based on deep learning and RNNs. The provided model is trained to maximize the likelihood of the correct sentence describing the given training image, and it performs well both qualitatively and quantitatively [15]. A multimodal RNN model is provided in another work that can generate captions for a given image. The captions are generated based on the probability distribution of generating a word depending on the previous words. The model has two networks: an RNN for dealing with text and a CNN for dealing with images. The provided model was also used for retrieval tasks (images or sentences) and achieved very good performance [16]. Another work aims to study the RNN and LSTM, their equations, and their derivatives. The authors explain the difficulties of training RNNs, especially the vanishing and exploding gradient problems. Instead, they replace the RNN with the Vanilla LSTM network and explain the forward pass and the backward pass. The core contribution is their own approach to analyzing the RNN and Vanilla LSTM from a signal processing perspective [17]. A bidirectional LSTM model is provided to deal with Arabic text; the LSTM network is able to process and connect each part, which makes it suitable for the NER task. A pre-trained model is used to train the input that enters the LSTM network, and no feature engineering or preprocessing is used. The LSTM is more helpful for solving Arabic NER than other methods [18]. Neural networks are used to deal with Arabic text, to correct the language modeling, generate text, and predict missing words in the text. The authors aim to adapt an RNN to the Arabic language model to produce correct Arabic sequences, and a CNN architecture is used to predict the missing text in Arabic documents [19]. An RNN and an SVM are used in another work to analyze Arabic hotel reviews. The provided approach is trained with lexical, word, syntactic, morphological, and semantic features. The SVM shows better performance in category identification, target expression extraction, and polarity identification; on the other hand, the RNN is faster in execution time [20]. A transfer learning model is provided that aims to detect whether given written Arabic text is written by humans or generated automatically. A set of tweets is used as the dataset, and GPT2-Small-Arabic is used to generate fake Arabic sentences. LSTM, BI-LSTM, GRU, and BI-GRU RNN word-embedding-based baseline models are used for evaluation. The contribution of this work is that ARABERT and GPT2 are combined for detecting and classifying Arabic text for the first time [21]. In 2019, Songtao Ding et al. noted that although image caption generation tasks in deep learning achieve great results, there are still problems facing this task, such as obtaining accurate results for high-level visual tasks and for images that have multiple targets. Their work proposes two mechanisms to solve these issues, stimulus-driven and concept-driven, and the model depends on the integration of a CNN to deal with images and an LSTM to deal with sentences [22]. In 2020, Wenqiao Zhang et al. proposed a method for image caption generation tasks. The method combines the ICM and IRM into Cooperative Learning, which allows sharing common knowledge to translate the heterogeneous information. The Hierarchical Refined Attention (HRA) has the role of cleaning the visual information from the image; it can attend sequentially to the image features and semantic attributes and then merge them to get a better caption for the images [23].
In 2019, Xianhua Zeng et al. noted that the use of deep learning for image caption generation in the medical field is still lacking, especially for analyzing and describing information obtained from ultrasound image understanding. Their work provides a method to detect and encode spatial areas in ultrasound images, and an LSTM is used for decoding and producing observation text that describes the disease content in the images [25]. There are two approaches to image caption generation. The first is a retrieval approach that depends on recovering information to produce a description of the image content. The second is a generation approach using the encoder-decoder framework. A CNN-RNN is combined with a beam search whose role is to generate multiple captions for the same image, and the best caption is selected depending on its lexical similarity with the reference [26]. A CNN helps to understand the word embeddings and gives a better history representation when dealing with small datasets; it is better in classification accuracy and at predicting high-level words, and the predicted captions are more human-like [27]. Image caption generation methods can also be used as building blocks to produce video captions, so that a video caption is essentially a summarization of image captions [28].
3.1. Dataset:
The dataset used is developed based on Flickr8K and has 8000 images with Arabic descriptions, three different captions for each image. The dataset is split into a training set with 6000 images, a validation set with 1000 images, and a test set with 1000 images, provided as .txt files consisting of two columns: the image name (.jpg) and the text captions.
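A simple loading sketch for such a caption file is shown below, assuming one "image_name.jpg caption" pair per line; the actual file layout may differ slightly.

```python
# Load Arabic captions keyed by image file name from a whitespace-separated .txt file.
from collections import defaultdict

def load_captions(path):
    captions = defaultdict(list)
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_name, caption = line.split(maxsplit=1)  # first token is the image file name
            captions[image_name].append(caption)          # up to three captions per image
    return captions
```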
In this work, the pre-trained ResNet50 model is used to deal with the images; Keras provides this model and it can be used directly. Image features are pre-computed, saved to a file as a description of the image dataset, and later fed to the model. This optimization needs less time and consumes less memory. Keras.preprocessing.image is used to load the images and preprocess and reshape their size. The loaded images are passed to ResNet50 using keras.applications.resnet50. Finally, the extracted features are saved in a .pkl file created by pickle, the Python module, which contains a byte stream representing the object.
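A minimal sketch of this feature-extraction step with Keras' pre-trained ResNet50 is shown below; the directory and file names are placeholders.

```python
# Extract 2048-dimensional ResNet50 features for every image and pickle them.
import os, pickle
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image

feature_model = ResNet50(weights='imagenet', include_top=False, pooling='avg')  # 2048-d output

def extract_features(image_dir):
    features = {}
    for name in os.listdir(image_dir):
        img = image.load_img(os.path.join(image_dir, name), target_size=(224, 224))
        x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
        features[name] = feature_model.predict(x, verbose=0)[0]
    return features

with open('features.pkl', 'wb') as f:
    pickle.dump(extract_features('Flickr8k_images'), f)   # directory name is a placeholder
```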
First of all, the feature extractor is a ResNet50 model pre-trained on the ImageNet dataset; its output features are used as an input vector of 2048 elements. A Dense layer processes these features to produce a 256-element representation of the photo. Then, the sequence processor is a word embedding that deals with the text part, followed by an LSTM recurrent neural network layer with 256 memory units. It expects input sequences with a maximum length of 26 words. The outputs of both previous parts are merged and processed by the Dense layers of the decoder to make the prediction. A 20% dropout is used as a form of regularization, the Adam optimizer is used, and the learning rate is 0.01. The model is trained on the training dataset, which consists of 6000 images; the vocabulary size is 9090 and the maximum sequence length is 26 words. The activation function used is ReLU for all layers except the final layer, which uses softmax to normalize the output; the softmax layer converts the output into probabilities.
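A sketch of this merge-style architecture in Keras is given below, using the reported 2048-element feature input, 256-unit layers, 20% dropout, 26-word maximum length, 9090-word vocabulary, and Adam with a 0.01 learning rate; the embedding dimension and the loss function are assumptions rather than values reported in the text.

```python
# Merge-model sketch: image-feature branch + caption-sequence branch -> next-word softmax.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

vocab_size, max_length = 9090, 26

inputs1 = Input(shape=(2048,))                      # pre-computed ResNet50 features
fe1 = Dropout(0.2)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

inputs2 = Input(shape=(max_length,))                # partial caption as word indices
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)   # embedding size assumed
se2 = Dropout(0.2)(se1)
se3 = LSTM(256)(se2)

decoder1 = add([fe2, se3])                          # combine both branches
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)  # next-word probabilities

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.01))
```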
There is a big chance that the model will overfit the training dataset; for this reason, the validation dataset is used to monitor the training model's performance. At the end of each epoch, if the model's performance has improved, the model is saved to a file. ModelCheckpoint in Keras is used to monitor the model's loss. Initially, ten epochs are used to fit the model; then, the number of epochs is increased until the minimum loss value is reached. Running the epochs takes a long time, approximately ten minutes per epoch using modern hardware such as a GPU or TPU hardware accelerator.
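A sketch of the fitting step with a ModelCheckpoint callback is shown below; the generator and step-count variables are hypothetical names standing in for the data-preparation code.

```python
# Save a copy of the model each epoch while monitoring validation loss.
from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint(
    'model-ep{epoch:03d}-val_loss{val_loss:.3f}.h5',
    monitor='val_loss', save_best_only=False, mode='min', verbose=1)

model.fit(train_generator,                  # yields ([image_features, input_sequence], next_word)
          epochs=50,
          steps_per_epoch=train_steps,
          validation_data=val_generator,
          validation_steps=val_steps,
          callbacks=[checkpoint])
```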
Figure 1: Image captioning DL model.
Figure 1 shows the architecture of the model used in this work for image caption generation.
BLEU scores are used to evaluate the generated captions against one or more reference sentences [29]. The output range is between 0 and 1, where values closer to 1 are better.
4.2. Results:
Training the model starts with running ten epochs and calculating the loss value after each epoch; the last epoch has the minimum loss value. The ten epochs take 84 minutes in total, and the BLEU result is 0.38.
Figure 2 shows the model's result after training it for ten epochs and computing the loss value. The caption output indicates that the model is able to distinguish the main component or object in the image. After running ten epochs, the best model is saved and used for prediction; the result is not 100% accurate.
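Prediction can be sketched as a greedy decoding loop over the trained model, assuming 'startseq'/'endseq' sentinel tokens and a fitted Keras tokenizer; these names are illustrative rather than the exact ones used in this work.

```python
# Greedy caption generation: repeatedly predict the most likely next word.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_feature, max_length=26):
    caption = 'startseq'
    for _ in range(max_length):
        seq = pad_sequences(tokenizer.texts_to_sequences([caption]), maxlen=max_length)
        yhat = np.argmax(model.predict([photo_feature.reshape(1, -1), seq], verbose=0))
        word = tokenizer.index_word.get(int(yhat))
        if word is None or word == 'endseq':
            break
        caption += ' ' + word
    return caption.replace('startseq', '').strip()
```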
Epoch    Loss value
#5       3.37
#6       3.19
#7       3.05
#8       2.90
#9       2.77
#10      2.67
#11      2.57
#12      2.50
#13      2.42
#14      2.36
#15      2.30
#16      2.26
#17      2.17
#18      2.13
#19      2.10
#20      2.00
It is noticeable that the loss values decrease after running more epochs, but they still need to be reduced further to get more accurate prediction results at testing time. At epoch number 20, the loss value reaches 2.00.
In another trial, 35 epochs are run, again showing how the loss values decrease. The minimum value of 1.95 is reached at epoch number 35, and the BLEU value reaches 0.42.
Figure 3 shows the result after training the model longer than the first time; the result still captures the main object in the image, here a standing man in a narrow passage. Running more epochs helps to reduce the loss values, and the minimum is obtained at epoch number 35. Still, the result is not 100% accurate yet; the aim is to get an accurate and meaningful description of the content of the target image.
Figure 4: Prediction result after running 35 epochs.
Figure 4 shows the caption generated using the trained model saved at epoch number 35. The model can capture more details in the image.
In Figure 5, the model starts distinguishing and capturing more details in the image more accurately.
As a final trial, 50 epochs are run and the minimum loss value stabilizes at 1.72. Since each epoch produces a copy of the trained model, each copy is used at prediction time to show which is more accurate.
Figure 6: Prediction result using model in epoch # 44.
Figure 6 shows how the model's performance improves after running it for a longer time; it starts producing descriptive captions for the target image. Using the trained model saved at epoch number 44, the result is descriptive. With the trained model at epoch number 49, the result is less representative, even though this epoch has the minimum loss value.
Figures 7 and 8 show how the trained model saved at the final epoch misinterprets some features in the image and produces inaccurate captions for the target images, but it still focuses on the main object in them.
After all these trials, it is noticed that during training the model focuses on the main content or main part of the image. For example, in all dog photos the model focuses on the dog, regardless of its color or whether the dog is running, walking, or playing.
Table 3 shows the loss values obtained after running 30 epochs using the testing dataset, which consists of 1000 previously unseen images. It is clear that the results are better compared with the loss values on the training dataset. The minimum loss obtained is 0.80 at epoch number 29.
Figure 9: Prediction result using the testing dataset. Figure 10: Prediction result using the testing dataset.
Figures 9 and 10 show prediction results on the testing dataset. The predicted image captions, in line with the loss values, are accurate and describe the images' content well.
This work focuses on image caption generation using the Arabic language. Dealing with Arabic is very challenging; it differs from the English language and needs different preprocessing steps, such as removing punctuation and diacritics, normalizing the hamza, removing repeated characters, removing foreign characters, and removing one-character words. Also, Arabic light stemming is used, which takes Arabic text and performs stemming steps such as removing length-three and length-two prefixes and suffixes, removing the connective 'و' if it precedes a word beginning with 'و', and normalizing the initial hamza to a bare alif. After preparing the text part of the dataset, it is saved in a .txt file to be used later in the training and prediction steps. On the other hand, the images also need preprocessing: all image features are extracted and saved in a .pkl file, which takes a long time to complete for all images. After that, the images and descriptions are ready to be used to train the model; each epoch produces a copy of the trained model and the loss value is computed. The decrease of the loss value in the final epochs is very slow, becoming steady at 1.72 in epoch number 49. Running each epoch takes an average of ten minutes. The final results after running 50 epochs indicate that the model at first focuses on the main content or main part of the image. For example, in all dog photos the model focuses on the dog, regardless of its color or whether the dog is running, walking, or playing.
Figure 11: Prediction result. Figure 11: Prediction result.
After running 50 epochs, with each epoch producing a copy of the trained model, some of these copies are used to test the model's performance; copies from epoch number 35 onward are each tried at testing time. Figures 10 and 11 show descriptive captions for the target images: the model can produce captions that describe the dogs and what they do in each image.
The model's performance is not stable across all target images, and it is not always able to produce accurate captions even though it is trained for a long time. Figures 12 and 13 show how the model generates inaccurate captions to describe the images.
Figure 14: Prediction result. Figure 15: Prediction result.
Figures 14 and 15 show accurate and representative generated captions for the target images.
In contrast, Figures 16 and 17 show captions generated at testing time that are not 100% accurate for the target images.
Figure 18: Prediction result. Figure 19: Prediction result.
Figures 18 and 19 are additional examples showing the model's performance at testing time and the captions generated for the target images.
Figure 20: Prediction result on testing data. Figure 21: Prediction result on testing data.
Figures 20 and 21 result from running two saved models on the testing dataset: the model from epoch number 28, whose result is shown in Figure 21, and the model from epoch number 20, whose result is shown in Figure 20. It is noticeable that the model was confused about the location or the nature of the terrain in these figures.
Figure 22: Prediction result on testing data. Figure 23: Prediction result on testing data.
Figures 22 and 23 are other examples of caption predictions on the testing dataset that are not 100% accurate, but the model still distinguishes the main content of the image correctly.
Figure 24: Failed prediction result. Figure 25: Failed prediction result.
Figures 24, 25, and 26 are examples of failed predictions the model makes on the previously unseen testing dataset. However, the predictions are not completely wrong: the model is able to distinguish the dogs in Figures 24 and 25. In Figure 26 it confuses the young boy with a young girl, but it can distinguish that he wears blue swimwear.
Most research on Arabic image caption generation uses a combination of two datasets, MS COCO and a translated English Flickr dataset, rather than an Arabic version of the dataset [6]. These works also rely on reporting loss values and BLEU scores, in addition to showing the predicted captions alongside the images. The model provided in [6] uses an RNN-LSTM-based language model with a CNN and achieves a BLEU score of 46%, compared with 42% for this work.
5.2. Limitations:
6. Conclusion:
The image caption generator is one of the applications of deep learning; it combines computer vision and natural language processing concepts. It merges CNN-RNN architectures, in which the CNN is used to extract image features and the RNN is used to deal with the text part of the dataset; an LSTM is used to generate descriptions for the target images. There are many works on image caption generation for the English language. The Arabic language is challenging to deal with because it needs special preprocessing steps that differ from English. Arabic has many complicated features, such as writing from right to left, letters that are not pronounced in many other languages, and, compared to English, mostly connected word forms.
This work focuses on generating Arabic captions for images. ResNet50, pre-trained on ImageNet, is used for this task; this network extracts the features from the images, which are saved in a .pkl file to be used later in training the model. An LSTM is used to deal with the text part of the dataset, after preprocessing steps that make the text captions ready and easier for the model to handle. In the training step, 50 epochs are run; each epoch produces a copy of the trained model, the loss value is computed, and the model is saved in a .h5 file to be used for evaluating performance on the dataset. On average, running each epoch takes ten minutes using a GPU. The Flickr8K dataset is used in this work, divided into a training set with 6000 images, a validation set with 1000 images, and a test set with 1000 images. Each image has three different captions, which helps the model to train well. The final results indicate that the model focuses on the main content or main part of the image and then starts describing the other content: what does the dog do (run, play, catch the ball), what does the man wear, and what is its color? The minimum loss value in the training phase is 1.72 at epoch number 50, and the minimum loss value in the testing phase is 0.80 at epoch number 29.
Author name:
Batool M. Alazzam.
References
[1] M. Bhikadiya, "Automatic Image Captioning Using Deep Learning," Medium, 2020.
[2] O. Davydova, "Text Preprocessing in Python: Steps, Tools, and Examples," Data
Monsters, 2018.
[3] J. Brownlee, "A Gentle Introduction to Calculating the BLEU Score for Text in Python," Machine Learning Mastery, 2017.
[4] Xinxin Zhu, Lixiang Li, Jing Liu, Ziyi Li, Haipeng Peng, Xinxin Niu, "Image captioning with Triple-Attention and Stack Parallel LSTM," Elsevier Neurocomputing, p. 16, 2018.
[5] Yan Chu, Xiao Yue, Lei Yu, Mikhailov Sergei, and Zhengkui Wang, "Automatic Image Captioning Based on ResNet50 and LSTM with Soft Attention," Hindawi Wireless Communications and Mobile Computing, p. 7, 2020.
[6] Huda A. Al-muzaini, Tasniem N. Al-yahya, Hafida Benhidour, "Automatic Arabic Image Captioning using RNN-LSTM-Based Language Model and CNN," (IJACSA) International Journal of Advanced Computer Science and Applications, p. 7, 2018.
[7] Philip Kinghorn, Li Zhang, Ling Shao, "A region-based image caption generator with refined descriptions," Elsevier, p. 9, 2017.
[8] Ying Hua Tan, Chee Seng Chan, "Phrase-based image caption generator with hierarchical LSTM network," Elsevier, p. 15, 2018.
[9] Pranay Mathur, Aman Gill, Aayush Yadav, Anurag Mishra, and Nand Kumar Bansode, "Camera2Caption: A Real-Time Image Caption Generator," ICCIDS, p. 6, 2017.
[10] Aghasi Poghosyan, Hakob Sarukhanyan, "Long Short-Term Memory with Read-only Unit in Neural Image Caption Generator," in IEEE, 2018.
[11] Mirza Muhammad Ali Baig, Mian Ihtisham Shah, Muhammad Abdullah Wajahat, Nauman Zafar, and Omar Arif, "Image Caption Generator with Novel Object Injection," in IEEE, 2019.
[13] Neeraj Gupta, Anand Singh Jalal, "Integration of textual cues for fine-grained image captioning using deep CNN and LSTM," in Springer Link, 2019.
[14] Shuang Liu, Liang Bai, Yanli Hu, Haoran Wang, "Image Caption Generator Based on Deep Neural Networks," in MATEC Web of Conferences, 2018.
[15] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, "Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge," IEEE, p. 12, 2016.
[16] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille, "Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)," ICLR 2015, p. 17, 2015.
[18] Mohammed N. A. Ali, Guanzheng Tan, and Aamir Hussain, "Bidirectional Recurrent Neural Network Approach for Arabic Named Entity Recognition," MDPI, p. 12, 2018.
[19] Adnan Souri, Mohammed Al Achhab, Badr Eddine Elmohajir, Abdelali Zbakh, "Neural network dealing with Arabic language," International Journal of Informatics and Communication Technology, p. 10, 2020.
[20] Mohammad AL-Smadi, Omar Qawasmeh, Mahmoud Al-Ayyoub, Yaser Jararweh, Brij Gupta, "Deep Recurrent Neural Network vs. Support Vector Machine for Aspect-Based Sentiment Analysis of Arabic Hotels' Reviews," Journal of Computational Science, p. 19, 2017.
[22] Songtao Ding, Shiru Qu, Yuling Xi, Shaohua Wan, "Stimulus-driven and concept-driven analysis for image caption generation," Elsevier, p. 11, 2019.
[23] Wenqiao Zhang, Siliang Tang, Jiajie Su, Jun Xiao, Yueting Zhuang, "Tell and guess: cooperative learning for natural image caption generation with hierarchical refined attention," Springer, p. 16, 2020.
[24] Li Yao; Atousa Torabi; Kyunghyun Cho; Nicolas Ballas; Christopher Pal; Hugo Larochelle;
Aaron Courville, "Describing Videos by Exploiting Temporal Structure," in IEEE, 2015.
[25] Xianhua Zeng∗, Li Wen, Banggui Liu, Xiaojun Qi, "Deep learning for ultrasound image
caption generation based on object detection," ELSEVIER, p. 10, 2019.
[26] "A NEW CNNRNN FRAMEWORK FOR REMOTE SENSING IMAGE CAPTIONING," in IEEE ,
2020.
[27] Saloni Kalra & Alka Leekha, "Survey of convolutional neural networks for image
captioning," Journal of Information and Optimization Sciences, p. 23, 2020.
[30] Huda A. Al-muzaini, Tasniem N. Al-yahya, Hafida Benhidour, "Automatic Arabic Image
Captioning using RNN-LSTM-Based Language Model and CNN," International Journal of
Advanced Computer Science and Applications, p. 7, 2018.