Abstract:
Deep learning is a branch of machine learning based on neural networks. Interest in this field is growing rapidly, and many of its applications perform tasks at a level comparable to humans, such as self-driving cars, medical care, lip reading, and photo description. Photo description, or image captioning, is a deep learning application in which a set of images is passed to a model that processes them to generate captions or descriptions of the input images. This process needs high computational resources and a large dataset to be trained well, in order to decrease the probability of generating meaningless sentences unrelated to the images. There are many more works on this topic for English than for Arabic, because Arabic has many complicated features to deal with, such as writing from right to left, letters that are not pronounced in many other languages, and, compared to English, mostly connected word forms. In addition, as many as half a billion people speak Arabic. In this work, Arabic image captioning with Keras is the core topic. The Flickr8k dataset is used, in which each image is associated with three different Arabic captions describing the entities and events in the image. The Flickr8k Arabic dataset has 8091 images, each with three different Arabic descriptions, together with training and testing image ids; most existing works use the English Flickr8k dataset. A CNN model is used as the image model and an RNN-LSTM as the language model. The minimum loss value achieved after running 50 epochs is 1.75, with very good captions describing the image components, and the BLEU score is 42%.
Keywords: Image captions generator, Arabic text, LSTM, ResNet50, Flickr8k, BLEU,
Arabic light stemming, CNN, RNN.
1. Introduction:
An image caption generator is an application of deep learning. It generates a caption or description for an image: an image is passed to a trained model, which processes it and produces a caption or description for it. New research appears in this field continually, and it is considered a challenging topic, since descriptive text must be generated for a given image. It needs methods from computer vision to deal with and understand the image content, and a language model from natural language processing to turn the image understanding into words in the right way. The compositionality and nature of both the language and the visual input play a big role in how the model is trained. This type of model needs to be trained on the co-occurrence of objects in the dataset context and must be able to generalize. The human brain can easily understand image content and tell what it shows, but for a computer this is not as easy as it is for a human. Therefore, there is a need to build models that facilitate this process on computers. The power of deep learning methods has achieved state-of-the-art results on image caption generation, where a single model can predict a caption given a photo [1]. Text preprocessing is the most important step when dealing with text. It is needed to convert text data that humans can understand easily into a format that the machine can also understand. Each language needs a specific way of handling, and the text preprocessing steps are decided according to the task. For example, the Arabic language differs from other languages: its letters are connected, there are no capital and small letters, and it uses diacritics. Arabic text needs preprocessing steps such as removing diacritics, removing prefixes and suffixes, and removing the connective letter, which differ from English text preprocessing, which needs steps such as changing capital letters to small letters [2].
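For illustration, a minimal Python sketch of such Arabic-specific cleaning is shown below. The regular expressions and the exact set of rules are assumptions made for the example, not the precise pipeline used in this work.

```python
# A minimal sketch of Arabic text cleaning: drop diacritics, normalize alif/hamza
# variants, and remove foreign characters. The exact rules are illustrative only.
import re

ARABIC_DIACRITICS = re.compile(r'[\u064B-\u0652\u0640]')  # tashkeel marks and tatweel

def clean_arabic(text: str) -> str:
    text = ARABIC_DIACRITICS.sub('', text)            # remove diacritics and tatweel
    text = re.sub(r'[إأآ]', 'ا', text)                 # normalize alif/hamza variants to bare alif
    text = re.sub(r'[^\u0600-\u06FF\s]', ' ', text)    # drop non-Arabic characters and punctuation
    return re.sub(r'\s+', ' ', text).strip()           # collapse whitespace
```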
An image caption generator consists of a CNN to extract spatial information from the images, an RNN to generate the sequence of words, and an LSTM to remember long word sequences. It allows employing techniques from computer vision and natural language processing to extract overall textual information about the target images. Image captioning has many applications, such as information retrieval, assistance for visually impaired people, social media, and so on. Despite the lack of data sources, there are great accomplished works for English, whereas progress on Arabic image captioning is still slow. Works in Arabic remain limited because of the shortage of available datasets and the complex nature of the language. The Arabic language is very rich and needs specialists who know how to deal with it.
This work aims to build a model for image caption generation that deals with Arabic captions, using a developed dataset based on Flickr8K that has 8000 images with three different captions for each image. The dataset is divided into a training set with 6000 images, a validation set with 1000 images, and a testing set with 1000 images. The work has a preprocessing step to prepare the images and text for use. For image preparation there are many models that could be used, but in this case ResNet50, which Keras provides directly, is used. The network is pre-trained on ImageNet and allows deep neural networks with 150+ layers to be trained successfully. Briefly, image features are extracted, saved to a file, and later fed to the model. This step loads each image, prepares it for ResNet50, and collects the extracted features. The result is saved in a .pkl file, and the images are thereby also ready for the testing model. For the text part of the dataset, each image has three different captions, and this part needs cleaning, stemming, and normalization. Light stemming for Arabic text is used, and the final preprocessed text is saved in a text file.
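The light stemming rules described later in this paper (removing length-three and length-two prefixes and suffixes, removing the connective 'و', and normalizing the initial hamza) could be sketched as follows; the affix lists here are assumed for illustration and are not the exact lists used.

```python
# Illustrative Arabic light stemmer following the steps described in the text.
# The prefix and suffix lists are assumptions, not the paper's exact lists.
PREFIXES = ['وال', 'بال', 'كال', 'فال', 'ال', 'لل']                # length-3 then length-2 prefixes
SUFFIXES = ['هما', 'كما', 'تها', 'ها', 'ان', 'ات', 'ون', 'ين', 'ية']  # common suffixes

def light_stem(word: str) -> str:
    if word.startswith('وو'):                 # remove the connective waw before a word starting with waw
        word = word[1:]
    if word and word[0] in 'أإآ':              # normalize an initial hamza form to a bare alif
        word = 'ا' + word[1:]
    for p in PREFIXES:                        # strip at most one prefix, keeping at least two letters
        if word.startswith(p) and len(word) - len(p) >= 2:
            word = word[len(p):]
            break
    for s in SUFFIXES:                        # strip at most one suffix, keeping at least two letters
        if word.endswith(s) and len(word) - len(s) >= 2:
            word = word[:-len(s)]
            break
    return word
```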
The deep learning model definition is divided into three parts: 1. Photo feature extraction, for which the pre-trained ResNet50 architecture is used; the output of this step is the set of features extracted from the image dataset. 2. A sequence processor, consisting of a word embedding followed by an LSTM recurrent layer. 3. The outputs from both previous parts are combined and processed by a Dense layer to make the final prediction. At each epoch, a copy of the trained model is saved in a .h5 file; the running time depends on the hardware used, and on average each epoch takes about ten minutes on a GPU. After that, the model fitting and evaluation steps are carried out using the validation dataset, which helps to avoid overfitting, a problem caused by the model quickly learning and overfitting the training dataset. BLEU, the Bilingual Evaluation Understudy, scores are used to evaluate translated text against one or more reference translations [3]. In this case, BLEU scores are used to evaluate the generated text.
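As a sketch of how this evaluation can be computed, NLTK's corpus_bleu can score the generated captions against the reference captions; the function and its inputs below are illustrative rather than the exact evaluation code of this work.

```python
# BLEU evaluation sketch with NLTK: each image contributes its reference captions
# (as lists of tokens) and one generated caption (a list of tokens).
from nltk.translate.bleu_score import corpus_bleu

def bleu_scores(references, hypotheses):
    # references: one list of reference token lists per image; hypotheses: aligned generated token lists
    return {
        'BLEU-1': corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0)),
        'BLEU-2': corpus_bleu(references, hypotheses, weights=(0.5, 0.5, 0, 0)),
    }
```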
The rest of the paper is organized as follows. In the next section, the literature review is presented. Then, the methods and materials used in this paper are presented, followed by the experiments and results and a discussion of the findings. Finally, the paper is concluded.
2. Literature Review:
In 2018, Xinxin Zhu et al. proposed two improvements. The first, named the triple attention (TA-LSTM) image captioning method, applies attention at the input stage of the hidden unit of the LSTM and at the output stage of each hidden unit. At the input stage, they use a text-conditional embedding method, which uses previous text information and combines it with the image information. Stacked and parallel LSTMs are also used to improve performance. The parallel LSTM catches information from the input context; each LSTM has identical parameters, so the input context can be represented with variant information. Finally, an average pooling layer is used to combine all LSTM outputs [4]. In 2020, Yan Chu et al. provided a joint model for image caption generation (AICRL) based on ResNet50, LSTM, and soft attention. The model has one encoder and one decoder. The encoder applies a ResNet50-based CNN to provide a representation of the given image dataset by embedding the images into a specific vector. The decoder adopts an LSTM with a soft attention mechanism to selectively focus on specific parts of the image to predict the text. Different metrics were used for evaluation, such as BLEU, METEOR, and CIDEr [5]. In 2018, Huda A. Al-muzaini et al. aimed to develop a model for image caption generation that describes images in the Arabic language. The model consists of two subnetworks, an RNN for text and a CNN for images, which interact with each other to generate correct and meaningful descriptions for the given images. The COCO and Flickr datasets with English descriptions were used. At first, they translated the English descriptions into Arabic using Google Translate. Then, 3427 images with a vocabulary size of 9854 were fed to the model. For the encoder, a 16-layer Visual Geometry Group (VGG) CNN is adopted [6]. In 2017, Philip Kinghorn et al. used a model based on deep learning for image description generation. Two RNNs, an attribute prediction network and an encoder-decoder, were embedded together to give descriptions for the given images. The proposed system focused on the image regions containing people and objects, and the IAPR TC-12 dataset was used. The R-CNN is a CNN with extra outputs that predict bounding box coordinates, and it was trained using the ILSVRC13 dataset [7]. In 2018, Ying Hua Tan et al. used the phrase-based Long Short-Term Memory (phi-LSTM) architecture to produce image descriptions. Its role is to decode captions from phrase to sentence: the phrase decoder decodes noun phrases of variable length, together with an abbreviated sentence decoder. The image caption is generated by aggregating the phrases with the sentence during the inference step. The Flickr8k, Flickr30k, and MS-COCO datasets were used [8]. In 2017, Pranay Mathur et al. presented a simple encoder-decoder-based implementation with some modifications and improvements to function in a real-time environment, which makes it possible to run the models on low-end hardware and personal devices. TensorFlow and an Android application were used, and the MSCOCO dataset was applied in this work [9].
In 2018, Aghasi Poghosyan et al. provided a model that can generate image captions automatically using an RNN with a modified LSTM cell that has an additional gate for image features. These adjustments help to produce more accurate image captions. The MSCOCO dataset was used in this work [10]. In 2019, Mirza Muhammad Ali Baig et al. addressed a problem in image captioning tasks, namely that there are not enough data sources to build highly accurate image captioning models. Their work offers a solution in the form of novel object injection: the model uses a pre-trained caption producer and works on its output to insert objects that are not present in the dataset into the caption. The BLEU, CIDEr, and ROUGE-L normalized metrics were used for evaluating the work [11]. Another work uses a Regional Object Detector (RODe) for detection, recognition, and caption generation for the given image dataset. Object detection is done using an R-CNN, feature extraction from images is done using a CNN, an attribute-creation step defines the string attributes, and the fourth step is an encoder-decoder for string labels using an RNN [12]. In 2019, Neeraj Gupta et al. observed that most models rely on the images to describe the context more than on the text or caption embedded with the image, so feature extraction mostly depends on the image itself; their work aims to use images together with text captions to describe the content of the image [13]. Another work proposed an analysis of three components: the CNN, the RNN, and sentence generation. The authors found that the VGG network works better, as shown in the BLEU score results. Also, a new recurrent layer was provided as a simplified version (GRU), implemented using C++ and MATLAB. The result is close to that obtained using an LSTM, but it has fewer parameters, which saves memory and makes the learning process faster [14].
Another work provides a model for image caption generation that can generate natural sentences describing a given image, based on deep learning and RNNs. The provided model is trained to maximize the likelihood of the correct sentence describing the given training image, and it performs well both qualitatively and quantitatively [15]. A multimodal RNN model is provided in another work that can generate captions for a given image. The captions are generated based on the probability distribution of generating a word depending on the previous words. The model has two networks: an RNN for dealing with text and a CNN for dealing with images. The provided model was also used for retrieval tasks (images or sentences) and achieved very good performance [16]. Another work aims to study the RNN and LSTM, their equations, and their derivatives. The authors explain the difficulties of training RNNs, especially the vanishing and exploding gradient problems. Instead, they replace the RNN with the Vanilla LSTM network and explain the forward pass and the backward pass. The core contribution is their own approach to analyzing the RNN and Vanilla LSTM from a signal processing perspective [17]. A bidirectional LSTM model is provided to deal with Arabic text; the LSTM network is able to process and connect each part, which makes it suitable for the NER task. A pre-trained model is used to train the input that enters the LSTM network, and no feature engineering or preprocessing is used. The LSTM is more helpful for solving Arabic NER than other methods [18]. Neural networks are used to deal with Arabic text, to correct the language modeling, generate text, and predict missing words in the text. The authors aim to adapt an RNN to the Arabic language model to produce correct Arabic sequences, and a CNN architecture is used to predict the missing text in Arabic documents [19]. An RNN and an SVM are used in another work to analyze Arabic hotel reviews. The provided approach is trained with lexical, word, syntactic, morphological, and semantic features. The SVM shows better performance in category identification, target expression extraction, and polarity identification; on the other hand, the RNN is faster in execution time [20]. A transfer learning model is provided that aims to detect whether given written Arabic text is written by humans or generated automatically. A set of tweets is used as the dataset, and GPT2-Small-Arabic is used to generate fake Arabic sentences. LSTM, BI-LSTM, GRU, and BI-GRU RNN word-embedding-based baseline models are used for evaluation. The contribution of this work is that ARABERT and GPT2 are combined for detecting and classifying Arabic text for the first time [21]. In 2019, Songtao Ding et al. noted that although image caption generation tasks in deep learning achieve great results, there are still problems facing this task, such as obtaining accurate results for high-level visual tasks and for images that have multiple targets. Their work proposes two mechanisms to solve these issues, stimulus-driven and concept-driven, and the model depends on the integration of a CNN to deal with images and an LSTM to deal with sentences [22]. In 2020, Wenqiao Zhang et al. proposed a method for image caption generation tasks. The method combines the ICM and IRM into Cooperative Learning, which allows sharing common knowledge to translate the heterogeneous information. The Hierarchical Refined Attention (HRA) has the role of cleaning the visual information from the image; it can attend sequentially to the image features and semantic attributes and then merge them to get a better caption for the images [23].
In 2019, Xianhua Zeng et al. noted that the use of deep learning for image caption generation in the medical field is still lacking, especially for analyzing and describing information obtained from ultrasound image understanding. Their work provides a method to detect and encode spatial areas in ultrasound images, and an LSTM is used for decoding and producing observation text that describes the disease content in the images [25]. There are two approaches to image caption generation. The first is a retrieval approach that depends on recovering information to produce a description of the image content. The second is a generation approach using the encoder-decoder framework. A CNN-RNN is combined with a beam search whose role is to generate multiple captions for the same image, and the best caption is selected depending on its lexical similarity with the reference [26]. A CNN helps to understand the word embeddings and gives a better history representation when dealing with small datasets; it is better in classification accuracy and at predicting high-level words, and the predicted captions are more human-like [27]. Image caption generation methods can also be used as building blocks to produce video captions, so that a video caption is essentially a summarization of image captions [28].
3.1. Dataset:
The dataset used is developed based on Flickr8K and has 8000 images with Arabic descriptions, three different captions for each image. The dataset is split into a training set with 6000 images, a validation set with 1000 images, and a test set with 1000 images, provided as .txt files consisting of two columns: the image name (.jpg) and the text captions.
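A simple loading sketch for such a caption file is shown below, assuming one "image_name.jpg caption" pair per line; the actual file layout may differ slightly.

```python
# Load Arabic captions keyed by image file name from a whitespace-separated .txt file.
from collections import defaultdict

def load_captions(path):
    captions = defaultdict(list)
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_name, caption = line.split(maxsplit=1)  # first token is the image file name
            captions[image_name].append(caption)          # up to three captions per image
    return captions
```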
In this work, the pre-trained ResNet50 model is used to deal with the images; Keras provides this model and it can be used directly. Image features are pre-computed, saved to a file as a description of the image dataset, and later fed to the model. This optimization needs less time and consumes less memory. Keras.preprocessing.image is used to load the images and preprocess and reshape their size. The loaded images are passed to ResNet50 using keras.applications.resnet50. Finally, the extracted features are saved in a .pkl file created by pickle, the Python module, which contains a byte stream representing the object.
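A minimal sketch of this feature-extraction step with Keras' pre-trained ResNet50 is shown below; the directory and file names are placeholders.

```python
# Extract 2048-dimensional ResNet50 features for every image and pickle them.
import os, pickle
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input
from tensorflow.keras.preprocessing import image

feature_model = ResNet50(weights='imagenet', include_top=False, pooling='avg')  # 2048-d output

def extract_features(image_dir):
    features = {}
    for name in os.listdir(image_dir):
        img = image.load_img(os.path.join(image_dir, name), target_size=(224, 224))
        x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
        features[name] = feature_model.predict(x, verbose=0)[0]
    return features

with open('features.pkl', 'wb') as f:
    pickle.dump(extract_features('Flickr8k_images'), f)   # directory name is a placeholder
```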
First of all, the feature extractor is a ResNet50 model pre-trained on the ImageNet dataset; its output features are used as an input vector of 2048 elements. A Dense layer processes these features to produce a 256-element representation of the photo. Then, the sequence processor is a word embedding that deals with the text part, followed by an LSTM recurrent neural network layer with 256 memory units. It expects input sequences with a maximum length of 26 words. The outputs of both previous parts are merged and processed by the Dense layers of the decoder to make the prediction. A 20% dropout is used as a form of regularization, the Adam optimizer is used, and the learning rate is 0.01. The model is trained on the training dataset, which consists of 6000 images; the vocabulary size is 9090 and the maximum sequence length is 26 words. The activation function used is ReLU for all layers except the final layer, which uses softmax to normalize the output; the softmax layer converts the output into probabilities.
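A sketch of this merge-style architecture in Keras is given below, using the reported 2048-element feature input, 256-unit layers, 20% dropout, 26-word maximum length, 9090-word vocabulary, and Adam with a 0.01 learning rate; the embedding dimension and the loss function are assumptions rather than values reported in the text.

```python
# Merge-model sketch: image-feature branch + caption-sequence branch -> next-word softmax.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

vocab_size, max_length = 9090, 26

inputs1 = Input(shape=(2048,))                      # pre-computed ResNet50 features
fe1 = Dropout(0.2)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

inputs2 = Input(shape=(max_length,))                # partial caption as word indices
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)   # embedding size assumed
se2 = Dropout(0.2)(se1)
se3 = LSTM(256)(se2)

decoder1 = add([fe2, se3])                          # combine both branches
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)  # next-word probabilities

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.01))
```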
There is a big chance that the model will overfit the training dataset; for this reason, the validation dataset is used to monitor the training model's performance. At the end of each epoch, if the model's performance has improved, the model is saved to a file. ModelCheckpoint in Keras is used to monitor the model's loss. Initially, ten epochs are used to fit the model; then, the number of epochs is increased until the minimum loss value is reached. Running the epochs takes a long time, approximately ten minutes per epoch using modern hardware such as a GPU or TPU hardware accelerator.
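A sketch of the fitting step with a ModelCheckpoint callback is shown below; the generator and step-count variables are hypothetical names standing in for the data-preparation code.

```python
# Save a copy of the model each epoch while monitoring validation loss.
from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint(
    'model-ep{epoch:03d}-val_loss{val_loss:.3f}.h5',
    monitor='val_loss', save_best_only=False, mode='min', verbose=1)

model.fit(train_generator,                  # yields ([image_features, input_sequence], next_word)
          epochs=50,
          steps_per_epoch=train_steps,
          validation_data=val_generator,
          validation_steps=val_steps,
          callbacks=[checkpoint])
```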
Figure 1: Image captioning DL model.
Figure 1 shows the architecture of the model used in this work for image caption generation.
BLEU scores are used to evaluate the generated captions against one or more reference sentences [29]. The output range is between 0 and 1, where values closer to 1 are better.
4.2. Results:
Training the model starts with running ten epochs and calculating the loss value after each epoch; the last epoch has the minimum loss value. The ten epochs take 84 minutes in total, and the BLEU result is 0.38.
Figure 2 shows the model's result after training it for ten epochs and computing the loss value. The caption output indicates that the model is able to distinguish the main component or object in the image. After running ten epochs, the best model is saved and used for prediction; the result is not 100% accurate.
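Prediction can be sketched as a greedy decoding loop over the trained model, assuming 'startseq'/'endseq' sentinel tokens and a fitted Keras tokenizer; these names are illustrative rather than the exact ones used in this work.

```python
# Greedy caption generation: repeatedly predict the most likely next word.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_feature, max_length=26):
    caption = 'startseq'
    for _ in range(max_length):
        seq = pad_sequences(tokenizer.texts_to_sequences([caption]), maxlen=max_length)
        yhat = np.argmax(model.predict([photo_feature.reshape(1, -1), seq], verbose=0))
        word = tokenizer.index_word.get(int(yhat))
        if word is None or word == 'endseq':
            break
        caption += ' ' + word
    return caption.replace('startseq', '').strip()
```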
Epoch    Loss value
#5       3.37
#6       3.19
#7       3.05
#8       2.90
#9       2.77
#10      2.67
#11      2.57
#12      2.50
#13      2.42
#14      2.36
#15      2.30
#16      2.26
#17      2.17
#18      2.13
#19      2.10
#20      2.00
It is noticeable that the loss values decrease after running more epochs, but they still need to be reduced further to get more accurate prediction results at testing time. At epoch number 20, the loss value reaches 2.00.
In another trial, 35 epochs are run, again showing how the loss values decrease. The minimum value of 1.95 is reached at epoch number 35, and the BLEU value reaches 0.42.
Figure 3 shows the result after training the model longer than the first time; the result still captures the main object in the image, here a standing man in a narrow passage. Running more epochs helps to reduce the loss values, and the minimum is obtained at epoch number 35. Still, the result is not 100% accurate yet; the aim is to get an accurate and meaningful description of the content of the target image.
Figure 4: Prediction result after running 35 epochs.
Figure 4 shows the caption generated using the trained model saved at epoch number 35. The model can capture more details in the image.
In Figure 5, the model starts distinguishing and capturing more details in the image more accurately.
As a final trial, 50 epochs are run and the minimum loss value stabilizes at 1.72. Since each epoch produces a copy of the trained model, each copy is used at prediction time to show which is more accurate.
Figure 6: Prediction result using model in epoch # 44.
Figure 6 shows how the model's performance improves after running it for a longer time; it starts producing descriptive captions for the target image. Using the trained model saved at epoch number 44, the result is descriptive. With the trained model at epoch number 49, the result is less representative, even though this epoch has the minimum loss value.
Figures 7 and 8 show how the trained model saved at the final epoch misinterprets some features in the image and produces inaccurate captions for the target images, but it still focuses on the main object in them.
After all these trials, it is noticed that during training the model focuses on the main content or main part of the image. For example, in all dog photos the model focuses on the dog, regardless of its color or whether the dog is running, walking, or playing.
Table 3 shows the loss values obtained after running 30 epochs using the testing dataset, which consists of 1000 previously unseen images. It is clear that the results are better compared with the loss values on the training dataset. The minimum loss obtained is 0.80 at epoch number 29.
Figure 9: Prediction result using the testing dataset. Figure 10: Prediction result using the testing dataset.
Figures 9 and 10 show prediction results on the testing dataset. The predicted image captions, in line with the loss values, are accurate and describe the images' content well.
This work focuses on image caption generation using the Arabic language. Dealing with Arabic is very challenging; it differs from the English language and needs different preprocessing steps, such as removing punctuation and diacritics, normalizing the hamza, removing repeated characters, removing foreign characters, and removing one-character words. Also, Arabic light stemming is used, which takes Arabic text and performs stemming steps such as removing length-three and length-two prefixes and suffixes, removing the connective 'و' if it precedes a word beginning with 'و', and normalizing the initial hamza to a bare alif. After preparing the text part of the dataset, it is saved in a .txt file to be used later in the training and prediction steps. On the other hand, the images also need preprocessing: all image features are extracted and saved in a .pkl file, which takes a long time to complete for all images. After that, the images and descriptions are ready to be used to train the model; each epoch produces a copy of the trained model and the loss value is computed. The decrease of the loss value in the final epochs is very slow, becoming steady at 1.72 in epoch number 49. Running each epoch takes an average of ten minutes. The final results after running 50 epochs indicate that the model at first focuses on the main content or main part of the image. For example, in all dog photos the model focuses on the dog, regardless of its color or whether the dog is running, walking, or playing.
Figure 11: Prediction result. Figure 11: Prediction result.
After running 50 epochs, with each epoch producing a copy of the trained model, some of these copies are used to test the model's performance; copies from epoch number 35 onward are each tried at testing time. Figures 10 and 11 show descriptive captions for the target images: the model can produce captions that describe the dogs and what they do in each image.
The model's performance is not stable across all target images, and it is not always able to produce accurate captions even though it is trained for a long time. Figures 12 and 13 show how the model generates inaccurate captions to describe the images.
Figure 14: Prediction result. Figure 15: Prediction result.
Figures 14 and 15 show accurate and representative generated captions for the target images.
In contrast, Figures 16 and 17 show captions generated at testing time that are not 100% accurate for the target images.
Figure 18: Prediction result. Figure 19: Prediction result.
Figures 18 and 19 are additional examples showing the model's performance at testing time and the captions generated for the target images.
Figure 20: Prediction result on testing data. Figure 21: Prediction result on testing data.
Figures 20 and 21 result from running two saved models on the testing dataset: the model from epoch number 28, whose result is shown in Figure 21, and the model from epoch number 20, whose result is shown in Figure 20. It is noticeable that the model was confused about the location or the nature of the terrain in these figures.
Figure 22: Prediction result on testing data. Figure 23: Prediction result on testing data.
Figures 22 and 23 are other examples of caption predictions on the testing dataset that are not 100% accurate, but the model still distinguishes the main content of the image correctly.
Figure 24: Failed prediction result. Figure 25: Failed prediction result.
Figures 24, 25, and 26 are examples of failed predictions the model makes on the previously unseen testing dataset. However, the predictions are not completely wrong: the model is able to distinguish the dogs in Figures 24 and 25. In Figure 26 it confuses the young boy with a young girl, but it can distinguish that he wears blue swimwear.
Most research on Arabic image caption generation uses a combination of two datasets, MS COCO and a translated English Flickr dataset, rather than an Arabic version of the dataset [6]. These works also rely on reporting loss values and BLEU scores, in addition to showing the predicted captions alongside the images. The model provided in [6] uses an RNN-LSTM-based language model with a CNN and achieves a BLEU score of 46%, compared with 42% for this work.
5.2. Limitations:
6. Conclusion:
The image caption generator is one of the applications of deep learning; it combines computer vision and natural language processing concepts. It merges CNN-RNN architectures, in which the CNN is used to extract image features and the RNN is used to deal with the text part of the dataset; an LSTM is used to generate descriptions for the target images. There are many works on image caption generation for the English language. The Arabic language is challenging to deal with because it needs special preprocessing steps that differ from English. Arabic has many complicated features, such as writing from right to left, letters that are not pronounced in many other languages, and, compared to English, mostly connected word forms.
This work focuses on generating Arabic captions for images. ResNet50, pre-trained on ImageNet, is used for this task; this network extracts the features from the images, which are saved in a .pkl file to be used later in training the model. An LSTM is used to deal with the text part of the dataset, after preprocessing steps that make the text captions ready and easier for the model to handle. In the training step, 50 epochs are run; each epoch produces a copy of the trained model, the loss value is computed, and the model is saved in a .h5 file to be used for evaluating performance on the dataset. On average, running each epoch takes ten minutes using a GPU. The Flickr8K dataset is used in this work, divided into a training set with 6000 images, a validation set with 1000 images, and a test set with 1000 images. Each image has three different captions, which helps the model to train well. The final results indicate that the model focuses on the main content or main part of the image and then starts describing the other content: what does the dog do (run, play, catch the ball), what does the man wear, and what is its color? The minimum loss value in the training phase is 1.72 at epoch number 50, and the minimum loss value in the testing phase is 0.80 at epoch number 29.
Author name:
Batool M. Alazzam.
References
[1] M. Bhikadiya, "Automatic Image Captioning Using Deep Learning," Medium, 2020.
[2] O. Davydova, "Text Preprocessing in Python: Steps, Tools, and Examples," Data
Monsters, 2018.
[3] J. Brownlee, "A Gentle Introduction to Calculating the BLEU Score for Text in Python," Machine Learning Mastery, 2017.
[4] Xinxin Zhu, Lixiang Li, Jing Liu, Ziyi Li, Haipeng Peng, Xinxin Niu, "Image captioning with Triple-Attention and Stack Parallel LSTM," Elsevier Neurocomputing, p. 16, 2018.
[5] Yan Chu, Xiao Yue, Lei Yu, Mikhailov Sergei, and Zhengkui Wang, "Automatic Image Captioning Based on ResNet50 and LSTM with Soft Attention," Hindawi Wireless Communications and Mobile Computing, p. 7, 2020.
[6] Huda A. Al-muzaini, Tasniem N. Al-yahya, Hafida Benhidour, "Automatic Arabic Image Captioning using RNN-LSTM-Based Language Model and CNN," (IJACSA) International Journal of Advanced Computer Science and Applications, p. 7, 2018.
[7] Philip Kinghorn, Li Zhang, Ling Shao, "A region-based image caption generator with refined descriptions," Elsevier, p. 9, 2017.
[8] Ying Hua Tan, Chee Seng Chan, "Phrase-based image caption generator with hierarchical LSTM network," Elsevier, p. 15, 2018.
[9] Pranay Mathur, Aman Gill, Aayush Yadav, Anurag Mishra, and Nand Kumar Bansode, "Camera2Caption: A Real-Time Image Caption Generator," ICCIDS, p. 6, 2017.
[10] Aghasi Poghosyan, Hakob Sarukhanyan, "Long Short-Term Memory with Read-only Unit in Neural Image Caption Generator," in IEEE, 2018.
[11] Mirza Muhammad Ali Baig, Mian Ihtisham Shah, Muhammad Abdullah Wajahat, Nauman Zafar, and Omar Arif, "Image Caption Generator with Novel Object Injection," in IEEE, 2019.
[13] Neeraj Gupta, Anand Singh Jalal, "Integration of textual cues for fine-grained image captioning using deep CNN and LSTM," in Springer Link, 2019.
[14] Shuang Liu, Liang Bai, Yanli Hu, Haoran Wang, "Image Caption Generator Based on Deep Neural Networks," in MATEC Web of Conferences, 2018.
[15] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, "Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge," IEEE, p. 12, 2016.
[16] Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille, "Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)," ICLR 2015, p. 17, 2015.
[18] Mohammed N. A. Ali, Guanzheng Tan, and Aamir Hussain, "Bidirectional Recurrent Neural Network Approach for Arabic Named Entity Recognition," MDPI, p. 12, 2018.
[19] Adnan Souri, Mohammed Al Achhab, Badr Eddine Elmohajir, Abdelali Zbakh, "Neural network dealing with Arabic language," International Journal of Informatics and Communication Technology, p. 10, 2020.
[20] Mohammad AL-Smadi, Omar Qawasmeh, Mahmoud Al-Ayyoub, Yaser Jararweh, Brij Gupta, "Deep Recurrent Neural Network vs. Support Vector Machine for Aspect-Based Sentiment Analysis of Arabic Hotels' Reviews," Journal of Computational Science, p. 19, 2017.
[22] Songtao Ding, Shiru Qu, Yuling Xi, Shaohua Wan, "Stimulus-driven and concept-driven analysis for image caption generation," Elsevier, p. 11, 2019.
[23] Wenqiao Zhang, Siliang Tang, Jiajie Su, Jun Xiao, Yueting Zhuang, "Tell and guess: cooperative learning for natural image caption generation with hierarchical refined attention," Springer, p. 16, 2020.
[24] Li Yao; Atousa Torabi; Kyunghyun Cho; Nicolas Ballas; Christopher Pal; Hugo Larochelle;
Aaron Courville, "Describing Videos by Exploiting Temporal Structure," in IEEE, 2015.
[25] Xianhua Zeng∗, Li Wen, Banggui Liu, Xiaojun Qi, "Deep learning for ultrasound image
caption generation based on object detection," ELSEVIER, p. 10, 2019.
[26] "A NEW CNNRNN FRAMEWORK FOR REMOTE SENSING IMAGE CAPTIONING," in IEEE ,
2020.
[27] Saloni Kalra & Alka Leekha, "Survey of convolutional neural networks for image
captioning," Journal of Information and Optimization Sciences, p. 23, 2020.
[30] Huda A. Al-muzaini, Tasniem N. Al-yahya, Hafida Benhidour, "Automatic Arabic Image
Captioning using RNN-LSTM-Based Language Model and CNN," International Journal of
Advanced Computer Science and Applications, p. 7, 2018.