
Image Captioning via a Hierarchical Attention Mechanism and Policy Gradient Optimization

Shiyang Yan, Yuan Xie, Fangyu Wu, Jeremy S. Smith, Member, IEEE, Wenjin Lu and Bailing Zhang

Shiyang Yan, Yuan Xie and Bailing Zhang are with the Institute of Advanced Artificial Intelligence in Nanjing, Nanjing, China. Jeremy S. Smith is with the University of Liverpool. Fangyu Wu and Wenjin Lu are with the Department of Computer Science and Software Engineering, Xi'an Jiaotong-Liverpool University, Suzhou, China.

Abstract—Automatically generating the description of an image, i.e., image captioning, is an important and fundamental topic in artificial intelligence that bridges the gap between computer vision and natural language processing. Building on successful deep learning models, especially the CNN and Long Short-Term Memory (LSTM) networks with attention mechanisms, we propose a hierarchical attention model that utilizes both global CNN features and local object features for more effective feature representation and reasoning in image captioning. A generative adversarial network (GAN), together with a reinforcement learning (RL) algorithm, is applied to address the exposure bias problem of RNN-based supervised training for language generation. In addition, by automatically measuring the consistency between the generated caption and the image content with the discriminator of the GAN framework and optimizing with RL, we make the finally generated sentences more accurate and natural. Comprehensive experiments show the improved performance of the hierarchical attention mechanism and the effectiveness of our RL-based optimization method. Our model achieves state-of-the-art results on several important metrics of the MSCOCO dataset, using only greedy inference.

Index Terms—Image captioning, hierarchical attention mechanism, generative adversarial network, reinforcement learning, policy gradient.

I. INTRODUCTION

Naturalistic description of an image is one of the primary goals of computer vision and has recently received much attention in the field of artificial intelligence. It is a high-level task and much more complicated than fundamental recognition tasks such as image classification [1] [2] [3] [4], image retrieval [5] [6] [7], and object detection and recognition [8] [9] [10] [11]. It requires the system to comprehensively understand the content of an image and to bridge the gap between the image and natural language. Automatically generating image descriptions is useful in multimedia retrieval and image understanding.

Some pioneering research has been carried out on generating image descriptions [12] [13]. However, as pointed out in [14], most of these models rely on hard-coded visual concepts and sentence templates, which limits their generalization capability. Recently, with the rapid development of deep learning in image recognition and natural language processing, the current trend of image captioning approaches [15] is to follow the encoder-decoder framework, which shares similarities with neural machine translation [16]. Most of these approaches represent the image as a single feature vector from the top layer of a pre-trained convolutional neural network (CNN) and cascade a recurrent neural network (RNN) to generate language.

In fact, tasks like image captioning and machine translation can be considered as structured output problems, where the task is to map the input to an output that possesses its own structure, as stated in [17]. An inherent challenge in these tasks is that the structure of the output is closely related to the structure of the input; hence, a key problem is alignment [17]. Taking neural machine translation as an example, [18] trained a neural model to softly align the output to the input. Subsequent research [19] applied a visual attention model to address this problem in image captioning, with much improvement. The visual attention mechanism dynamically selects the relevant receptive fields in the CNN features to facilitate description generation; in other words, it aligns the output words to spatial regions of the source image. In this paper, we also employ the visual attention mechanism for image captioning.

Nevertheless, natural language often contains very meticulous descriptions, which correspond to the fine-grained objects of an image. As pointed out by [20], most existing neural model-based schemes are limited by their exclusive use of a global, image-level feature representation: some fine-grained objects might not be recognized by relying on global image features alone. In this paper, we propose to use a pre-trained object detection model, i.e., Faster RCNN [10], to retrieve fine-grained image features from the top detected objects. These fine-grained object features provide complementary information to the global image representation, which will be demonstrated in the experiments. In terms of the model structure, the object features are also processed by a visual attention mechanism and are added to the original model to form a hierarchical feature representation, which enables the generation of more meticulous descriptions.

In addition to improving the image feature representation, we also consider improving the current language model, which is widely used in neural machine translation and image captioning. An issue with most previous language models is the training framework, namely an RNN trained with Maximum Likelihood Estimation (MLE) to generate image descriptions. As pointed out in [21], MLE approaches suffer from the so-called exposure bias at the inference stage:
the model generates a sequence iteratively and predicts the next token based on previously predicted tokens, which may never be observed in the training data. In image description generation, MLE also suffers from the problem that the generated language does not correlate well with human assessments of quality [22].

Instead of relying only on MLE, an alternative scheme is the generative adversarial network (GAN) [23]. The GAN was first proposed to generate realistic images. It learns generative models without explicitly defining a loss function on the target distribution; instead, it introduces a discriminator network that tries to differentiate real samples from generated samples, and the whole network is trained with an adversarial training strategy. One can therefore build a discriminator to judge how realistic the samples generated by the description generator are. The role of the caption generator in this model is similar to that of the generator in the conditional GAN [24], which is conditioned on the image features.

However, language generation is a discrete process. Directly providing the discrete samples as inputs to the discriminator does not allow the gradients to be back-propagated through them. The reinforcement learning (RL) [25] framework provides a solution to estimate the gradients of the discontinuous units. When dealing with sequence generation, the RL framework has the problem of lacking intermediate rewards, as discussed in [26]: the reward value can only be obtained once the whole sequence has been generated. This is not suitable, since what we want is the long-term reward of each intermediately generated token, so that the whole sequence is better optimized.

In the proposed scheme, the discriminator takes into account not only the differences between the generated captions and the reference captions but also the consistency between captions and image features. Through the evaluation of the discriminator, the network can better compensate for unrealistic captions that might be generated under MLE training. However, to deal with the discreteness of language, we treat the image captioning generator as an agent of RL. The feedback from the discriminator is considered as the reward for the generator. To update the parameters of the image description generator in this framework, we consider the generator as a stochastic parameterized policy. We train the policy network using the Policy Gradient method [27], which naturally solves the differentiation difficulties of the conventional GAN. Also, to solve the problem of lacking intermediate rewards, we borrow the idea from the famous "AlphaGo" program [28], in which a Monte Carlo roll-out strategy is applied to sample the expected long-term reward of an intermediate move. If we consider sequence token generation as the action to be taken in RL, we can apply a similar Monte Carlo roll-out strategy to obtain the intermediate rewards; [26] successfully applied the Monte Carlo roll-out to sequence generation. In this paper, we use a similar sampling method to handle intermediate rewards during caption generation.

To summarize, our contribution in this paper is threefold:

• We propose a hierarchical attention mechanism to reason on the global features and the local object features for image captioning.
• The policy gradient algorithm combined with the GAN is proposed for the training and optimization of the language model, with improvements over the MLE training scheme.
• Through comprehensive experiments, we validate the proposed algorithm, and results comparable with current state-of-the-art methods are achieved on the MSCOCO dataset.

II. RELATED WORK

A. Deep Model-based Image Captioning

Promoted by the recent success of deep learning in image recognition tasks and machine translation, research on generating image descriptions, i.e., image captioning, has made remarkable progress [29] [14] [13] [30] [15] [31]. As mentioned above, most of the previously proposed approaches treat image description generation as a translation process, mainly by borrowing the encoder-decoder framework [32] from neural machine translation [16]. Generally, this paradigm uses a deep CNN model as the image encoder, which maps the image into a static feature representation, and an RNN as a decoder to decode this static representation into an image description. The whole framework is trained with supervised learning under MLE. The generated description should be grammatically correct and match the content of the image.

Specifically, Karpathy et al. [14] proposed an alignment model through a multi-modal embedding layer. This model is able to align parts of a description with the corresponding regions of the image, which attracted significant attention. Jia et al. [30] proposed a variation of the LSTM, called gLSTM, for the image captioning task, mainly to tackle the problem of losing track of the image content; this model includes semantic information along with the whole image as inputs to generate captions. Donahue et al. [31] applied both convolutional layers and recurrent layers to form a Long-term Recurrent Convolutional Network (LRCN) for visual recognition and description.

Bahdanau et al. [18] pointed out that a potential problem in this approach is that the model has to compress all the necessary information of a source sentence into a fixed-length representation, which may make it difficult for the neural network to cope with long sentences. The static feature representation in the encoder-decoder framework, for both machine translation and image captioning, cannot automatically retrieve relevant information from the source and thus ultimately limits the final performance. In neural machine translation, Bahdanau et al. [18] proposed a soft attention mechanism that enables the decoder to automatically focus on the relevant parts of the source sentence. In computer vision, the attention mechanism has long been the focus of much research [18] [33] [34], since human perception does not tend to process a whole scene in its entirety at once but applies mechanisms to selectively focus on the information needed. A comprehensive study of hard attention coupled with reinforcement learning and of soft attention for image captioning was published by Xu et al. [19].
Yao et al. [35] tackled the video captioning task by capturing global temporal structures among video frames with a temporal attention mechanism, which makes the model dynamically focus on the key frames that are most relevant to the predicted word. The Attention Models (ATT) developed by You et al. [36] first extract semantic concept proposals and fuse them with RNN hidden states and outputs; this method uses k-NN and multi-label ranking to extract semantic concepts or attributes and fuses these concepts into one vector with an attention mechanism. Similarly, Yao et al. [37] embedded attributes together with image features into an RNN in various ways to boost image captioning performance. Recently, Chen et al. [38] proposed to combine spatial attention and channel-wise attention for image captioning, with improved results. Alternatively, Li et al. [20] proposed a global-local attention mechanism that includes local features extracted from the top objects returned by a pre-trained object detector. Inspired by [20], we also include local features from the top detected objects. However, we build a hierarchical model, whereas they treated local and global features equivalently.

B. Policy Gradient Optimization for Image Captioning

Another approach to boosting the performance of language tasks is to compensate for the so-called exposure bias problem in RNN-based MLE learning. As pointed out in [39], RNNs trained by MLE essentially minimize the KL-divergence between the distribution of target sequences and the distribution defined by the model. This KL-divergence objective tends to favour a model that overestimates its smoothness, which can lead to unrealistic samples [40].

In order to tackle these problems and generate more realistic image descriptions, some studies directly use evaluation metrics such as BLEU [41], METEOR [42] and ROUGE [43] as the reward signal and build the model under the RL framework. For instance, Ranzato et al. [44] were the first to use the policy gradient algorithm in an RNN-based sequence model, in which a REINFORCE-based approach was used to calculate a sentence-level reward and a Monte Carlo technique was employed for training. Liu et al. [45] studied several linear combinations of the evaluation metrics, proposed to use a linear combination of SPICE [46] and CIDEr [47] as the reward signal, and applied a policy gradient algorithm to optimize the model, with improved results; this work used a Monte Carlo roll-out strategy to obtain the intermediate reward during description generation. More recently, Bahdanau et al. [48] applied a token-level reward in temporal difference training for sequence generation, instead of a sentence-level reward.

As discussed previously, the GAN [23] estimates a difference measure using a binary classifier, called a discriminator, to discriminate between target samples and generated samples. GANs rely on back-propagating these difference estimates through the generated samples to train the generator to minimize these differences. Hence, the whole network in a GAN is trained in an adversarial way. The GAN was originally proposed to generate naturalistic images [23] [24] [49] [50]. Directly applying a GAN to language problems is impossible, since sequences are composed of discrete elements in many application areas such as machine translation and image captioning.

A possible solution to the discreteness problem of language is to use the Gumbel-Softmax approximation [51] [52]. For instance, Shetty et al. [53] use a GAN to generate more realistic and accurate image descriptions with the aid of Gumbel-Softmax to deal with the discontinuity issue in language processing. Another, more general solution is to borrow an idea from the RL framework, in which the feedback from the discriminator is considered as the reward for the language generator. Dai et al. [22] built a model based on a conditional GAN to generate diverse and naturalistic image descriptions and paragraphs, which utilizes a policy gradient for optimization. Yu et al. [26] proposed a model called SeqGAN, which unifies the GAN framework and the RL learning problem and has recently received much attention [54] [55]. They propose a three-step training strategy consisting of pre-training the generator, pre-training the discriminator, and the final adversarial training. In this paper, inspired by SeqGAN, we propose to use a discriminator to judge the fitness of the generated image descriptions with reference to the image content, and we apply the policy gradient optimization technique [27] to train the model. Unlike the original SeqGAN, our discriminator not only cares about the differences between the target language and the model-generated language but also considers the coherence of the language with the image content.

III. APPROACH

In this section, we describe the proposed method in two parts: the hierarchical attention mechanism and the policy gradient optimization algorithm.

A. Hierarchical Attention Mechanism

The hierarchical attention mechanism consists of two parts: a spatial attention mechanism, which operates on global CNN features, and a local attention mechanism, which operates on object features.

The spatial attention mechanism is based on the model in [19]. Specifically, the model comprises an encoder and a decoder. We use a convolutional neural network pre-trained on the ImageNet dataset [56] to extract a set of convolutional features. These features, denoted as a = {a_1, ..., a_L}, correspond to certain portions of the 2-D image. We extract convolutional features instead of fully connected ones in order to build a spatial attention mechanism, since convolutional features have a spatial layout.
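To make the notation above concrete, the short NumPy sketch below flattens a convolutional feature map into the set of location vectors a = {a_1, ..., a_L} and pairs it with a hypothetical matrix of detected object features d = {d_1, ..., d_N}. The 7x7x2048 grid (so L = 49) and the 30x4096 object matrix are illustrative assumptions chosen to match the dimensionalities reported later in the implementation details; this is not the authors' code.

```python
import numpy as np

# Hypothetical global CNN feature map, e.g. from a ResNet "res5c"-style layer:
# a spatial grid of 7x7 locations, each described by a 2048-d vector.
feature_map = np.random.randn(7, 7, 2048)

# Flatten the spatial grid into L = 49 location vectors a_1, ..., a_L.
L = feature_map.shape[0] * feature_map.shape[1]
a = feature_map.reshape(L, -1)          # shape (49, 2048)

# Hypothetical local object features d_1, ..., d_N from a detector
# (e.g. N = 30 boxes, each described by a 4096-d "FC6"-style vector).
N = 30
d = np.random.randn(N, 4096)

print(a.shape, d.shape)                 # (49, 2048) (30, 4096)
```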
Fig. 1. The hierarchical attention model structure: the CNN encoder and the object detector extract the global and local features, respectively. These two types of features are forwarded to the LSTM models with the global and the local attention mechanisms. The outputs of the two LSTM models are concatenated and decoded into words.

The Long Short-Term Memory (LSTM) network, originally proposed by Hochreiter and Schmidhuber [57], is applied as the language decoder because of its superior performance in natural language processing.

i_t = \sigma(W_{xi} z_t + W_{hi} h_{t-1} + b_i)
f_t = \sigma(W_{xf} z_t + W_{hf} h_{t-1} + b_f)
o_t = \sigma(W_{xo} z_t + W_{ho} h_{t-1} + b_o)
g_t = \sigma(W_{xc} z_t + W_{hc} h_{t-1} + b_c)
c_t = f_t \cdot c_{t-1} + i_t \cdot g_t
h_t = o_t \cdot \phi(c_t)                                                  (1)

In Equation 1, i_t, f_t, o_t, c_t and h_t are the input gate, forget gate, output gate, cell memory and hidden state of the LSTM network, respectively; g_t and h_t are the input and the output of the LSTM model. z_t is the context vector, which is produced by the soft attention mechanism and captures the visual information associated with a certain input location. The soft attention mechanism automatically allocates adaptive weights to the image locations to facilitate the task at hand.

e_{ti} = f_{att}(a_i, h_{t-1})                                             (2)

where a_i \in {a_1, ..., a_L}. Equation 2 maps the image features of each location, together with information from the hidden state, into an adaptive weight, which indicates the importance of each image location for the recognition.

\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}              (3)

Equation 3 normalizes the adaptive weights into probability values in the range of 0 to 1 using the Softmax function. Once these weights (summing to 1) are computed, we element-wise multiply the weight vector \alpha_t with the image feature vectors a and sum them to obtain the context vector z_t, which can be expressed as in Equation 4. This can be seen as the expectation of the weighted feature maps.

z_t = \sum_{i=1}^{L} \alpha_{t,i} a_i                                      (4)

The context vector z_t is then forwarded to the LSTM network to generate captions, as described in Equation 1. This soft attention mechanism adaptively selects the relevant visual parts of the given image features and thus facilitates the recognition.

The local attention mechanism is formulated using the object features and another LSTM model. We use a pre-trained object detector to retrieve the top N detected object features, denoted as d = {d_1, ..., d_N}. We then use another LSTM model with soft attention to allocate an adaptive weight to each of these features.

e^d_{ti} = f^d_{att}(d_i, h^d_{t-1})                                       (5)

where h^d denotes the hidden state of the LSTM model of the local attention mechanism.

\alpha^d_{ti} = \frac{\exp(e^d_{ti})}{\sum_{k=1}^{N} \exp(e^d_{tk})}        (6)

Similarly, Equation 6 normalizes the adaptive weights of the local features into probability values with the Softmax function.

z^d_t = Concat(\sum_{i=1}^{N} \alpha^d_{t,i} d_i, h_{t-1})                 (7)

Equation 7 shows that the context vector of the local attention model gathers information from both the local features and the global attention mechanism, where Concat denotes the concatenation of the features.
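As a concrete illustration of Equations 2-7, the following sketch computes the global attention weights and context vector for one time step, and the corresponding local context vector over detected objects, in NumPy. The additive form of the scoring function f_att and all weight shapes are assumptions for illustration only, not the authors' exact implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention(features, h_prev, W_f, W_h, v):
    """Additive attention (Eqs. 2-4 / 5-6): score each feature vector against
    the previous hidden state, normalize with Softmax, and return the
    expected (weighted) feature as the context vector."""
    scores = np.tanh(features @ W_f + h_prev @ W_h) @ v       # e_{t,i}
    alpha = softmax(scores)                                    # Eq. 3 / Eq. 6
    context = alpha @ features                                 # Eq. 4: sum_i alpha_i a_i
    return context, alpha

# Assumed sizes: L = 49 global locations (2048-d), N = 30 objects (4096-d),
# hidden size 512 for both LSTMs.
rng = np.random.default_rng(0)
a, d = rng.standard_normal((49, 2048)), rng.standard_normal((30, 4096))
h_prev, hd_prev = rng.standard_normal(512), rng.standard_normal(512)

# Global attention (Eqs. 2-4).
z_t, alpha = soft_attention(a, h_prev,
                            rng.standard_normal((2048, 256)),
                            rng.standard_normal((512, 256)),
                            rng.standard_normal(256))

# Local attention (Eqs. 5-7): the attended object feature is concatenated
# with the global LSTM's previous hidden state, as in Equation 7.
zd_att, alpha_d = soft_attention(d, hd_prev,
                                 rng.standard_normal((4096, 256)),
                                 rng.standard_normal((512, 256)),
                                 rng.standard_normal(256))
zd_t = np.concatenate([zd_att, h_prev])                        # Eq. 7
print(z_t.shape, zd_t.shape, alpha.sum(), alpha_d.sum())       # (2048,) (4608,) 1.0 1.0
```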
The context vector z^d_t is then forwarded to a second LSTM model, as described by Equation 8.

i^d_t = \sigma(W^d_{xi} z^d_t + W^d_{hi} h^d_{t-1} + b^d_i)
f^d_t = \sigma(W^d_{xf} z^d_t + W^d_{hf} h^d_{t-1} + b^d_f)
o^d_t = \sigma(W^d_{xo} z^d_t + W^d_{ho} h^d_{t-1} + b^d_o)
g^d_t = \sigma(W^d_{xc} z^d_t + W^d_{hc} h^d_{t-1} + b^d_c)
c^d_t = f^d_t \cdot c^d_{t-1} + i^d_t \cdot g^d_t
h^d_t = o^d_t \cdot \phi(c^d_t)                                            (8)

The two LSTM models, denoted LSTM^G for the global features and LSTM^L for the local features, are jointly trained to map the hierarchical feature representation to language. LSTM^L is at the higher level and can be used to decode the hidden states into the final outputs. However, the gradient vanishing problem cannot be avoided if we only use the hidden states from LSTM^L to decode information. Inspired by [3], in which shortcut connections are applied to solve the gradient vanishing problem, we concatenate the hidden states from LSTM^G and LSTM^L and map them to language vectors, as shown in Equation 9.

h^{output}_t = Concat(h_t, h^d_t)
logits = W_p h^{output}_t
P(s_t | I, s_0, s_1, s_2, ..., s_{t-1}) = Softmax(logits)                  (9)

In MLE training, if the length of a sentence is T, the loss function can be formulated as in Equation 10, which is the sum of the log likelihoods of the individual words.

Loss = \sum_{t=1}^{T} \log p(s_t | I, s_0, s_1, s_2, ..., s_{t-1})         (10)
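The sketch below illustrates the decoding step of Equation 9 and the accumulation of the MLE objective of Equation 10. The vocabulary size, the projection matrix W_p and the token ids are hypothetical placeholders, and the hidden states are random stand-ins for the outputs of LSTM^G and LSTM^L.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(h_t, h_d_t, W_p):
    """Eq. 9: concatenate the global and local hidden states, project them
    to vocabulary logits, and turn the logits into a word distribution."""
    h_output = np.concatenate([h_t, h_d_t])          # h_t^output
    logits = W_p @ h_output                          # logits = W_p h_t^output
    return softmax(logits), logits

# Assumed sizes: two 512-d hidden states and a 10,000-word vocabulary.
rng = np.random.default_rng(0)
vocab_size, hidden = 10_000, 512
W_p = rng.standard_normal((vocab_size, 2 * hidden)) * 0.01

# Accumulate the MLE objective of Eq. 10 over a toy reference caption
# (token ids are placeholders).
reference = [1, 42, 7, 99, 2]                        # START ... END
log_likelihood = 0.0
for s_t in reference:
    h_t, h_d_t = rng.standard_normal(hidden), rng.standard_normal(hidden)
    p, _ = decode_step(h_t, h_d_t, W_p)
    log_likelihood += np.log(p[s_t])                 # Eq. 10: sum of log p(s_t | I, s_<t)

print(log_likelihood)   # MLE maximizes this sum (equivalently minimizes its negative)
```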
B. Policy Gradient Optimization

In addition to using MLE to train the image caption generator, we also apply a policy gradient optimization algorithm within the RL framework to alleviate the previously discussed exposure bias problem of RNN-based MLE training and to increase the quality of the generated descriptions.

We feed both the generated descriptions and the reference descriptions to the discriminator. The level of coherence between a description and the image content is calculated by a dot product, which is forwarded to the discriminator, as described in Fig. 3. This operation considers the coherence between a given caption (sequence) and the corresponding image features, which helps to make the generated captions more realistic and naturalistic. The reference sequences are labeled as true whilst the generated sequences are labeled as false during the training of the discriminator. The discriminator is also an LSTM network, trained with a Softmax Cross Entropy loss; hence, it outputs the probability of a sample being true. These probabilities are then used as the reward signal in the RL framework and are utilized in the Policy Gradient algorithm to update the parameters of the image caption generator.

Following [27], the objective of the policy network G_\theta(y_t | y_{1:t-1}) (the image caption generator) is to generate a sequence from the start state s_0 that maximizes its expected long-term reward, as described by Equation 11:

J(\theta) = E[R_T | s_0, \theta] = \sum_{y_1 \in Y} G_\theta(y_1 | s_0) \cdot Q^{G_\theta}_{D_\theta}(s_0, y_1)     (11)

where R_T is the reward for a complete sequence and Q^{G_\theta}_{D_\theta}(s, y) is the action-value function of a language sequence, defined as the expected accumulated reward starting from state s, taking a certain action, and then following policy G_\theta.

The action-value function is estimated using the REINFORCE algorithm [58], taking the probability of being real produced by the discriminator as the reward, which can be defined as in Equation 12.

Q^{G_\theta}_{D_\theta}(a = y_T, s = Y_{1:T-1}) = D_\theta(Y_{1:T})     (12)

As can be seen in Equation 12, the discriminator only provides a reward for a complete sequence. We should care not only about the reward of a complete sequence but also about the long-term reward at future time steps, since the long-term reward is what we actually want. Similar to the game of Go [28], in which the agent sometimes gives up an immediate interest but cares about the final victory, we apply a Monte Carlo roll-out strategy for an intermediate state, i.e., an unfinished sequence. We represent an N-time Monte Carlo search as in Equation 13:

{Y^1_{t+1:T}, ..., Y^n_{t+1:T}, ..., Y^N_{t+1:T}} = MC^{G_\theta}(Y_{1:t}; N),  with  MC \sim Multinomial(logits)     (13)

where Y_{1:t} is the sequence of tokens generated so far and Y^n_{t+1:T} is a Monte Carlo sample based on a roll-out policy, which, in our case, is set to be the same as the image caption generator for convenience. In reality, we can use any policy to perform the roll-out operation. Here, logits is the output of the LSTM decoder, and MC denotes a sampling procedure from a Multinomial distribution.

Since there is no intermediate reward, the Monte Carlo roll-out strategy samples the possible future tokens N times and averages the resulting rewards to achieve the goal of reward estimation, as described in Equation 14.

Q^{G_\theta}_{D_\theta}(a = y_t, s = Y_{1:t-1}) =
  \frac{1}{N} \sum_{n=1}^{N} D_\theta(Y^n_{1:T}),  Y^n_{1:T} \in MC^{G_\theta}(Y_{1:t}; N),  for t < T
  D_\theta(Y_{1:T}),  for t = T     (14)

The Monte Carlo roll-out strategy is visualized in Fig. 4.
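A minimal sketch of the N-time Monte Carlo roll-out of Equations 13 and 14 follows. As in the paper, the roll-out policy is taken to be the generator itself, but here both the generator's sampling and the discriminator D_theta are stubbed out with placeholder functions; only the reward-estimation logic of Equation 14 is shown.

```python
import random

T = 20          # assumed maximum caption length
END = 2         # assumed END token id

def rollout_policy(prefix):
    """Placeholder roll-out policy: in the paper this is the caption
    generator itself, sampling the remaining tokens from its multinomial
    output distribution (Eq. 13)."""
    seq = list(prefix)
    while len(seq) < T and seq[-1] != END:
        seq.append(random.randrange(3, 100))   # dummy token sampling
    return seq

def discriminator(sequence):
    """Placeholder D_theta: probability that the complete sequence is real."""
    return random.random()

def intermediate_reward(prefix, t, N=16):
    """Eq. 14: for t < T, average the discriminator scores of N completed
    roll-outs of the prefix; for a finished sequence, use D directly."""
    if t == T or prefix[-1] == END:
        return discriminator(prefix)
    rollouts = [rollout_policy(prefix) for _ in range(N)]
    return sum(discriminator(r) for r in rollouts) / N

# Example: estimated reward for the partial caption Y_{1:3}.
print(intermediate_reward([1, 57, 12], t=3))
```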
Once the reward value from the discriminator is obtained, it is ready to update the generator. The goal is to maximize the average reward starting from the initial state, as defined in Equation 15.

J(\theta) = \frac{1}{N} \sum_{i=1}^{N} V_\theta(s_0 | X_i, Y_i)     (15)

where N is the number of samples used for training.
Fig. 2. Policy Gradient optimization with a discriminator that evaluates the similarity between the generated sentence and the reference sentence.

Fig. 3. Policy Gradient optimization with a discriminator that evaluates the coherence between the generated sentence and the image content.

Fig. 4. Monte Carlo roll-out: we use Monte Carlo sampling to sample tokens at future time steps and average the resulting rewards to obtain the intermediate rewards, so as to optimize the token generated at each time step.

We can use the Policy Gradient theorem from [27] and write the gradient of the objective function (the reward signal) as in Equation 16.

\nabla_\theta J(\theta) = E_{Y_{1:t-1} \sim G_\theta}\left[ \sum_{y_t \in Y} \nabla_\theta G_\theta(y_t | Y_{1:t-1}) \cdot Q^{G_\theta}_{D_\theta}(Y_{1:t-1}, y_t) \right]     (16)

Since the expectation can be approximated by sampling, we can now update the parameters of the image caption generator using Equation 17.

\theta \leftarrow \theta + \alpha_h \nabla_\theta J(\theta)     (17)

In practice, we can use advanced gradient algorithms such as RMSprop [59] and Adam [60] to train the caption generator.
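The following sketch shows one REINFORCE-style parameter update corresponding to Equations 16 and 17, written for a toy linear-softmax policy standing in for the LSTM generator. It uses the standard log-derivative form of the sampled policy gradient, with the roll-out value Q scaling the gradient of log G_theta(y_t | Y_{1:t-1}); the network, sizes and learning rate alpha are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy policy: a single linear-softmax layer over a small vocabulary,
# standing in for the LSTM caption generator G_theta.
rng = np.random.default_rng(0)
vocab, feat_dim, alpha = 50, 32, 1e-2
theta = rng.standard_normal((vocab, feat_dim)) * 0.1

def policy(state):
    return softmax(theta @ state)                 # G_theta(y_t | Y_{1:t-1})

# One sampled time step: `state` summarizes Y_{1:t-1} (here a random feature),
# y_t is sampled from the policy, and Q is the roll-out reward from Eq. 14.
state = rng.standard_normal(feat_dim)
probs = policy(state)
y_t = rng.choice(vocab, p=probs)
Q = 0.73                                          # e.g. an averaged discriminator score

# Sampled estimate of Eq. 16: gradient of log G_theta(y_t | state) scaled by Q.
one_hot = np.zeros(vocab); one_hot[y_t] = 1.0
grad_logits = one_hot - probs                     # d log softmax / d logits
grad_theta = Q * np.outer(grad_logits, state)     # chain rule through theta @ state

theta += alpha * grad_theta                       # Eq. 17: gradient ascent step
print(policy(state)[y_t] > probs[y_t])            # sampled token became more likely
```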
The image caption generator and the discriminator are adversarially trained in the GAN framework [23]. In a GAN [24], the discriminator can pass the gradient directly to the generator; due to the discreteness of sequence generation, we instead apply RL to estimate the gradient of the generator in our model. Specifically, the training strategy is described in Algorithm 1. We initially pre-train the image caption generator using MLE; in practice, this is equivalent to the Cross Entropy loss [61], so we can set the pre-training step the same as in [19]. The trained model is used to generate captions that serve as fake samples, which, along with the reference captions, are fed into the discriminator for training. Similarly, the discriminator is also pre-trained for a certain number of steps. The next steps are the adversarial training steps, in which the image caption generator and the discriminator are trained alternately until the networks converge.

In addition to the sentence comparison scheme introduced previously and shown in Fig. 2, we also employ a scheme to evaluate the coherence between the generated captions and the image content. Specifically, both the global features and the local object features are processed by average pooling in order to obtain a fixed-size feature representation, denoted as V_i. The captions, as in the sentence comparison scheme, are also encoded into a fixed-size vector using an LSTM model, denoted as V_w.
The two vectors V_i and V_w are then dot-producted and forwarded to a logistic function to obtain the reward for RL training, as shown in Fig. 3.
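A small sketch of this coherence reward: an average-pooled image vector V_i and a caption vector V_w are combined by a dot product and passed through a logistic function to give the probability used as the RL reward. The projection of the pooled global and local features into a shared 512-dimensional space, and the random stand-in for the LSTM caption encoding, are assumptions made only so that the shapes line up; they are not the authors' exact architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Average-pool the global grid features and the local object features,
# then project both into a shared embedding space (the projection matrices
# are an assumption; the paper does not spell this step out).
a = rng.standard_normal((49, 2048))      # global features
d = rng.standard_normal((30, 4096))      # local object features
W_g, W_l = rng.standard_normal((2048, 512)), rng.standard_normal((4096, 512))
V_i = a.mean(axis=0) @ W_g + d.mean(axis=0) @ W_l     # pooled image vector

# V_w stands for the caption encoded by the discriminator's LSTM.
V_w = rng.standard_normal(512)

# Dot product + logistic function: probability that the caption matches
# the image, used as the reward signal for the generator.
reward = sigmoid(V_i @ V_w / np.sqrt(V_i.size))       # scaling only for numeric stability
print(float(reward))
```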
Algorithm 1 Image Caption Generation by Adversarial Training and Reinforcement Learning
Require: Image caption generator G_\theta; discriminator D_\theta.
  Pre-train G_\theta using MLE for several epochs.
  Generate negative samples using the pre-trained G_\theta to train D_\theta.
  Pre-train D_\theta for several steps.
  repeat
    for update-generator for 1 step do
      Generate a sequence Y_{1:T} = (y_1, ..., y_T).
      for t = 1 to T do
        Compute the intermediate reward Q(t) by Monte Carlo roll-out.
      end for
      Update the parameters \theta using Policy Gradient.
    end for
    for update-discriminator for 1 or 5 steps do
      Train discriminator D_\theta using reference sequences (true) and sequences generated by the current generator (fake).
    end for
  until convergence
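The sketch below mirrors Algorithm 1 as a Python training loop: MLE pre-training of the generator, pre-training of the discriminator on reference versus generated captions, and then alternating generator and discriminator updates. All of the called routines are stubs standing in for the components described above, and a fixed iteration count replaces the convergence test of the algorithm.

```python
import random

# Minimal stand-ins so that the loop below actually runs; in practice these
# are the LSTM caption generator and the LSTM discriminator described above.
class StubGenerator:
    def pretrain_mle(self, data, epochs): pass
    def sample(self, image, max_len=20):
        return [1] + [random.randrange(3, 100) for _ in range(5)] + [2]
    def rollout_reward(self, prefix, D, image, N=8):
        return sum(D(prefix + [2]) for _ in range(N)) / N     # crude Eq. 14 stand-in
    def policy_gradient_update(self, caption, rewards): pass  # Eq. 16-17 step

class StubDiscriminator:
    def pretrain(self, real, fake, steps): pass
    def train_step(self, real, fake): pass
    def __call__(self, sequence): return random.random()      # P(sequence is real)

def adversarial_training(G, D, images, captions, iters=3, d_steps=1):
    """Schematic of Algorithm 1: MLE pre-training, discriminator
    pre-training, then alternating generator/discriminator updates."""
    G.pretrain_mle((images, captions), epochs=10)
    D.pretrain(real=captions, fake=[G.sample(im) for im in images], steps=2500)
    for _ in range(iters):
        for image in images:
            y = G.sample(image)                                # Y_1:T
            q = [G.rollout_reward(y[:t + 1], D, image) for t in range(len(y))]
            G.policy_gradient_update(y, q)                     # update theta
            for _ in range(d_steps):                           # 1 or 5 D steps
                D.train_step(real=captions, fake=[G.sample(im) for im in images])

adversarial_training(StubGenerator(), StubDiscriminator(),
                     images=[None, None], captions=[[1, 5, 2], [1, 9, 2]])
```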
IV. EXPERIMENTAL VALIDATION

A. Dataset Introduction

We conduct our experiments on the MSCOCO dataset [62]. To be consistent with previous research, we use the MSCOCO 2014 release, which includes 123,000 images. The dataset contains 82,783 images in the training set, 40,504 images in the validation set and 40,775 images in the test set. As the ground-truth for the MSCOCO test set is not available, the validation set is further split into a validation subset for model selection and a test subset for local experiments. This is the "Karpathy" split [14]: it utilizes all 82,783 training images for training, and selects 5,000 images for validation and 5,000 images for testing from the official validation set. The standard evaluation protocol contains BLEU [41], METEOR [42], CIDEr [47] and ROUGE-L [43].

BLEU is the most popular metric for performance evaluation in machine translation and is based only on n-gram statistics; BLEU-1, BLEU-2, BLEU-3 and BLEU-4 measure the performance of 1-, 2-, 3- and 4-grams, respectively. METEOR is based on the harmonic mean of unigram precision and recall and seeks correlation at the corpus level. CIDEr evaluates the generated sentences against human consensus. ROUGE-L measures the longest common subsequence between the target sentence and the generated sentence.

B. Implementation Details

For all the images in the COCO dataset, we obtain global convolutional features (from the layer "res5c") using a pre-trained Residual-152 network [3] on the Caffe platform [63], with a dimensionality of 49×2048. We also retrieve local object features using a Faster RCNN [10] object detection network pre-trained on the MSCOCO dataset. Specifically, we obtain the top K detected object features from the "FC6" layer of the VGG16 model [2] used in Faster RCNN, with a dimensionality of K×4096. We build the hierarchical attention mechanism and the policy gradient optimization on the TensorFlow platform [64].

1) Training the Faster RCNN on the MSCOCO dataset: In order to obtain better local object features, we train the Faster RCNN model on the MSCOCO object detection dataset. The model is first pre-trained on the ImageNet object detection dataset [56]. The MSCOCO object detection dataset shares the same images with the image captioning task; consequently, we keep the same splits as the image captioning dataset for training. The training process on the MSCOCO dataset is almost the same as the pre-training on ImageNet. The initial learning rate is set to 0.001, the momentum of the stochastic gradient descent is set to 0.9, and the weight decay is set to 0.0005.

2) Language Pre-processing: To pre-process the language, special symbols such as '.', ',', '(', ')' and '-' are replaced with blank spaces, whilst '&' is replaced with 'and'. Since we set the maximum length of the descriptions to 20 words, we delete the caption references from the original dataset that are longer than 20 words. For the vocabulary, following the open-source code of [14], we include words that occur more than 5 times. We map the symbol 'NULL' to 0, 'START' to 1 and 'END' to 2.
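A short sketch of the language pre-processing just described: punctuation is replaced with spaces, '&' becomes 'and', captions longer than 20 words are discarded, words occurring more than 5 times enter the vocabulary, and 'NULL', 'START' and 'END' map to 0, 1 and 2. Tokenization by lower-casing and whitespace splitting is an assumption; the rest follows the stated rules.

```python
from collections import Counter

MAX_LEN = 20
SPECIALS = {'NULL': 0, 'START': 1, 'END': 2}

def clean(caption):
    """Replace '.', ',', '(', ')', '-' with spaces and '&' with 'and'."""
    for ch in ".,()-":
        caption = caption.replace(ch, " ")
    return caption.replace("&", "and").lower().split()

def build_vocab(captions, min_count=5):
    """Keep captions of at most MAX_LEN words; include words occurring more
    than min_count times; reserve ids 0-2 for the special symbols."""
    kept = [clean(c) for c in captions]
    kept = [c for c in kept if len(c) <= MAX_LEN]
    counts = Counter(w for c in kept for w in c)
    vocab = dict(SPECIALS)
    for word, n in counts.items():
        if n > min_count:
            vocab[word] = len(vocab)
    return kept, vocab

captions = ["A man riding a horse on the beach.",
            "A dog & a cat sitting on a couch."] * 6
kept, vocab = build_vocab(captions)
print(len(kept), vocab["a"], vocab["NULL"], vocab["START"], vocab["END"])
```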
3) Training Details of the Model: The network is first pre-trained using MLE for 10 epochs. During training, the size of the hidden states of the two LSTM models is set to 512; we choose the same hidden-state size as [20], as they achieved satisfactory performance with this size. We set the batch size to 32 and the learning rate to 0.001, and we use the Adam algorithm [60] to train the network. Subsequently, we train the discriminator for 2500 steps, followed by the adversarial training scheme, in which the caption generator and the discriminator are trained alternately until convergence. During the pre-training of the discriminator and the policy gradient-based adversarial training described previously, the Adam algorithm is also applied; the learning rate for these steps is set to 0.0001. Following the open-source code of [14], at training time we set the maximum length of the input sequence to 20 words; at test time, alternatively, we set the maximum length of a generated sequence to 30 words. During the training of the proposed model, we add a trainable word embedding layer from Google's TensorFlow platform [64]. All the experiments are conducted on a server with an NVIDIA TITAN X GPU running the Ubuntu 14.04 operating system.

C. Results

1) Quantitative Evaluation: In this section, a comprehensive quantitative evaluation is conducted using different experimental settings on the MSCOCO dataset.
Fig. 5. Visualization of the global attention maps and generated captions. The red color indicates the importance of each region of the image.

Fig. 6. Visualization of the attention weights on the top 10 detected objects; the blue boxes indicate the detected objects, whilst the labels show the attention weights of the local attention model.

a) Comparison between the global attention, the local attention and the hierarchical attention models: We first obtain the results using only the global attention model, which is similar to the soft attention model in [19]. Since we use advanced CNN features from the Residual-152 model, the results on BLEU, METEOR, CIDEr and ROUGE-L are all satisfactory; they are listed in Table I. Then the local attention model alone, using the detected object features from a Faster RCNN detector, is tested, with results that are much lower than those of the global attention model, as listed in Table I. One possible reason is that the Faster RCNN uses only the VGG16 model, which is not as powerful as the Residual-152 network. Another reason is that the local object features, despite their ability to provide complementary information to the global attention model, can sometimes miss many important features. Finally, we test our proposed hierarchical attention model under MLE training, which utilizes both the global and the local attention for image captioning. The results improve on the baseline significantly, as can be seen in Table I; specifically, all seven evaluation metrics are improved by our hierarchical attention model.

b) The determination of the number of top detected objects: To determine the best number k of top detected objects for the local attention model, we perform an ablation study. We extract the top 10, 20 and 30 detected object features and test them with the hierarchical attention model. The results can be seen in Table II. As the number k increases from 10 to 30, the performance increases accordingly. Although the maximum length of our generated sentences is set to 30, not every word represents an object; also, intuitively, there are at most around 30 objects within an image. Hence, in the following experiments, we use the top 30 detected object features for the local attention model.

c) The performance of Policy Gradient with reward only from language comparison: Next we start the reinforcement learning steps. We first train the discriminator that only compares the similarity between the reference sentence and the generated sentence; specifically, we follow the model defined in Fig. 2. The discriminator is first trained for 2500 steps, which we find sufficient for it to converge. The loss curve of the image caption generator is shown in Fig. 8. After the 2500 steps of pre-training the discriminator, the loss of the image caption generator starts to decline, which validates that the policy gradient starts to work. Then we further train the generator and discriminator adversarially for another epoch, and report the results in Table III. We also experimented with two different settings for the adversarial training steps: the first setting trains the discriminator for 1 step, followed by 1 step for the generator; the other trains the discriminator for 5 steps, followed by 1 step of training for the generator.
Fig. 7. Visualization of the generated descriptions: for each example we show the ground-truth caption, the caption generated by the MLE-trained model, and the caption generated by our model, which is more accurate and realistic than the MLE caption. All samples are randomly selected.
(a) Ground-truth: A group of people standing next to a bus under an airplane. MLE: A large airplane is parked on the runway. Ours: A large airplane is parked on the runway with people walking around.
(b) Ground-truth: A yellow and red bus parked in a parking lot with other busses. MLE: A yellow bus is parked on the side of the road. Ours: A yellow and red bus parked in a parking lot.
(c) Ground-truth: A little boy sitting in front of a hot dog covered in ketchup. MLE: A little girl is eating a hot dog. Ours: A young boy is eating a hot dog.
(d) Ground-truth: The lone adult cow walks on rocks near the beach. MLE: A cow is walking down the street in the sand. Ours: A cow is standing on the beach next to body of water.
(e) Ground-truth: A baseball player swinging a baseball bat during a game. MLE: A baseball player is preparing to swing at a pitch. Ours: A baseball player is swinging a bat at a ball.
(f) Ground-truth: Six cows standing and laying on the beach. MLE: A group of cows standing on top of a snow covered field. Ours: A group of cows standing on top of a sandy beach.
(g) Ground-truth: A fat cat in the living room watching the tv. MLE: A cat is sitting in a living room with a television. Ours: A cat sitting on the floor watching a television.
(h) Ground-truth: A giraffe is walking through the forest with tall trees. MLE: A giraffe is standing in the woods with trees in the background. Ours: A giraffe standing next to a tree in a forest.

TABLE I
Comparison of image captioning results on the MSCOCO dataset using different attention mechanisms

Methods                  BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  CIDEr   ROUGE-L
Soft Attention [19]      70.7    49.2    34.4    24.3    23.90   -       -
Global Attention         70.121  50.304  35.434  25.111  23.658  84.701  54.308
Local Attention          64.059  42.359  28.089  19.033  20.203  56.898  49.861
Hierarchical Attention   72.611  52.769  37.802  27.243  24.731  88.140  56.048

TABLE II
Comparison of image captioning results on the MSCOCO dataset with different numbers of objects

Methods                                                     BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  CIDEr   ROUGE-L
Hierarchical Attention with 10 objects for local attention  70.601  50.423  36.643  25.389  24.633  87.316  55.241
Hierarchical Attention with 20 objects for local attention  72.159  52.498  37.552  26.918  24.725  88.639  55.825
Hierarchical Attention with 30 objects for local attention  72.611  52.769  37.802  27.243  24.731  88.140  56.048

We find that the final results of the two settings are similar, and both slightly improve on the MLE training baseline. The reason for the improvement is that reinforcement learning addresses the exposure bias problem of MLE training. However, this scheme lacks any measurement of the similarity between the generated descriptions and the image content, which prevents the image caption generator from generating more naturalistic and diverse descriptions.
TABLE III
Comparison of image captioning results on the MSCOCO dataset with different settings for Policy Gradient (PG) optimization

Methods                                                            BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  CIDEr   ROUGE-L
MLE training only                                                  72.611  52.769  37.802  27.243  24.731  88.140  56.048
PG with 2500 steps for pre-training D followed by 1 D and 1 G step 72.450  52.845  38.141  27.551  24.543  87.416  55.876
PG with 2500 steps for pre-training D followed by 5 D and 1 G step 72.104  52.739  38.122  27.602  24.928  89.072  56.063

Fig. 8. The loss curve of the image caption generator during the reinforcement learning steps: before 2500 iterations, we pre-train the discriminator. Starting from iteration 2500, we begin the adversarial training of the generator and the discriminator. The loss value starts to decrease from iteration 2500 as the parameters of the generator begin to be updated.

d) The performance of Policy Gradient with reward from the measurement of coherence between language and image content: To train the image caption generator to generate more naturalistic and diverse descriptions, we further test the model defined in Fig. 2. First, we extract only the global features and perform average pooling, resulting in a feature dimension of 2048. We then use the dot product to compare these image features with the language embedding features in a discriminator, whose output can be considered as the reward within the reinforcement learning framework. The experimental results of this model can be seen in Table IV.

However, the results on all seven metrics are even lower than the MLE training baseline. One possible reason is that the discriminator's measurement uses only the global features, which is not consistent with the hierarchical attention model on the generator side. As can be seen from Table IV, the results of this model are similar to those of the global attention model, since the reward signal from the discriminator tends to force the generator to produce sentences that match only the global features.

We further build a model exactly as defined in Fig. 3. This model includes both the global image features and the local object features, and thus guarantees that the discriminator and the generator are utilizing the same information source. The final results can be seen in Table IV and outperform all of the other experimental settings.

To prove the effectiveness of the proposed method, we compare our final results on the "Karpathy" test split with previously published results, as shown in Table V. We list most of the published results on the "Karpathy" split, grouped into three categories. The first category corresponds to methods without external information and without reinforcement learning. The best of them (SCA-CNN-ResNet) is the spatial and channel-wise attention model [38], in which both the spatial and channel-wise attention mechanisms are utilized for image captioning. The methods in the second group use extra information during the training of the model; for instance, Semantic Attention [36] utilizes rich extra data from social media to train the visual attribute predictor, and Deep Compositional Captioning (DCC) [66] generates extra data to demonstrate its unique transfer capability. The third group corresponds to reinforcement learning techniques. RL with G-GAN [22] applies a conditional GAN and policy gradients to generate image descriptions; although their results on the evaluation metrics are not improved, they show that the generated captions are more diverse and naturalistic. Embedding Reward [67] applies a policy network to generate captions and a value network to evaluate the reward; additionally, they apply an advanced inference method called lookahead inference, together with beam search, during testing, and they achieve the current state-of-the-art results on the "Karpathy" split. Although we do not use any external knowledge or any advanced inference technique (including beam search; we use greedy search in all of our experiments), we achieve results similar to the current state-of-the-art methods (Embedding Reward [67] and SCA-CNN-ResNet [38]), with state-of-the-art results on three important metrics, BLEU-1, METEOR and ROUGE-L, where we lead the other methods significantly.

2) Qualitative Evaluation: In addition to the quantitative evaluation using the standard metrics, we evaluate the proposed model qualitatively by visualization. Firstly, we plot some global attention maps corresponding to each generated word, as shown in Fig. 5. It is evident in the figure that the attended regions normally correspond to the semantic meaning of the generated word at each time step. Then we choose some examples to visualize the local attention weights on the detected objects, shown in Fig. 6. We only show the top 10 detected objects and the corresponding attention weights obtained from the local attention mechanism because of the limited space in the figure. The detector can detect some fine-grained objects, which provide complementary information for the global attention mechanism. Finally, we show some sentences generated by the different methods. Specifically, we show the ground-truth sentences, descriptions generated by the MLE training-based model, and descriptions generated by the proposed model in Fig. 7. The sentences generated by the proposed model are more accurate and naturalistic than those of the MLE-based model. In particular, the proposed model shows superior performance in finding the fine-grained properties of the image, since the RL model
TABLE IV
Comparison of image captioning results on the MSCOCO dataset for Policy Gradient (PG) optimization with a discriminator that evaluates the coherence between language and image content

Methods                                                         BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  CIDEr   ROUGE-L
MLE training only                                               72.611  52.769  37.802  27.243  24.731  88.140  56.048
Global Attention                                                70.121  50.304  35.434  25.111  23.658  84.701  54.308
PG with similarity of global features (1 D and 1 G step)        72.250  52.290  37.099  26.331  23.815  84.516  55.238
PG with similarity of global features (5 D and 1 G step)        72.234  52.120  36.887  26.065  23.957  84.224  55.244
PG with similarity of global-local features (1 D and 1 G step)  73.036  53.688  39.069  28.551  25.324  92.449  56.539

TABLE V
Comparison of image captioning results on the MSCOCO dataset with previous methods, where 1 indicates that external information is used during the training process and 2 means that reinforcement learning is applied to optimize the model

Methods                          BLEU-1  BLEU-2  BLEU-3  BLEU-4  METEOR  CIDEr   ROUGE-L
Google NIC [15]                  66.6    46.1    32.9    24.6    -       -       -
m-RNN [29]                       67      49      35      25      -       -       -
BRNN [14]                        64.2    45.1    30.4    20.3    -       -       -
MSR/CMU [65]                     -       -       -       19.0    20.4    -       -
Spatial Attention [19]           71.8    50.4    35.7    25.0    23.0    -       -
gLSTM [30]                       67.0    49.1    35.8    26.4    22.7    81.3    -
GLA [20]                         56.8    37.2    23.2    14.6    16.6    36.2    41.9
MIXER [44]                       -       -       -       29.0    -       -       -
SCA-CNN-ResNet [38]              71.9    54.8    41.1    31.1    25.0    -       -
Semantic Attention1 [36]         70.9    53.7    40.2    30.4    24.3    -       -
DCC1 [66]                        64.4    -       -       -       21.0    -       -
RL with G-GAN2 [22]              -       -       30.5    29.7    22.4    79.5    47.5
RL with Embedding Reward2 [67]   71.3    53.9    40.3    30.4    25.1    93.7    52.5
Ours2                            73.036  53.688  39.069  28.551  25.324  92.449  56.539

automatically measures the coherence between the sentences and the image content. For instance, in Fig. 7 (c), the proposed model successfully determines the gender of the person in the image, whilst the MLE training-based model gets it wrong.

V. CONCLUSION

This paper targets the image captioning task, which is a fundamental problem in artificial intelligence. Based on the recent successes of deep learning, especially CNN feature representations and the LSTM with an attention model, the paper proposes a hierarchical attention mechanism that considers not only the global image features but also detected object features, with improved results. A significant improvement over the current RNN-based MLE training has also been demonstrated. Specifically, a GAN framework with RL optimization is proposed for the image captioning task to generate more accurate and higher-quality captions. The discriminator evaluates the coherence and consistency between the generated sentences and the image content, thus providing the rewards for optimization. The whole model follows a three-step training strategy. Experimental analysis confirms the merits of the framework and the key contributors to the improved performance. Results comparable with current state-of-the-art methods are achieved using only greedy inference, which proves the effectiveness of the training procedure.

REFERENCES

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations (ICLR), 2015.
[3] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[4] S. Tang, Y. T. Zheng, Y. Wang, and T. S. Chua, "Sparse ensemble learning for concept detection," IEEE Transactions on Multimedia, vol. 14, no. 1, pp. 43–54, Feb 2012.
[5] C. Kang, S. Xiang, S. Liao, C. Xu, and C. Pan, "Learning consistent feature representation for cross-modal multimedia retrieval," IEEE Transactions on Multimedia, vol. 17, no. 3, pp. 370–381, March 2015.
[6] S. Bu, Z. Liu, J. Han, J. Wu, and R. Ji, "Learning high-level feature by deep belief networks for 3-d model retrieval and recognition," IEEE Transactions on Multimedia, vol. 16, no. 8, pp. 2154–2167, Dec 2014.
[7] P. Liu, J. M. Guo, C. Y. Wu, and D. Cai, "Fusion of deep learning and compressed domain features for content-based image retrieval," IEEE Transactions on Image Processing, vol. 26, no. 12, pp. 5706–5717, Dec 2017.
[8] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[9] R. Girshick, "Fast r-cnn," in Computer Vision (ICCV), 2015 IEEE International Conference on. IEEE, 2015, pp. 1440–1448.
[10] S. Ren, K. He, R. Girshick, and J. Sun, "Faster r-cnn: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[11] S. Tang, Y. Li, L. Deng, and Y. Zhang, "Object localization based on proposal fusion," IEEE Transactions on Multimedia, vol. 19, no. 9, pp. 2105–2116, 2017.
[12] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, "Babytalk: Understanding and generating simple image descriptions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 12, pp. 2891–2903, 2013.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 12

[13] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt et al., “From captions to visual concepts and back,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1473–1482.
[14] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
[15] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3156–3164.
[16] K. Cho, B. van Merrienboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder–decoder approaches,” in Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014, pp. 103–111.
[17] K. Cho, A. Courville, and Y. Bengio, “Describing multimedia content using attention-based encoder-decoder networks,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1875–1886, 2015.
[18] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in International Conference on Learning Representations (ICLR), 2015.
[19] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning, 2015, pp. 2048–2057.
[20] L. Li, S. Tang, Y. Zhang, L. Deng, and Q. Tian, “Gla: Global-local attention for image description,” IEEE Transactions on Multimedia, vol. PP, no. 99, pp. 1–1, 2017.
[21] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in Advances in Neural Information Processing Systems, 2015, pp. 1171–1179.
[22] B. Dai, S. Fidler, R. Urtasun, and D. Lin, “Towards diverse and natural image descriptions via a conditional gan,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2970–2979.
[23] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
[24] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
[25] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1.
[26] L. Yu, W. Zhang, J. Wang, and Y. Yu, “Seqgan: Sequence generative adversarial nets with policy gradient,” in AAAI, 2017, pp. 2852–2858.
[27] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in neural information processing systems, 2000, pp. 1057–1063.
[28] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[29] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille, “Deep captioning with multimodal recurrent neural networks (m-rnn),” arXiv preprint arXiv:1412.6632, 2014.
[30] X. Jia, E. Gavves, B. Fernando, and T. Tuytelaars, “Guiding long-short term memory for image caption generation,” arXiv preprint arXiv:1509.04942, 2015.
[31] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 2625–2634.
[32] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
[33] V. Mnih, N. Heess, A. Graves et al., “Recurrent models of visual attention,” in Advances in neural information processing systems, 2014, pp. 2204–2212.
[34] J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with visual attention,” arXiv preprint arXiv:1412.7755, 2014.
[35] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville, “Describing videos by exploiting temporal structure,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4507–4515.
[36] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo, “Image captioning with semantic attention,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4651–4659.
[37] T. Yao, Y. Pan, Y. Li, Z. Qiu, and T. Mei, “Boosting image captioning with attributes,” in 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017, pp. 4904–4912.
[38] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua, “Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5659–5667.
[39] A. Goyal, N. R. Ke, A. Lamb, R. D. Hjelm, C. Pal, J. Pineau, and Y. Bengio, “Actual: Actor-critic under adversarial learning,” arXiv preprint arXiv:1711.04755, 2017.
[40] I. Goodfellow, “Nips 2016 tutorial: Generative adversarial networks,” arXiv preprint arXiv:1701.00160, 2016.
[41] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 2002, pp. 311–318.
[42] A. Lavie and A. Agarwal, “Meteor: An automatic metric for mt evaluation with improved correlation with human judgments,” in Proceedings of the EMNLP 2011 Workshop on Statistical Machine Translation, 2005, pp. 65–72.
[43] C.-Y. Lin and E. Hovy, “Automatic evaluation of summaries using n-gram co-occurrence statistics,” in Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1. Association for Computational Linguistics, 2003, pp. 71–78.
[44] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, “Sequence level training with recurrent neural networks,” in International Conference on Learning Representations (ICLR), 2016.
[45] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved image captioning via policy gradient optimization of spider,” in 2017 IEEE International Conference on Computer Vision (ICCV), Oct 2017, pp. 873–881.
[46] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “Spice: Semantic propositional image caption evaluation,” in European Conference on Computer Vision. Springer, 2016, pp. 382–398.
[47] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 4566–4575.
[48] D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio, “An actor-critic algorithm for sequence prediction,” in International Conference on Learning Representations (ICLR), 2017.
[49] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
[50] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
[51] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,” arXiv preprint arXiv:1611.01144, 2016.
[52] C. J. Maddison, A. Mnih, and Y. W. Teh, “The concrete distribution: A continuous relaxation of discrete random variables,” arXiv preprint arXiv:1611.00712, 2016.
[53] R. Shetty, M. Rohrbach, L. A. Hendricks, M. Fritz, and B. Schiele, “Speaking the same language: Matching machine to human captions by adversarial training,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[54] M. J. Kusner and J. M. Hernández-Lobato, “Gans for sequences of discrete elements with the gumbel-softmax distribution,” arXiv preprint arXiv:1611.04051, 2016.
[55] L. Wu, Y. Xia, L. Zhao, F. Tian, T. Qin, J. Lai, and T.-Y. Liu, “Adversarial neural machine translation,” arXiv preprint arXiv:1704.06933, 2017.
[56] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[57] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[58] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
[59] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning, vol. 4, no. 2, pp. 26–31, 2012.
[60] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.
[61] P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, “A tutorial on the cross-entropy method,” Annals of operations research, vol. 134, no. 1, pp. 19–67, 2005.
[62] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.
[63] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014, pp. 675–678.
[64] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
[65] X. Chen and C. L. Zitnick, “Mind’s eye: A recurrent visual representation for image caption generation,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 2422–2431.
[66] L. A. Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell, “Deep compositional captioning: Describing novel object categories without paired training data,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 1–10.
[67] Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li, “Deep reinforcement learning-based image captioning with embedding reward,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 290–298.