Show and Tell: Lessons Learned From the 2015 MSCOCO Image Captioning Challenge
INTRODUCTION
[Figure 1: Vision (Deep CNN) -> Language Generating RNN -> "A group of people shopping at an outdoor market."]
Fig. 1. NIC, our model, is based end-to-end on a neural network consisting of a vision CNN followed by a language-generating RNN. It generates complete sentences in natural language from an input image, as shown in the example above.
RELATED WORK
MODEL
In this paper, we propose a neural and probabilistic framework to generate descriptions from images. Recent advances in statistical machine translation have shown that, given a powerful sequence model, it is possible to achieve state-of-the-art results by directly maximizing the probability of the correct translation given an input sentence, in an end-to-end fashion both for training and inference. These models make use of a recurrent neural network which encodes the variable-length input into a fixed-dimensional vector, and use this representation to decode it into the desired output sentence. It is therefore natural to apply the same approach where, given an image (instead of an input sentence in the source language), one translates it into its description.
Thus, we propose to directly maximize the probability
of the correct description given the image by using the
following formulation:
? = arg max
log p(S|I; )
(1)
(I,S)
\log p(S \mid I) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1})    (2)
3.1 LSTM-based Sentence Generator

The core of the memory block is a cell c, whose update is controlled by an input gate i, a forget gate f, and an output gate o:

i_t = \sigma(W_{ix} x_t + W_{im} m_{t-1})    (4)
f_t = \sigma(W_{fx} x_t + W_{fm} m_{t-1})    (5)
o_t = \sigma(W_{ox} x_t + W_{om} m_{t-1})    (6)
c_t = f_t \odot c_{t-1} + i_t \odot h(W_{cx} x_t + W_{cm} m_{t-1})    (7)
m_t = o_t \odot c_t    (8)
p_{t+1} = \mathrm{Softmax}(m_t)    (9)

where \odot denotes element-wise multiplication with the gate value, \sigma(\cdot) is the sigmoid non-linearity, h(\cdot) is the hyperbolic tangent, and the W matrices are trained parameters.
Fig. 2. LSTM: the memory block contains a cell c which is controlled by three gates. In blue we show the recurrent connections: the output m at time t-1 is fed back to the memory at time t via the three gates; the cell value is fed back via the forget gate; the predicted word at time t-1 is fed back, in addition to the memory output m at time t, into the Softmax for word prediction.
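To make the gate and cell updates concrete, here is a minimal NumPy sketch of Equations (4)-(9). The weight names (W_ix, W_im, ...) follow the notation above; the toy dimensions, the random initialization, and the explicit output projection W_out are illustrative assumptions, not the trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lstm_step(x_t, m_prev, c_prev, W):
    """One update of the memory block: gates i, f, o control the cell c and output m."""
    i_t = sigmoid(W["ix"] @ x_t + W["im"] @ m_prev)                       # input gate,    Eq. (4)
    f_t = sigmoid(W["fx"] @ x_t + W["fm"] @ m_prev)                       # forget gate,   Eq. (5)
    o_t = sigmoid(W["ox"] @ x_t + W["om"] @ m_prev)                       # output gate,   Eq. (6)
    c_t = f_t * c_prev + i_t * np.tanh(W["cx"] @ x_t + W["cm"] @ m_prev)  # cell update,   Eq. (7)
    m_t = o_t * c_t                                                       # memory output, Eq. (8)
    return m_t, c_t

# Toy dimensions and random weights, for illustration only.
d_in, d_hid, vocab = 8, 16, 100
rng = np.random.default_rng(0)
W = {k: rng.normal(scale=0.1, size=(d_hid, d_in if k.endswith("x") else d_hid))
     for k in ["ix", "im", "fx", "fm", "ox", "om", "cx", "cm"]}
W_out = rng.normal(scale=0.1, size=(vocab, d_hid))   # explicit projection before the Softmax

m, c = np.zeros(d_hid), np.zeros(d_hid)
x = rng.normal(size=d_in)
m, c = lstm_step(x, m, c, W)
p_next = softmax(W_out @ m)                          # Eq. (9): p_{t+1} = Softmax(m_t)
print(p_next.shape, p_next.sum())                    # a (100,) distribution over the vocabulary
```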
Fig. 3. LSTM model combined with a CNN image embedder (as defined
in [24]) and word embeddings. The unrolled connections between the
LSTM memories are in blue and they correspond to the recurrent
connections in Figure 2. All LSTMs share the same parameters.
3.1.1 Training
x_{-1} = \mathrm{CNN}(I)    (10)
x_t = W_e S_t, \quad t \in \{0, \ldots, N-1\}    (11)
p_{t+1} = \mathrm{LSTM}(x_t), \quad t \in \{0, \ldots, N-1\}    (12)

L(I, S) = -\sum_{t=1}^{N} \log p_t(S_t)    (13)
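A sketch of the unrolled training computation in Equations (10)-(13), reusing lstm_step, softmax, and the toy weights from the sketch above; the CNN is stubbed out by a fixed vector and the caption is a list of vocabulary indices, both purely illustrative.

```python
def caption_loss(cnn_feature, sentence_ids, W_e, W, W_out, d_hid):
    """Negative log-likelihood of a caption: Eq. (13), using the unrolling of Eqs. (10)-(12)."""
    m, c = np.zeros(d_hid), np.zeros(d_hid)
    m, c = lstm_step(cnn_feature, m, c, W)        # Eq. (10): the image is fed only once, at t = -1
    loss = 0.0
    for t in range(len(sentence_ids) - 1):
        x_t = W_e[sentence_ids[t]]                # Eq. (11): x_t = W_e S_t (embedding lookup)
        m, c = lstm_step(x_t, m, c, W)            # Eq. (12)
        p = softmax(W_out @ m)
        loss -= np.log(p[sentence_ids[t + 1]])    # Eq. (13): -sum_t log p_t(S_t)
    return loss

W_e = rng.normal(scale=0.1, size=(vocab, d_in))   # word embeddings, same input size as the image vector
image_vec = rng.normal(size=d_in)                 # stand-in for CNN(I)
caption = [1, 5, 7, 2]                            # S_0 .. S_N as (hypothetical) vocabulary indices
print(caption_loss(image_vec, caption, W_e, W, W_out, d_hid))
```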
Inference
EXPERIMENTS
4.1 Evaluation Metrics
4.2 Datasets

Dataset sizes (number of images):

Dataset           train    valid.   test
Pascal VOC 2008   -        -        1000
Flickr8k          6000     1000     1000
Flickr30k         28000    1000     1000
MSCOCO            82783    40504    40775
SBU               1M       -        -
4.3 Results
4.3.1 Training Details
4.3.2 Generation Results
TABLE 1
Scores on the MSCOCO development set for two models: NIC, the model we developed in [46], and NICv2, the model after we tuned and refined our system for the MSCOCO competition.

Metric   NIC    NICv2   Random   Nearest Neighbor   Human
BLEU-4   27.7   32.1    4.6      9.9                21.7
METEOR   23.7   25.7    9.0      15.7               25.2
CIDEr    85.5   99.8    5.1      36.5               85.4
TABLE 2
BLEU-1 scores. We only report previous work results when available. SOTA stands for the current state-of-the-art.

Approach        PASCAL (xfer)   Flickr30k   Flickr8k   SBU
Im2Text [18]    -               -           -          11
TreeTalk [14]   -               -           -          19
BabyTalk [3]    25              -           -          -
Tri5Sem [16]    -               -           48         -
m-RNN [27]      -               55          58         -
MNLM [29]       -               56          51         -
SOTA            25              56          58         19
NIC             59              66          63         28
Human           69              68          70         -
and not four, we add back to the human scores the average difference of having five references instead of four.
Given that the field has seen significant advances in recent years, we think it is more meaningful to report BLEU-4, which is the standard in machine translation going forward. Additionally, we report metrics shown to correlate better with human evaluations in Table 1. Despite recent efforts on better evaluation metrics [39], our model scores strongly against the human captions on these automatic metrics. However, when evaluating our captions using human raters (see Section 4.3.6), our model fares much more poorly, suggesting more work is needed towards better metrics. For a more detailed description and comparison of our results on the MSCOCO dataset, and other interesting human metrics, see Section 5. In that section, we detail the lessons learned from additional tuning of our model, comparing the original model submitted in a previous version of this manuscript [46] (NIC in Table 1) with the latest version used for the competition (NICv2 in Table 1).
TABLE 4
Recall@k and median rank on Flickr8k.

                Image Annotation         Image Search
Approach        R@1   R@10   Med r       R@1   R@10   Med r
DeFrag [22]     13    44     14          10    43     15
m-RNN [27]      15    49     11          12    42     15
MNLM [29]       18    55     8           13    52     10
NIC             20    61     6           19    64     5
TABLE 5
Recall@k and median rank on Flickr30k.

                Image Annotation         Image Search
Approach        R@1   R@10   Med r       R@1   R@10   Med r
DeFrag [22]     16    55     8           10    45     13
m-RNN [27]      18    51     10          13    42     16
MNLM [29]       23    63     5           17    57     8
NIC             17    56     7           17    57     7
ranking scores, using the set of test captions as candidates to rank given a test image. The approach that works best on these metrics (MNLM) specifically implemented a ranking-aware loss. Nevertheless, NIC does surprisingly well on both ranking tasks (ranking descriptions given images, and ranking images given descriptions), as can be seen in Tables 4 and 5. Note that for the Image Annotation task we normalized our scores in the same way as [27].
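For concreteness, a small sketch (not the paper's evaluation code) of how Recall@K and the median rank can be computed from a score matrix whose entry (i, j) measures how well caption j matches image i, with the correct pairs assumed to lie on the diagonal.

```python
import numpy as np

def ranking_metrics(scores):
    """Recall@K and median rank, assuming the correct caption for image i is caption i."""
    n = scores.shape[0]
    order = np.argsort(-scores, axis=1)            # candidates sorted by decreasing score
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(n)])  # 1-based rank of the match
    return {"R@1": float(np.mean(ranks <= 1)),
            "R@10": float(np.mean(ranks <= 10)),
            "Med r": float(np.median(ranks))}

rng = np.random.default_rng(0)
scores = rng.normal(size=(1000, 1000))             # placeholder image-caption compatibility scores
print(ranking_metrics(scores))                     # image annotation: rank captions given an image
print(ranking_metrics(scores.T))                   # image search: rank images given a caption
```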
4.3.6 Human Evaluation
Figure 4 shows the result of the human evaluations of the descriptions provided by NIC, as well as a reference system and the ground truth, on various datasets. We can see that NIC is better than the reference system, but clearly worse than the ground truth, as expected. This shows that BLEU is not a perfect metric, as it does not capture well the difference between NIC and human descriptions assessed by raters. Examples of rated images can be seen in Figure 5. It is interesting to see, for instance in the second image of the first column, how the model was able to notice the frisbee given its size.
4.3.7 Analysis of Embeddings
In order to represent the previous word S_{t-1} as input to the decoding LSTM producing S_t, we use word embedding vectors [36], which have the advantage of being independent of the size of the dictionary (in contrast to a simpler one-hot encoding approach). Furthermore, these word embeddings can be jointly trained with the rest of the model. It is remarkable to see how the learned representations have captured some semantics from the statistics of the language. Table 6 shows, for a few example words, the nearest other words found in the learned embedding space.
Note how some of the relationships learned by the model will help the vision component. Indeed, having horse, pony, and donkey close to each other will encourage the CNN to extract features that are relevant to horse-looking animals. We hypothesize that, in the extreme case where we see very few examples of a class (e.g., unicorn), its proximity to other word embeddings (e.g., horse) should provide a lot of information that would otherwise be completely lost with more traditional bag-of-words approaches.
TABLE 6
Nearest neighbors of a few example words in the learned embedding space.

Word       Neighbors
car        van, cab, suv, vehicule, jeep
boy        toddler, gentleman, daughter, son
street     road, streets, highway, freeway
horse      pony, donkey, pig, goat, mule
computer   computers, pc, crt, chip, compute
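A minimal sketch of how such neighbors can be read off a learned embedding matrix with cosine similarity; the toy vocabulary and random embeddings below merely stand in for the jointly trained W_e.

```python
import numpy as np

def nearest_neighbors(word, vocab, E, k=5):
    """The k words whose embedding rows are closest (by cosine similarity) to `word`."""
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E_norm @ E_norm[vocab.index(word)]
    best = np.argsort(-sims)
    return [vocab[i] for i in best if vocab[i] != word][:k]

vocab = ["horse", "pony", "donkey", "car", "van", "street", "road"]   # toy vocabulary
E = np.random.default_rng(0).normal(size=(len(vocab), 32))            # stands in for the trained W_e
print(nearest_neighbors("horse", vocab, E))
```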
THE MSCOCO IMAGE CAPTIONING CHALLENGE
Metrics
TABLE 7
Pearson correlation and human rankings found in the official MSCOCO competition table for several automatic metrics (using 40 ground-truth captions in the test set).

              CIDEr   METEOR   ROUGE   BLEU-4
Human Rank    6       3        11      13
TABLE 8
A summary of the improvements we introduced for the MSCOCO competition. The reported improvements are on BLEU-4, but similar gains are consistent across all the metrics.

Technique                    BLEU-4 Improvement
Better Image Model [24]      2
Beam Size Reduction          2
Fine-tuning Image Model      1
Scheduled Sampling [47]      1.5
Ensembles                    1.5
1.5
at the time, known as GoogleLeNet [48], which had 22 layers, and was the winner of the 2014 ImageNet competition.
Later on, an even better approach was proposed in [24] and
included a new method, called Batch Normalization, to better
normalize each layer of a neural network with respect to the
current batch of examples, so as to be more robust to nonlinearities. The new approach got significant improvement
on the ImageNet task (going from 6.67% down to 4.8% top-5
error) and the MSCOCO image captioning task, improving
BLEU-4 by 2 points absolute.
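For reference, a bare-bones sketch of the batch-normalization transform itself (training-time batch statistics only; the running averages used at inference are omitted). This illustrates the general technique of [24], not the exact configuration of the image model.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the current batch, then apply a learned scale and shift.
    x: (batch, features); gamma, beta: per-feature parameters."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(64, 10))
y = batch_norm(x, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))   # roughly zero mean, unit variance per feature
```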
5.2.2 Image Model Fine Tuning
In the original set of experiments, to avoid overfitting we initialized the image convolutional network with a pretrained model (we first used GoogLeNet, then switched to the better Batch Normalization model), but then fixed its parameters and only trained the LSTM part of the model on the MSCOCO training set.
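A schematic sketch of the two-stage recipe behind this subsection: first update only the LSTM (and word-embedding) parameters while the CNN weights stay frozen, then, for fine-tuning, include the CNN weights at a smaller learning rate. The parameter groups and gradients below are placeholders; only the structure of the update is meant to be illustrative.

```python
import numpy as np

def sgd_step(params, grads, lr, trainable):
    """Update only the parameter groups named in `trainable`; all others stay frozen."""
    for name in trainable:
        params[name] = params[name] - lr * grads[name]
    return params

rng = np.random.default_rng(0)
params = {"cnn": rng.normal(size=(4, 4)),          # stands in for the pre-trained image model
          "lstm": rng.normal(size=(4, 4)),
          "embeddings": rng.normal(size=(4, 4))}
grads = {k: rng.normal(size=v.shape) for k, v in params.items()}   # placeholder gradients

# Stage 1: the CNN is frozen; only the LSTM and word embeddings are trained.
params = sgd_step(params, grads, lr=0.01, trainable=["lstm", "embeddings"])

# Stage 2 (fine-tuning): unfreeze the CNN and continue with a smaller learning rate,
# so that the image features can adapt to captioning without being destroyed.
params = sgd_step(params, grads, lr=0.001, trainable=["cnn", "lstm", "embeddings"])
```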
Scheduled Sampling
Ensembling
Ensembles [49] have long been known to be a very simple yet effective way to improve the performance of machine learning systems. In the context of deep architectures, one only needs to separately train multiple models on the same task, potentially varying some of the training conditions, and aggregate their answers at inference time. For the competition, we created an ensemble of 5 models trained with Scheduled Sampling and 10 models trained with fine-tuning of the image model. The resulting ensemble was submitted to the competition, and it further improved our results by 1.5 BLEU-4 points.
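A minimal sketch of inference-time ensembling as described above: each model yields a distribution over the next word, the distributions are averaged, and decoding proceeds from the averaged distribution. The models below are random stand-ins for the separately trained captioners.

```python
import numpy as np

def ensemble_next_word(models, state):
    """Average the per-model next-word distributions, then pick the most likely word."""
    probs = np.mean([m(state) for m in models], axis=0)
    return int(np.argmax(probs)), probs

def make_model(seed, vocab_size=10):
    """Dummy stand-in for a trained captioner: returns a next-word distribution given a state."""
    r = np.random.default_rng(seed)
    def model(state):
        logits = r.normal(size=vocab_size)
        e = np.exp(logits - logits.max())
        return e / e.sum()
    return model

models = [make_model(s) for s in range(15)]        # e.g. 5 + 10 separately trained models
word, probs = ensemble_next_word(models, state=None)
print(word, probs.round(3))
```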
Competition Results
TABLE 9
CIDEr scores of the top five competition submissions and of the human captions.

Approach              CIDEr
Google [46]           0.943
MSR Captivator [34]   0.931
m-RNN [28]            0.917
MSR [23]              0.912
m-RNN (2) [28]        0.886
Human                 0.854
Fig. 6. A selection of evaluation images, comparing the captions obtained by our original model (InitialModel) and the model submitted to the
competition (BestModel).
TABLE 10
Human generated scores of the top five competition submissions.

Approach                M1      M2      M3      M4      M5      Rank
Google [46]             0.273   0.317   4.107   2.742   0.233   1st
MSR [23]                0.268   0.322   4.137   2.662   0.234   1st
MSR Captivator [34]     0.250   0.301   4.149   2.565   0.233   3rd
Montreal/Toronto [31]   0.262   0.272   3.932   2.832   0.197   3rd
Berkeley LRCN [30]      0.246   0.268   3.924   2.786   0.204   5th
Human                   0.638   0.675   4.836   3.428   0.352   1st
CONCLUSION
ACKNOWLEDGMENTS
We would like to thank Geoffrey Hinton, Ilya Sutskever, Quoc Le, Vincent Vanhoucke, and Jeff Dean for useful discussions on the ideas behind the paper and on the write-up.
REFERENCES
Samy Bengio (PhD in computer science, University of Montreal, 1993) has been a research scientist at Google since 2007. Before that, he was a senior researcher in statistical machine learning at the IDIAP Research Institute from 1999. His most recent research interests are in machine learning, in particular deep learning, large-scale online learning, image ranking and annotation, and music and speech processing. He is an action editor of the Journal of Machine Learning Research and is on the editorial board of the Machine Learning Journal. He was an associate editor of the Journal of Computational Statistics, general chair of the Workshops on Machine Learning for Multimodal Interactions (MLMI 2004-2006), programme chair of the International Conference on Learning Representations (ICLR 2015-2016), programme chair of the IEEE Workshop on Neural Networks for Signal Processing (NNSP 2002), chair of BayLearn (2012-2015), and has served several times on the programme committee of international conferences such as NIPS, ICML, ECML and ICLR. More information can be found on his website: http://bengio.abracadoudou.com.
Dumitru Erhan (PhD in computer science, University of Montreal, 2011) has been a software engineer at Google since 2012. Before that, he was a scientist at Yahoo! Labs from 2011 to 2012. His research interests span the intersection of deep learning, computer vision and natural language. In particular, he is interested in efficient models for understanding what is in an image and where it is, as well as for answering arbitrary questions about images. More information at http://dumitru.ca.