Deep Visual-Semantic Alignments For Generating Image Descriptions
representations so that semantically similar concepts across the two modalities occupy nearby regions of the space.

3.1.1 Representing images

Following prior work [29, 24], we observe that sentence descriptions make frequent references to objects and their attributes. Thus, we follow the method of Girshick et al. [17] to detect objects in every image with a Region Convolutional Neural Network (RCNN). The CNN is pre-trained on ImageNet [6] and finetuned on the 200 classes of the ImageNet Detection Challenge [45]. Following Karpathy et al. [24], we use the top 19 detected locations in addition to the whole image and compute the representations based on the pixels I_b inside each bounding box as follows:

v = W_m [CNN_{\theta_c}(I_b)] + b_m,    (1)

where CNN_{\theta_c}(I_b) transforms the pixels inside bounding box I_b into 4096-dimensional activations of the fully connected layer immediately before the classifier. The CNN parameters \theta_c contain approximately 60 million parameters. The matrix W_m has dimensions h \times 4096, where h is the size of the multimodal embedding space (h ranges from 1000-1600 in our experiments). Every image is thus represented as a set of h-dimensional vectors {v_i | i = 1 . . . 20}.
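As a concrete illustration, Eq. (1) is an affine projection of the 4096-dimensional CNN activations of each of the 20 regions into the multimodal space. The NumPy sketch below only shows this shape bookkeeping; the random arrays stand in for the finetuned RCNN features and the learned W_m, b_m.

    import numpy as np

    h = 1000          # size of the multimodal embedding space (1000-1600 in our experiments)
    num_regions = 20  # top 19 RCNN detections plus the whole image

    # Stand-in for CNN_{theta_c}(I_b): fc activations of each region. In the real
    # pipeline these come from the finetuned RCNN, not from a random generator.
    cnn_features = np.random.randn(num_regions, 4096)

    W_m = 0.01 * np.random.randn(h, 4096)  # learned projection (random placeholder)
    b_m = np.zeros(h)

    # Eq. (1): v = W_m [CNN_{theta_c}(I_b)] + b_m, applied to every region
    V = cnn_features @ W_m.T + b_m
    print(V.shape)  # (20, h): the set {v_i | i = 1..20}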
3.1.2 Representing sentences

To establish the inter-modal relationships, we would like to represent the words in the sentence in the same h-dimensional embedding space that the image regions occupy. The simplest approach might be to project every individual word directly into this embedding. However, this approach does not consider any ordering and word context information in the sentence. An extension to this idea is to use word bigrams, or dependency tree relations as previously proposed [24]. However, this still imposes an arbitrary maximum size of the context window and requires the use of Dependency Tree Parsers that might be trained on unrelated text corpora.

To address these concerns, we propose to use a Bidirectional Recurrent Neural Network (BRNN) [46] to compute the word representations. The BRNN takes a sequence of N words (encoded in a 1-of-k representation) and transforms each one into an h-dimensional vector. However, the representation of each word is enriched by a variably-sized context around that word. Using the index t = 1 . . . N to denote the position of a word in a sentence, the precise form of the BRNN is as follows:

x_t = W_w \mathbb{1}_t    (2)
e_t = f(W_e x_t + b_e)    (3)
h^f_t = f(e_t + W_f h^f_{t-1} + b_f)    (4)
h^b_t = f(e_t + W_b h^b_{t+1} + b_b)    (5)
s_t = f(W_d (h^f_t + h^b_t) + b_d).    (6)

Here, \mathbb{1}_t is an indicator column vector that has a single one at the index of the t-th word in a word vocabulary. The weights W_w specify a word embedding matrix that we initialize with 300-dimensional word2vec [41] weights and keep fixed due to overfitting concerns. However, in practice we find little change in final performance when these vectors are trained, even from random initialization. Note that the BRNN consists of two independent streams of processing, one moving left to right (h^f_t) and the other right to left (h^b_t) (see Figure 3 for diagram). The final h-dimensional representation s_t for the t-th word is a function of both the word at that location and also its surrounding context in the sentence. Technically, every s_t is a function of all words in the entire sentence, but our empirical finding is that the final word representations (s_t) align most strongly to the visual concept of the word at that location (\mathbb{1}_t).

We learn the parameters W_e, W_f, W_b, W_d and the respective biases b_e, b_f, b_b, b_d. A typical size of the hidden representation in our experiments ranges between 300-600 dimensions. We set the activation function f to the rectified linear unit (ReLU), which computes f : x \mapsto \max(0, x).
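To make Eqs. (2)-(6) concrete, the following NumPy sketch runs the two recurrent streams over one sentence. The randomly initialized parameters are placeholders for the trained weights (and the word2vec-initialized W_w), so this illustrates the computation rather than reproduces our model.

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def brnn_word_vectors(word_ids, W_w, W_e, b_e, W_f, b_f, W_b, b_b, W_d, b_d):
        """Compute s_1..s_N for one sentence following Eqs. (2)-(6)."""
        N = len(word_ids)
        x = [W_w[:, i] for i in word_ids]             # Eq. (2): x_t = W_w 1_t (column lookup)
        e = [relu(W_e @ xt + b_e) for xt in x]        # Eq. (3)
        d = W_f.shape[0]
        hf, hb = [None] * N, [None] * N
        for t in range(N):                            # Eq. (4): left-to-right stream
            prev = hf[t - 1] if t > 0 else np.zeros(d)
            hf[t] = relu(e[t] + W_f @ prev + b_f)
        for t in reversed(range(N)):                  # Eq. (5): right-to-left stream
            nxt = hb[t + 1] if t < N - 1 else np.zeros(d)
            hb[t] = relu(e[t] + W_b @ nxt + b_b)
        return [relu(W_d @ (hf[t] + hb[t]) + b_d) for t in range(N)]   # Eq. (6)

    # Toy usage with random placeholder parameters: 300-d word vectors, 400-d hidden
    # states, h-dimensional outputs.
    vocab_size, word_dim, hidden, h = 5000, 300, 400, 1000
    rng = np.random.default_rng(0)
    params = dict(
        W_w=0.01 * rng.standard_normal((word_dim, vocab_size)),
        W_e=0.01 * rng.standard_normal((hidden, word_dim)), b_e=np.zeros(hidden),
        W_f=0.01 * rng.standard_normal((hidden, hidden)),   b_f=np.zeros(hidden),
        W_b=0.01 * rng.standard_normal((hidden, hidden)),   b_b=np.zeros(hidden),
        W_d=0.01 * rng.standard_normal((h, hidden)),        b_d=np.zeros(h),
    )
    s = brnn_word_vectors([12, 7, 431, 9], **params)  # a four-word sentence
    print(len(s), s[0].shape)                         # 4 vectors, each of dimension h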
3.1.3 Alignment objective

We have described the transformations that map every image and sentence into a set of vectors in a common h-dimensional space. Since the supervision is at the level of entire images and sentences, our strategy is to formulate an image-sentence score as a function of the individual region-word scores. Intuitively, a sentence-image pair should have a high matching score if its words have a confident support in the image. The model of Karpathy et al. [24] interprets the dot product v_i^T s_t between the i-th region and t-th word as a measure of similarity and uses it to define the score between image k and sentence l as:

S_{kl} = \sum_{t \in g_l} \sum_{i \in g_k} \max(0, v_i^T s_t).    (7)
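For reference, a minimal NumPy sketch of Eq. (7): given the region vectors of one image (Section 3.1.1) and the word vectors of one sentence (Section 3.1.2), the score sums the thresholded pairwise dot products. The random inputs below are placeholders for trained embeddings.

    import numpy as np

    def image_sentence_score(V, S):
        """Eq. (7): S_kl = sum_t sum_i max(0, v_i^T s_t).

        V: (num_regions, h) region vectors v_i of image k.
        S: (num_words, h) word vectors s_t of sentence l.
        """
        sims = S @ V.T                    # pairwise dot products v_i^T s_t
        return float(np.maximum(0.0, sims).sum())

    # Toy usage with random stand-ins for the learned vectors.
    rng = np.random.default_rng(0)
    V = rng.standard_normal((20, 1000))   # 20 regions in an h = 1000 space
    S = rng.standard_normal((4, 1000))    # a four-word sentence
    print(image_sentence_score(V, S))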
Figure 5. Example alignments predicted by our model. For every test image above, we retrieve the most compatible test sentence and visualize the highest-scoring region for each word (before MRF smoothing described in Section 3.1.4) and the associated scores (v_i^T s_t). We hide the alignments of low-scoring words to reduce clutter. We assign each region an arbitrary color.
…mentation of DeFrag). Compared to other work that uses AlexNets, our full model shows consistent improvement.

Our simpler cost function improves performance. We strive to better understand the source of our performance. First, we removed the BRNN and used dependency tree relations exactly as described in Karpathy et al. [24] (Our model: DepTree edges). The only difference between this model and our reimplementation of DeFrag is the new, simpler cost function introduced in Section 3.1.3. We see that our formulation shows consistent improvements.

BRNN outperforms dependency tree relations. Furthermore, when we replace the dependency tree relations with the BRNN we observe additional performance improvements. Since the dependency relations were shown to work better than single words and bigrams [24], this suggests that the BRNN is taking advantage of contexts longer than two words. Furthermore, our method does not rely on extracting a Dependency Tree and instead uses the raw words directly.

MSCOCO results for future comparisons. We are not aware of other published ranking results on MSCOCO. Therefore, we report results on a subset of 1,000 images and the full set of 5,000 test images for future comparisons. Note that the 5,000-image numbers are lower since Recall@K is a function of test set size.
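Recall@K, the ranking metric referred to here, can be computed as below. This is a generic sketch rather than our evaluation script, assuming the ground-truth match for query i sits at index i of the candidate set.

    import numpy as np

    def recall_at_k(score_matrix, k):
        """Fraction of queries whose ground-truth item is ranked within the top k.

        score_matrix[i, j] is the compatibility score of query i with candidate j,
        and the correct candidate for query i is assumed to be candidate i.
        """
        hits = 0
        for i in range(score_matrix.shape[0]):
            order = np.argsort(-score_matrix[i])   # best candidates first
            rank = int(np.where(order == i)[0][0])
            hits += rank < k
        return hits / score_matrix.shape[0]

    # With 5,000 candidates instead of 1,000, the correct match is ranked within the
    # top K less often, which is why the larger test set gives lower numbers.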
Qualitative. As can be seen from example groundings in Figure 5, the model discovers interpretable visual-semantic correspondences, even for small or relatively rare objects such as an accordion. These would be likely missed by models that only reason about full images.

Learned region and word vector magnitudes. An appealing feature of our model is that it learns to modulate the magnitude of the region and word embeddings. Due to their inner product interaction, we observe that representations of visually discriminative words such as "kayaking", "pumpkins" have embedding vectors with higher magnitudes, which in turn translates to a higher influence on the image-sentence score. Conversely, stop words such as "now", "simply", "actually", "but" are mapped near the origin, which reduces their influence. See more analysis in supplementary material.
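One simple way to inspect this effect is to sort the vocabulary by the norm of each word's learned vector; the small helper below is illustrative, with hypothetical inputs standing in for a trained model's word representations.

    import numpy as np

    def words_by_magnitude(word_vectors, vocab):
        """Sort words by the L2 norm of their h-dimensional representations.

        word_vectors: (vocab_size, h) array of trained word vectors (placeholder input).
        vocab:        list of the corresponding word strings.
        """
        norms = np.linalg.norm(word_vectors, axis=1)
        order = np.argsort(-norms)
        return [(vocab[i], float(norms[i])) for i in order]

    # Visually discriminative words ("kayaking", "pumpkins") would appear near the top
    # of this ranking, while stop words ("now", "simply") would sit near the bottom.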
                        Flickr8K                  Flickr30K                 MSCOCO 2014
Model                   B-1   B-2   B-3   B-4     B-1   B-2   B-3   B-4     B-1   B-2   B-3   B-4   METEOR  CIDEr
Nearest Neighbor         --    --    --    --      --    --    --    --    48.0  28.1  16.6  10.0    15.7   38.3
Mao et al. [38]          58    28    23    --      55    24    20    --      --    --    --    --      --     --
Google NIC [54]          63    41    27    --    66.3  42.3  27.7  18.3    66.6  46.1  32.9  24.6      --     --
LRCN [8]                 --    --    --    --    58.8  39.1  25.1  16.5    62.8  44.2  30.4    --      --     --
MS Research [12]         --    --    --    --      --    --    --    --      --    --    --  21.1    20.7     --
Chen and Zitnick [5]     --    --    --  14.1      --    --    --  12.6      --    --    --  19.0    20.4     --
Our model              57.9  38.3  24.5  16.0    57.3  36.9  24.0  15.7    62.5  45.0  32.1  23.0    19.5   66.0

Table 2. Evaluation of full image predictions on 1,000 test images. B-n is BLEU score that uses up to n-grams. High is good in all columns. For future comparisons, our METEOR/CIDEr Flickr8K scores are 16.7/31.8 and the Flickr30K scores are 15.3/24.7.
Figure 6. Example sentences generated by the multimodal RNN for test images. We provide many more examples on our project page.
4.2. Generated Descriptions: Fullframe evaluation

We now evaluate the ability of our RNN model to describe images and regions. We first trained our Multimodal RNN to generate sentences on full images with the goal of verifying that the model is rich enough to support the mapping from image data to sequences of words. For these full image experiments we use the more powerful VGGNet image features [47]. We report the BLEU [44], METEOR [7] and CIDEr [53] scores computed with the coco-caption code [4]². Each method evaluates a candidate sentence by measuring how well it matches a set of five reference sentences written by humans.

²https://github.com/tylin/coco-caption
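For reference, the publicly released coco-caption toolkit is typically driven roughly as follows. This is a hedged sketch rather than our exact evaluation script, and the annotation/result file names are placeholders.

    # Assumes the coco-caption repository (https://github.com/tylin/coco-caption)
    # is on the Python path; file names below are placeholders.
    from pycocotools.coco import COCO
    from pycocoevalcap.eval import COCOEvalCap

    coco = COCO('annotations/captions_val2014.json')      # five human references per image
    coco_res = coco.loadRes('results/generated_captions.json')

    evaluator = COCOEvalCap(coco, coco_res)
    evaluator.params['image_id'] = coco_res.getImgIds()   # score only the captioned images
    evaluator.evaluate()

    for metric, score in evaluator.eval.items():          # Bleu_1..4, METEOR, CIDEr, ...
        print(metric, score)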
Qualitative. The model generates sensible descriptions of images (see Figure 6), although we consider the last two images failure cases. The first prediction "man in black shirt is playing a guitar" does not appear in the training set. However, there are 20 occurrences of "man in black shirt" and 60 occurrences of "is playing guitar", which the model may have composed to describe the first image. In general, we find that a relatively large portion of generated sentences (60% with beam size 7) can be found in the training data. This fraction decreases with lower beam size; for instance, with beam size 1 this falls to 25%, but the performance also deteriorates (e.g. from 0.66 to 0.61 CIDEr).

Multimodal RNN outperforms retrieval baseline. Our first comparison is to a nearest neighbor retrieval baseline. Here, we annotate each test image with a sentence of the most similar training set image as determined by L2 norm over VGGNet [47] fc7 features. Table 2 shows that the Multimodal RNN confidently outperforms this retrieval method. Hence, even with 113,000 train set images in MSCOCO the retrieval approach is inadequate. Additionally, the RNN takes only a fraction of a second to evaluate per image.
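A minimal sketch of this retrieval baseline (the fc7 feature matrices and caption list are placeholder inputs; in our setup the features come from VGGNet [47]):

    import numpy as np

    def nearest_neighbor_captions(test_fc7, train_fc7, train_captions):
        """For each test image, copy a caption of the closest training image,
        where closeness is the L2 distance between fc7 feature vectors."""
        predictions = []
        for feat in test_fc7:
            dists = np.linalg.norm(train_fc7 - feat, axis=1)
            predictions.append(train_captions[int(np.argmin(dists))])
        return predictions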
Comparison to other work. Several related models have been proposed in Arxiv preprints since the original submission of this work. We also include these in Table 2 for comparison. Most similar to our model is Vinyals et al. [54]. Unlike this work, where the image information is communicated through a bias term on the first step, they incorporate it as a first word; they also use a more powerful but more complex sequence learner (LSTM [20]), a different CNN (GoogLeNet [51]), and report results of a model ensemble. Donahue et al. [8] use a 2-layer factored LSTM (similar in structure to the RNN in Mao et al. [38]). Both models appear to work worse than ours, but this is likely in large part due to their use of the less powerful AlexNet [28] features. Compared to these approaches, our model prioritizes simplicity and speed at a slight cost in performance.
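To illustrate the design choice mentioned here (the image entering the generator only as an additive bias on the first step), a greedy-decoding sketch in plain NumPy follows; the dimensions, parameters, and greedy search are illustrative assumptions rather than our full model, which uses beam search.

    import numpy as np

    def generate_greedy(image_fc, W_hi, W_hx, W_hh, b_h, W_oh, b_o, W_emb,
                        start_id, end_id, max_len=20):
        """Greedy decoding from a simple RNN that sees the image only through an
        additive bias b_v at t = 1 (a sketch of the design choice, not the full model)."""
        b_v = W_hi @ image_fc                       # image-conditioned bias
        h = np.zeros(W_hh.shape[0])
        word, out = start_id, []
        for t in range(max_len):
            x = W_emb[:, word]                      # embedding of the previous word
            h = np.maximum(0.0, W_hx @ x + W_hh @ h + b_h + (b_v if t == 0 else 0.0))
            word = int(np.argmax(W_oh @ h + b_o))   # most likely next word
            if word == end_id:
                break
            out.append(word)
        return out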
4.3. Generated Descriptions: Region evaluation

We now train the Multimodal RNN on the correspondences between image regions and snippets of text, as inferred by the alignment model.
Figure 7. Example region predictions. We use our region-level multimodal RNN to generate text (shown on the right of each image) for
some of the bounding boxes in each image. The lines are grounded to centers of bounding boxes and the colors are chosen arbitrarily.
To support the evaluation, we used Amazon Mechanical Turk (AMT) to collect a new dataset of region-level annotations that we only use at test time. The labeling interface displayed a single image and asked annotators (we used nine per image) to draw five bounding boxes and annotate each with text. In total, we collected 9,000 text snippets for 200 images in our MSCOCO test split (i.e. 45 snippets per image). The snippets have an average length of 2.3 words. Example annotations include "sports car", "elderly couple sitting", "construction site", "three dogs on leashes", "chocolate cake". We noticed that asking annotators for grounded text snippets induces language statistics different from those in full image captions. Our region annotations are more comprehensive and feature elements of scenes that would rarely be considered salient enough to be included in a single sentence about the full image, such as "heating vent", "belt buckle", and "chimney".
Qualitative. We show example region model predictions in Figure 7. To reiterate the difficulty of the task, consider for example the phrase "table with wine glasses" that is generated on the image on the right in Figure 7. This phrase only occurs in the training set 30 times. Each time it may have a different appearance and each time it may occupy a few (or none) of our object bounding boxes. To generate this string for the region, the model had to first correctly learn to ground the string and then also learn to generate it.
Region model outperforms full frame model and ranking baseline. Similar to the full image description task, we evaluate this data as a prediction task from a 2D array of pixels (one image region) to a sequence of words and record the BLEU score. The ranking baseline retrieves training sentence substrings most compatible with each region as judged by the BRNN model. Table 3 shows that the region RNN model produces descriptions most consistent with our collected data. Note that the fullframe model was trained only on full images, so feeding it smaller image regions deteriorates its performance. However, its sentences are also longer than the region model sentences, which likely negatively impacts the BLEU score. The sentence length is non-trivial to control for with an RNN, but we note that the region model also outperforms the fullframe model on all other metrics: CIDEr 61.6/20.3, METEOR 15.8/13.3, ROUGE 35.1/21.0 for region/fullframe respectively.
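A generic way to compute B-1 through B-4 for a short candidate snippet against multiple reference snippets is NLTK's smoothed sentence-level BLEU; this is an illustrative sketch with made-up snippets, not our exact evaluation code.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # Toy reference snippets in the spirit of the collected region annotations.
    references = [r.split() for r in ["sports car", "red sports car", "parked sports car"]]
    candidate = "sports car".split()

    smooth = SmoothingFunction().method1      # avoids zero scores for very short snippets
    for n in range(1, 5):
        weights = tuple([1.0 / n] * n)        # uniform weights over 1..n-grams (B-n)
        score = sentence_bleu(references, candidate, weights=weights,
                              smoothing_function=smooth)
        print(f"B-{n}: {score:.3f}")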
Model                      B-1    B-2    B-3    B-4
Human agreement            61.5   45.2   30.1   22.0
Nearest Neighbor           22.9   10.5    0.0    0.0
RNN: Fullframe model       14.2    6.0    2.2    0.0
RNN: Region level model    35.2   23.0   16.1   14.8

Table 3. BLEU score evaluation of image region annotations.

4.4. Limitations

Although our results are encouraging, the Multimodal RNN model is subject to multiple limitations. First, the model can only generate a description of one input array of pixels at a fixed resolution. A more sensible approach might be to use multiple saccades around the image to identify all entities, their mutual interactions and wider context before generating a description. Additionally, the RNN receives the image information only through additive bias interactions, which are known to be less expressive than more complicated multiplicative interactions [50, 20]. Lastly, our approach consists of two separate models. Going directly from an image-sentence dataset to region-level annotations as part of a single model trained end-to-end remains an open problem.

5. Conclusions

We introduced a model that generates natural language descriptions of image regions based on weak labels in the form of a dataset of images and sentences, and with very few hardcoded assumptions. Our approach features a novel ranking model that aligned parts of visual and language modalities through a common, multimodal embedding. We showed that this model provides state of the art performance on image-sentence ranking experiments. Second, we described a Multimodal Recurrent Neural Network architecture that generates descriptions of visual data. We evaluated its performance on both fullframe and region-level experiments and showed that in both cases the Multimodal RNN outperforms retrieval baselines.

Acknowledgements.

We thank Justin Johnson and Jon Krause for helpful comments and discussions. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research. This research is partially supported by an ONR MURI grant, and NSF ISS-1115313.
References

[1] A. Barbu, A. Bridge, Z. Burchill, D. Coroian, S. Dickinson, S. Fidler, A. Michaux, S. Mussman, S. Narayanaswamy, D. Salvi, et al. Video in sentences out. arXiv preprint arXiv:1204.2742, 2012.
[2] K. Barnard, P. Duygulu, D. Forsyth, N. De Freitas, D. M. Blei, and M. I. Jordan. Matching words and pictures. JMLR, 2003.
[3] Y. Bengio, H. Schwenk, J.-S. Senecal, F. Morin, and J.-L. Gauvain. Neural probabilistic language models. In Innovations in Machine Learning. Springer, 2006.
[4] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollar, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[5] X. Chen and C. L. Zitnick. Learning a recurrent visual representation for image caption generation. CoRR, abs/1411.5654, 2014.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[7] M. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation, 2014.
[8] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389, 2014.
[9] D. Elliott and F. Keller. Image description using visual dependency representations. In EMNLP, pages 1292-1302, 2013.
[10] J. L. Elman. Finding structure in time. Cognitive Science, 14(2):179-211, 1990.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338, June 2010.
[12] H. Fang, S. Gupta, F. Iandola, R. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. Platt, et al. From captions to visual concepts and back. arXiv preprint arXiv:1411.4952, 2014.
[13] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, 2010.
[14] L. Fei-Fei, A. Iyer, C. Koch, and P. Perona. What do we perceive in a glance of a real-world scene? Journal of Vision, 7(1):10, 2007.
[15] S. Fidler, A. Sharma, and R. Urtasun. A sentence is worth a thousand pixels. In CVPR, 2013.
[16] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[18] S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent regions. In ICCV, pages 1-8. IEEE, 2009.
[19] A. Gupta and P. Mannem. From image annotation to image description. In Neural Information Processing. Springer, 2012.
[20] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[21] M. Hodosh, P. Young, and J. Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 2013.
[22] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global vectors for word representation. In EMNLP, 2014.
[23] Y. Jia, M. Salzmann, and T. Darrell. Learning cross-modality similarity for multinomial data. In ICCV, 2011.
[24] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. arXiv preprint arXiv:1406.5679, 2014.
[25] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
[26] R. Kiros, R. S. Zemel, and R. Salakhutdinov. Multimodal neural language models. In ICML, 2014.
[27] C. Kong, D. Lin, M. Bansal, R. Urtasun, and S. Fidler. What are you talking about? Text-to-image coreference. In CVPR, 2014.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[29] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating simple image descriptions. In CVPR, 2011.
[30] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi. Collective generation of natural image descriptions. In ACL, 2012.
[31] P. Kuznetsova, V. Ordonez, T. L. Berg, and Y. Choi. TreeTalk: Composition and compression of trees for image descriptions. Transactions of the Association for Computational Linguistics, 2(10):351-362, 2014.
[32] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[33] L.-J. Li and L. Fei-Fei. What, where and who? Classifying events by scene and object recognition. In ICCV, 2007.
[34] L.-J. Li, R. Socher, and L. Fei-Fei. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In CVPR, pages 2036-2043. IEEE, 2009.
[35] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi. Composing simple image descriptions using web-scale n-grams. In CoNLL, 2011.
[36] D. Lin, S. Fidler, C. Kong, and R. Urtasun. Visual semantic search: Retrieving videos via complex textual queries. 2014.
[37] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick. Microsoft COCO: Common objects in context. arXiv preprint arXiv:1405.0312, 2014.
[38] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. arXiv preprint arXiv:1410.1090, 2014.
[39] C. Matuszek*, N. FitzGerald*, L. Zettlemoyer, L. Bo, and D. Fox. A joint model of language and perception for grounded attribute learning. In Proc. of the 2012 International Conference on Machine Learning, Edinburgh, Scotland, June 2012.
[40] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, 2010.
[41] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[42] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and H. Daume III. Midge: Generating image descriptions from computer vision detections. In EACL, 2012.
[43] V. Ordonez, G. Kulkarni, and T. L. Berg. Im2Text: Describing images using 1 million captioned photographs. In NIPS, 2011.
[44] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics, 2002.
[45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge, 2014.
[46] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 1997.
[47] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[48] R. Socher and L. Fei-Fei. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In CVPR, 2010.
[49] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng. Grounded compositional semantics for finding and describing images with sentences. TACL, 2014.
[50] I. Sutskever, J. Martens, and G. E. Hinton. Generating text with recurrent neural networks. In ICML, 2011.
[51] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
[52] T. Tieleman and G. E. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, 2012.
[53] R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. CoRR, abs/1411.5726, 2014.
[54] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. arXiv preprint arXiv:1411.4555, 2014.
[55] Y. Yang, C. L. Teo, H. Daume III, and Y. Aloimonos. Corpus-guided sentence generation of natural images. In EMNLP, 2011.
[56] B. Z. Yao, X. Yang, L. Lin, M. W. Lee, and S.-C. Zhu. I2T: Image parsing to text description. Proceedings of the IEEE, 98(8):1485-1508, 2010.
[57] M. Yatskar, L. Vanderwende, and L. Zettlemoyer. See no evil, say no evil: Description generation from densely labeled images. Lexical and Computational Semantics, 2014.
[58] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2014.
[59] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
[60] C. L. Zitnick, D. Parikh, and L. Vanderwende. Learning the visual interpretation of sentences. ICCV, 2013.
6. Supplementary Material
Figure 9. Examples of highest scoring regions for queried snippets of text ("glass of wine", "red bus", "yellow bus", "closeup of zebra", "sprinkled donut", "wooden chair", "shiny laptop"), on 5,000 images of our MSCOCO test set.
Figure 10. Examples of highest scoring regions for queried snippets of text ("bird flying in the sky", "closeup of fruit", "bowl of fruit", "straw hat"), on 5,000 images of our MSCOCO test set.
Figure 12. Additional examples of captions on the level of full images. Green: Human ground truth. Red: Top-scoring sentence from
training set. Blue: Generated sentence.
Figure 13. Additional examples of region captions on the test set of Flickr30K.