Rich Image Captioning in the Wild

Microsoft Research
{ktran,xiaohe}@microsoft.com∗
∗ Corresponding authors

arXiv:1603.09016v2 [cs.CV] 31 Mar 2016
1. Introduction
Image captioning is a fundamental task in Artificial Intelligence which describes objects, attributes, and relationships in an image in natural language form. It has many applications, such as semantic image search, bringing visual intelligence to chatbots, or helping visually impaired people to see the world around them. Recently, image captioning has received much interest from the research community (see [23, 24, 25, 6, 7, 12, 10]).

Figure 1: Rich captions enabled by entity recognition. Example output: “A small boat in Ha-Long Bay.”

The leading approaches can be categorized into two streams. One stream takes an end-to-end, encoder-decoder framework adopted from machine translation. For instance, [23] used a CNN to extract high-level image features and then fed them into an LSTM to generate captions. [24] went one step further by introducing an attention mechanism. The other stream applies a compositional framework. For example, [7] divided caption generation into several parts: word detection by a CNN, caption candidate generation by a maximum entropy model, and sentence re-ranking by a deep multimodal semantic model.

However, while significant progress has been reported [25, 23, 6, 7], most systems in the literature are evaluated on academic benchmarks, where the experiments are based on test images collected under a controlled environment with a distribution similar to the training examples. It is unclear how these systems perform on open-domain images.

Furthermore, most image captioning systems only describe generic visual content without identifying key entities. Entities such as celebrities and landmarks are important pieces of our common sense and knowledge. In many situations (e.g., Figure 1), the entities are the key information in an image.
Figure 2: Illustration of our image caption pipeline.

In addition, most of the literature reports results in automatic metrics such as BLEU [18], METEOR [1], and CIDEr [22]. Although these metrics are handy for fast development and tuning, there exists a substantial discrepancy between these metrics and human judgment [5, 14, 4]. Their correlation with human judgment could be even weaker when evaluating captions with entity information integrated.

In this paper, we present a captioning system for open-domain images. We take a compositional approach, starting from one of the state-of-the-art image captioning frameworks [7]. To address the challenges of describing images in the wild, we enriched the visual model by detecting a broader range of visual concepts and by recognizing celebrities and landmarks for caption generation (see examples in Figure 1). Further, in order to handle gracefully those images that are difficult to describe, we built a confidence model that estimates a confidence score for the caption output based on the vision and text features, and we provide a back-off caption for these difficult cases. We also developed an efficient engine that integrates these components and generates a caption within one second end-to-end on a 4-core CPU.

In order to measure the quality of the captions from the human perspective, we carried out a series of human evaluations through crowdsourcing, and we report results based on human judgments. Our experimental results show that the proposed system outperforms a previous state-of-the-art system [7] significantly on both an in-domain dataset (MS COCO [15]) and out-of-domain datasets (Adobe-MIT FiveK [3] and a dataset consisting of randomly sampled images from Instagram1). Notably, we improved the human satisfaction rate by 94.9% relative on the most challenging Instagram dataset.

1 Instagram data: https://gist.github.com/Anonymous

2. Model architecture

Following Fang et al. [7], we decomposed the image captioning system into independent components, which are trained separately and integrated in the main pipeline. The main components include
• a deep residual network-based vision model that detects a broad range of visual concepts,
• a language model for candidate generation and a deep multimodal semantic model for caption ranking,
• an entity recognition model that identifies celebrities and landmarks,
• and a classifier for estimating a confidence score for each output caption.
Figure 2 gives an overview of our image captioning system.

2.1. Vision model using deep residual network

Deep residual networks (ResNets) [11] consist of many stacked “Residual Units”. Each residual unit (Fig. 3) can be expressed in a general form:

y_l = h(x_l) + F(x_l, W_l),
x_{l+1} = f(y_l),

where x_l and x_{l+1} are the input and output of the l-th unit, and F is a residual function. In [11], h(x_l) = x_l is an identity mapping and f is a ReLU [17] function. ResNets that are over 100 layers deep have shown state-of-the-art accuracy on several challenging recognition tasks in the ImageNet [19] and MS COCO [16] competitions. The central idea of ResNets is to learn the additive residual function F with respect to h(x_l), with the key choice of using an identity mapping h(x_l) = x_l. This is realized by attaching an identity skip connection (“shortcut”).
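To make the residual formulation concrete, below is a minimal PyTorch-style sketch of a single residual unit with an identity shortcut and f set to a ReLU. The two-convolution block with batch normalization is an illustrative choice of F, not claimed to be the exact block configuration of our 50-layer network.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """One residual unit: x_{l+1} = ReLU(x_l + F(x_l, W_l))."""

    def __init__(self, channels):
        super().__init__()
        # F(x_l, W_l): two conv-BN stages with a ReLU in between.
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # h(x_l) = x_l is the identity skip connection; f is a ReLU.
        return self.relu(x + self.residual(x))

# Usage: a 256-channel unit applied to a 14x14 feature map.
unit = ResidualUnit(256)
out = unit(torch.randn(1, 256, 14, 14))   # shape preserved: (1, 256, 14, 14)
```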
Training. In order to address the open-domain challenge, we trained two classifiers. The first classifier was trained on the MS COCO training data for 700 visual concepts, and the second one was trained on an image set crawled from commercial image search engines for 1.5K visual objects. Training started from a 50-layer ResNet pre-trained on the ImageNet 1K benchmark. To handle multi-label classification, we use a sigmoid output layer without softmax normalization.
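As a sketch of that multi-label setup, the snippet below swaps the 1000-way ImageNet head of a 50-layer ResNet for one independent sigmoid per visual concept, trained with binary cross-entropy. The concept count, optimizer settings, and the torchvision weight-loading call (which assumes a recent torchvision) are illustrative, not the authors' exact recipe.

```python
import torch
import torch.nn as nn
import torchvision

NUM_CONCEPTS = 1500   # e.g., the 1.5K visual objects mined from image search

# 50-layer ResNet pre-trained on ImageNet-1K, with a new multi-label head.
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, NUM_CONCEPTS)

criterion = nn.BCEWithLogitsLoss()   # one sigmoid per concept, no softmax
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def train_step(images, targets):
    """images: (N, 3, H, W); targets: (N, NUM_CONCEPTS) multi-hot labels."""
    logits = model(images)              # raw per-concept scores
    loss = criterion(logits, targets)   # independent binary losses
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```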
Figure 3: A residual unit. Here x_l / x_{l+1} is the input/output feature of the l-th Residual Unit. Weight, BN, and ReLU denote linear convolution, batch normalization [9], and Rectified Linear Unit [17] layers, respectively.

Testing. To make testing efficient, we apply all convolution layers to the input image once to get a feature map (typically non-square), and then apply the average pooling and sigmoid output layers. Not only does our network provide more accurate predictions than VGG [21], which is used in many caption systems [7, 24, 12], it is also an order of magnitude faster: the typical runtime of our ResNet is 200 ms on a desktop CPU (single core only).
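The fully convolutional test-time pass can be sketched as follows: run the convolutional trunk once over the whole image to obtain a (possibly non-square) feature map, average it over space, and apply the per-concept sigmoid head. The code reuses the hypothetical model from the training sketch above; it is a simplified illustration rather than our production inference path.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def predict_concepts(model, image):
    """image: (1, 3, H, W) tensor of arbitrary size (no fixed-size crop)."""
    model.eval()
    # All convolutional layers applied once over the full image.
    trunk = nn.Sequential(*list(model.children())[:-2])   # drop avgpool and fc
    fmap = trunk(image)                  # (1, C, h, w); h and w need not match
    # Global average pooling over the non-square feature map.
    pooled = fmap.mean(dim=(2, 3))       # (1, C)
    # Per-concept sigmoid scores, with no softmax normalization.
    return torch.sigmoid(model.fc(pooled))   # (1, NUM_CONCEPTS)
```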
2.2. Language and semantic ranking model

Unlike many recent works [23, 24, 12] that use an LSTM/GRU (a so-called gated recurrent neural network, or GRNN) for caption generation, we follow [7] and use a maximum entropy language model (MELM) together with a deep multimodal similarity model (DMSM) in our caption pipeline. While the MELM does not perform as well as a GRNN in terms of perplexity, this disadvantage is remedied by the DMSM. Devlin et al. [5] show that while MELM+DMSM gives the same BLEU score as a GRNN, it performs significantly better than the GRNN in terms of human judgment. The results from the MS COCO 2015 captioning challenge2 also show that the MELM+DMSM-based entry [7] gives top performance in the official human judgment, tying with another entry that uses an LSTM.

2 http://mscoco.org/dataset/#captions-leaderboard

Figure 4: Illustration of the deep multimodal semantic model.

In the MELM+DMSM-based framework, the MELM is used together with beam search as a candidate caption generator. Similar to the text-only deep structured semantic model (DSSM) [8, 20], the DMSM, illustrated in Figure 4, consists of a pair of neural networks, one for mapping each input modality to a common semantic space. These two neural networks are trained jointly [7]. In training, the data consists of a set of image/caption pairs, and the loss function minimized during training represents the negative log posterior probability of the caption given the corresponding image. The image model reuses the last pooling layer of the word detection model, as described in Section 2.1, as its feature vector, and stacks one more fully-connected layer with Tanh non-linearity on top of this representation to obtain a final representation of the same size as the last layer of the text model. We learn the parameters of this additional layer during DMSM training. The text model is based on a one-dimensional convolutional neural network similar to [20]. The DMSM similarity score is used as the main signal for ranking the candidate captions, together with other signals including the language model score, the caption length, the number of detected words covered in the caption, etc.
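A compact sketch of the DMSM training objective is given below: image and caption vectors are compared by cosine similarity, and the negative log posterior of the matching caption is computed with a softmax over negative captions. The smoothing factor gamma and the use of in-batch negatives are simplifying assumptions; [7] describes the exact sampling scheme. At ranking time, the same cosine similarity is combined with the language model score, caption length, and word-coverage signals.

```python
import torch
import torch.nn.functional as F

def dmsm_loss(img_vecs, txt_vecs, gamma=10.0):
    """img_vecs, txt_vecs: (N, 1000) global vision/text vectors, where row i
    of each matrix comes from the i-th matching image/caption pair."""
    img = F.normalize(img_vecs, dim=1)
    txt = F.normalize(txt_vecs, dim=1)
    # Cosine similarity between every image and every caption in the batch.
    sim = gamma * img @ txt.t()          # (N, N), scaled for a sharper softmax
    # Negative log posterior of the matching caption given the image,
    # treating the other captions in the batch as negatives.
    targets = torch.arange(img.size(0), device=sim.device)
    return F.cross_entropy(sim, targets)
```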
In our system, the dimension of the global vision vector and of the global text vector is set to 1000. The MELM and the DMSM are both trained on the MS COCO dataset [15]. Similar to [8], character-level word hashing is used to reduce the dimension of the vocabulary.
2.3. Celebrity and landmark recognition

The breakthrough in deep learning makes it possible to recognize visual entities such as celebrities and landmarks and to link the recognition results to a knowledge base such as Freebase [2]. We believe that providing entity-level recognition results in image captions brings valuable information to end users.

The key challenge in developing a good entity recognition model with wide coverage is collecting high-quality training data. To address this problem, we followed and generalized the idea presented in [26], which leverages duplicate image detection and name-list matching to collect celebrity images. In particular, we ground the entity recognition problem in a knowledge base, which brings several advantages. First, each entity in a knowledge base is unique and clearly defined without ambiguity, making it possible to develop a large-scale entity recognition system. Second, each entity normally has multiple properties (e.g., gender and occupation for people, location and longitude/latitude for landmarks), providing rich and valuable information for data collection, cleaning, multi-task learning, and image description.

We started with a text-based approach similar to [26], but using entities that are catalogued in the knowledge base rather than celebrity names for high-precision image and entity matching. To further enlarge the coverage, we also scrape commercial image search engines for more entities and check the consistency of faces in the search results, removing outliers or discarding entities with too many outlier faces. After these two stages, we ended up with a large-scale face image dataset for a large set of celebrities.

Figure 5: Illustration of deep neural network-based large-scale celebrity recognition: a typical convolutional neural network (AlexNet, VGG, ResNet, etc.) maps a face image through convolutional + pooling layers and fully connected layers to an N-class prediction (e.g., “Chloë Grace Moretz”, …).

To recognize a large set of celebrities, we resorted to a deep convolutional neural network (CNN) to learn an extreme classification model, as shown in Figure 5. Training a network for a large set of classes is not a trivial task: it is hard to make the model converge, even after a long run, due to the large number of categories. To address this problem, we started by training a small model using AlexNet [13] for 500 celebrities, each of which has a sufficient number of face images. Then we used this pre-trained model to initialize the full model over the large set of celebrities. The whole training process follows the standard setting described in [13]. After training is finished, we use the final model to predict celebrities in images, setting a high threshold on the final softmax layer output to ensure a high-precision celebrity recognition rate.
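The high-precision prediction rule amounts to thresholding the softmax posterior: report a celebrity only when the top class clears a high confidence bar, and abstain otherwise. A minimal sketch follows; the 0.9 threshold is a placeholder, not the value used in our system.

```python
import torch
import torch.nn.functional as F

def recognize_celebrity(logits, class_names, threshold=0.9):
    """logits: (num_celebrities,) raw scores from the final layer.
    Returns (name, confidence) only when the softmax confidence is high,
    trading recall for precision; otherwise (None, confidence)."""
    probs = F.softmax(logits, dim=0)
    conf, idx = probs.max(dim=0)
    if conf.item() >= threshold:
        return class_names[idx.item()], conf.item()
    return None, conf.item()   # abstain on low-confidence predictions
```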
We applied a similar process for landmark recognition. One key difference is that it is not straightforward to identify a list of landmarks that are visually recognizable, although it is easy to get a list of landmarks or attractions from a knowledge base. This implies that data collection and visual model learning are two closely coupled problems. To address this challenge, we took an iterative approach. That is, we first collected a training set for about 10K landmarks selected from a knowledge base and trained a CNN model for those 10K landmarks. Then we leveraged a validation dataset to evaluate whether each landmark is visually recognizable, and removed from the training set those landmarks with very low prediction accuracy. After several iterations of data cleaning and visual model learning, we ended up with a model for about 5K landmarks.
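The iterative cleaning loop can be summarized as: train on the current landmark set, measure per-landmark validation accuracy, drop the landmarks that are not visually recognizable, and repeat. The sketch below uses hypothetical train_fn and eval_fn helpers and an arbitrary accuracy cutoff; it only captures the control flow, not our actual training infrastructure.

```python
def clean_landmark_set(landmarks, train_fn, eval_fn,
                       min_accuracy=0.3, num_rounds=3):
    """Iteratively drop landmarks the model cannot recognize visually.

    landmarks: initial list of ~10K landmark ids from the knowledge base
    train_fn:  landmarks -> trained model               (hypothetical helper)
    eval_fn:   (model, landmark) -> validation accuracy (hypothetical helper)
    """
    current = list(landmarks)
    for _ in range(num_rounds):
        model = train_fn(current)
        # Keep only the landmarks that are visually recognizable.
        current = [lm for lm in current if eval_fn(model, lm) >= min_accuracy]
    return current, train_fn(current)   # ended with ~5K landmarks in our case
```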
2.4. Confidence estimation

We developed a logistic regression model to estimate a confidence score for the caption output. The input features include the DMSM's vision and caption vectors, each of size 1000, coupled with the language model score, the length of the caption, the length-normalized language model score, the logarithm of the number of tags covered in the caption, and the DMSM score.

The confidence model is trained on 2.5K image-caption pairs with human labels on the quality (Excellent, Good, Bad, Embarrassing). The images used in the training data are a mix of 750 COCO, 750 MIT, and 950 Instagram images in a held-out set.
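A minimal sketch of this confidence classifier: concatenate the two DMSM vectors with the scalar signals listed above and fit a logistic regression. The scikit-learn estimator and the binarization of the four quality labels into acceptable/not-acceptable are illustrative assumptions rather than the original training setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_features(vision_vec, caption_vec, lm_score, caption_len,
                   num_tags_covered, dmsm_score):
    """vision_vec, caption_vec: 1000-d DMSM vectors; the rest are scalars."""
    scalars = np.array([
        lm_score,
        caption_len,
        lm_score / max(caption_len, 1),     # length-normalized LM score
        np.log(max(num_tags_covered, 1)),   # log of the covered tag count
        dmsm_score,
    ])
    return np.concatenate([vision_vec, caption_vec, scalars])   # 2005-d

def fit_confidence_model(X, y):
    """X: (N, 2005) features; y: (N,) binary labels, e.g. Excellent/Good = 1,
    Bad/Embarrassing = 0 (an assumed binarization of the human ratings)."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    return model   # model.predict_proba(X)[:, 1] is the confidence score
```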
3. Evaluation

We conducted a series of human evaluation experiments through CrowdFlower, a crowdsourcing platform with good quality control3. The human evaluation experiments are set up such that, for each pair of image and generated caption, the caption is rated on a 4-point scale (Excellent, Good, Bad, or Embarrassing) by three different judges. In the evaluation, we specify for the judges that Excellent means the caption contains all of the important details presented in the picture; Good means the caption contains some but not all of the important details presented in the picture, and no errors; Bad means the caption may be misleading (e.g., it contains errors, or misses the gist of the image); and Embarrassing means the caption is totally wrong, or may upset the owner or subject of the image.

3 http://www.crowdflower.com/

In order to evaluate the captioning performance for images in the wild, we created a dataset from Instagram. Specifically, we collected 100 popular Instagram accounts on the web, and for each account we constructed a query with the account name plus “instagram”, e.g., “iamdiddy instagram”, to scrape the top 100 images from Bing image search. We finally obtained a dataset of about 10K images from Instagram with wide coverage of personal photos. About 12.5% of the images in this Instagram set contain entities that are recognizable by our entity recognition model (mostly celebrities). Meanwhile, we also report results on 1000 random samples of the COCO validation set and 1000 random samples of the MIT test set. Since the MELM and the DMSM are both trained on the COCO training set, the results on the COCO test set and the MIT test set represent the performance on in-domain images and out-of-domain images, respectively.

We communicated with the authors of Fang et al. [7], one of the two winners of the MS COCO 2015 Captioning Challenge, to obtain the caption output of our test images from their system. For our system, we evaluated three different settings: Basic, with no confidence thresholding and no entity recognition; Basic+Confi., with confidence thresholding but no entity recognition; and Full, with both confidence thresholding and entity recognition on. For Basic+Confi. and Full, whenever the confidence score is below 0.25, we use templates such as “this image is about ${top visual concept}”, or “a picture of ${entity}” if the entity recognizer fires, instead of the caption generated by the language model. The results are presented in Tables 1, 2, and 3. Since the COCO and MIT images were collected in a way that does not surface entities, we do not report Full in Tables 1 and 2.
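The back-off rule for the Basic+Confi. and Full settings reduces to a few lines of logic; the 0.25 threshold and the template strings follow the text, while the function signature and the example values are illustrative.

```python
CONFIDENCE_THRESHOLD = 0.25

def final_caption(generated_caption, confidence, top_visual_concept, entity=None):
    """Fall back to a template whenever the caption confidence is low."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return generated_caption
    if entity is not None:                 # entity recognizer fired (Full setting)
        return f"a picture of {entity}"
    return f"this image is about {top_visual_concept}"

# Example: a low-confidence caption for an image where the entity model fired.
print(final_caption("a man wearing a suit", 0.12, "person", entity="Ian Somerhalder"))
# -> "a picture of Ian Somerhalder"
```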
System               Excel   Good    Bad     Emb
Fang et al.          40.6%   26.8%   28.8%   3.8%
Ours (Basic)         51.4%   22.0%   23.6%   3.0%
Ours (Basic+Confi.)  51.8%   23.4%   22.5%   2.3%

Table 1: Human evaluation on 1K random samples of the COCO val-test set.

System               Excel   Good    Bad     Emb
Fang et al.          17.8%   18.5%   55.8%   7.9%
Ours (Basic)         23.9%   21.0%   49.0%   6.1%
Ours (Basic+Confi.)  28.2%   27.5%   39.3%   5.0%

Table 2: Human evaluation on 1K random samples of the MIT test set.

System               Excel   Good    Bad     Emb
Fang et al.          12.0%   13.4%   63.0%   11.6%
Ours (Basic)         15.1%   16.4%   60.0%   8.4%
Ours (Basic+Confi.)  23.3%   24.6%   47.0%   5.1%
Ours (Full)          25.4%   24.1%   45.3%   5.2%

Table 3: Human evaluation on the Instagram test set, which contains 1380 random images from the 10K Instagram images that we scraped.

As shown in the results, we have significantly improved the performance over a previous state-of-the-art system in terms of human evaluation. Specifically, the in-domain evaluation results reported in Table 1 show that, compared to the baseline by Fang et al., our Basic system reduces the combined Bad and Embarrassing rate by 6.0 percentage points. Moreover, our system improves the portion of captions rated as Excellent by more than 10 percentage points, mainly thanks to the deep residual network-based vision model, plus refinement of the parameters of the engine and other components. Integrating the confidence classifier into the system reduces the Bad and Embarrassing rates further.

The results on the out-of-domain MIT test set are presented in Table 2. We observed a similar degree of improvement from the new vision model. More interestingly, the confidence classifier helps significantly on this dataset: the Satisfaction rate, a combination of Excellent and Good, is further improved by more than 10 percentage points.

The Instagram data set contains many images that are filtered or handcrafted abstract pictures, which are difficult for the current caption system to process (see examples in Figure 6). In the Instagram domain, the results in Table 3 show that both the baseline and our Basic system perform quite poorly, scoring Satisfaction rates of 25.4% and 31.5%, respectively. However, by integrating the confidence classifier into the system, we improve the Satisfaction rate to 47.9%. The Satisfaction rate is further improved to 49.5% after integrating the entity recognition model, representing a 94.9% relative improvement over the baseline.
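For clarity, the Satisfaction rates and the 94.9% relative-improvement figure follow directly from Table 3; the short calculation below reproduces them.

```python
# Satisfaction = Excellent + Good, computed from Table 3 (Instagram test set).
baseline = 12.0 + 13.4        # Fang et al.          -> 25.4%
basic = 15.1 + 16.4           # Ours (Basic)         -> 31.5%
basic_confi = 23.3 + 24.6     # Ours (Basic+Confi.)  -> 47.9%
full = 25.4 + 24.1            # Ours (Full)          -> 49.5%

relative_improvement = (full - baseline) / baseline
print(f"{relative_improvement:.1%}")   # -> 94.9%
```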
In Figure 6, we show a set of images randomly sampled from the Instagram test set. For each image, we also show the captions generated by the baseline system (above, in green) and by our Full system (below, in blue), respectively.

Figure 6: Qualitative results for images randomly sampled from the Instagram test set, with the Fang et al. [7] caption in green (above) and our system's caption in blue (below) for each image.

We further investigated the distribution of confidence scores in each of the Excellent, Good, Bad, and Embarrassing categories on the Instagram test set using the Basic setting. The means and standard deviations are reported in Table 4. We observed that, in general, the confidence scores align well with the human judgments. Therefore, based on the confidence score, more sophisticated solutions could be developed to handle difficult images and achieve a better user experience.

        Excel   Good    Bad     Emb
mean    0.59    0.51    0.26    0.20
stdev   0.21    0.23    0.21    0.19

Table 4: Mean and standard deviation of the confidence scores in each category, measured on the Instagram test set under the Basic setting.

We also want to point out that integrating the entity in the caption greatly improves the user experience, which might not be fully reflected in the 4-point rating. For example, for the first image in the second row of Figure 6, the baseline gives the caption “a man wearing a suit and tie”, while our system produces “Ian Somerhalder wearing a suit and tie” thanks to the entity recognition model. Although both caption outputs are rated as Excellent, the latter provides much richer information than the baseline.

4. Conclusion

This paper presents a new state-of-the-art image caption system with respect to human evaluation. To encourage reproducibility and facilitate further research, we have deployed our system and made it publicly accessible.

5. Acknowledgments

The authors are grateful to Li Deng, Jacob Devlin, Delong Fu, Ryan Galgon, Jianfeng Gao, Yandong Guo, Ted Hart, Yuxiao Hu, Ece Kamar, Anirudh Koul, Allison Light, Margaret Mitchell, Yelong Shen, Lucy Vanderwende, and Geoffrey Zweig for valuable discussions.

References
[1] A. Agarwal and A. Lavie. Meteor, m-bleu and m-ter: Evaluation metrics for high-correlation with human rankings of machine translation output. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 115–118. Association for Computational Linguistics, 2008.
[2] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM, 2008.
[3] V. Bychkovsky, S. Paris, E. Chan, and F. Durand. Learning photographic global tonal adjustment with a database of input/output image pairs. In The Twenty-Fourth IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[4] C. Callison-Burch and M. Osborne. Re-evaluating the role of BLEU in machine translation research. In EACL, 2006.
[5] J. Devlin, H. Cheng, H. Fang, S. Gupta, L. Deng, X. He, G. Zweig, and M. Mitchell. Language models for image captioning: The quirks and what works. arXiv preprint arXiv:1505.01809, 2015.
[6] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
[7] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1473–1482, 2015.
[8] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In ACM International Conference on Information and Knowledge Management, 2013.
[9] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[10] J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. arXiv preprint arXiv:1511.07571, 2015.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[12] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[14] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating image descriptions. In Proceedings of the 24th CVPR, 2011.
[15] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014, pages 740–755. Springer, 2014.
[16] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[17] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814, 2010.
[18] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
[19] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[20] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In ACM International Conference on Information and Knowledge Management, 2014.
[21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[22] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.
[23] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
[24] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.
[25] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. arXiv preprint arXiv:1603.03925, 2016.
[26] X. Zhang, L. Zhang, X.-J. Wang, and H.-Y. Shum. Finding celebrities in billions of web images. IEEE Transactions on Multimedia, 14(4):995–1007, 2012.