Rich Image Captioning in the Wild

Microsoft Research
{ktran,xiaohe}@microsoft.com∗
∗ Corresponding authors

arXiv:1603.09016v2 [cs.CV] 31 Mar 2016
1. Introduction
Image captioning is a fundamental task in Artificial Intelligence which describes objects, attributes, and relationships in an image in natural language form. It has many applications, such as semantic image search, bringing visual intelligence to chatbots, or helping visually impaired people to see the world around them. Recently, image captioning has received much interest from the research community (see [23, 24, 25, 6, 7, 12, 10]).

Figure 1: Rich captions enabled by entity recognition. Example output: “A small boat in Ha-Long Bay.”

The leading approaches can be categorized into two streams. One stream takes an end-to-end, encoder-decoder framework adopted from machine translation. For instance, [23] used a CNN to extract high-level image features and then fed them into an LSTM to generate captions. [24] went one step further by introducing an attention mechanism. The other stream applies a compositional framework. For example, [7] divided caption generation into several parts: word detection by a CNN, caption candidate generation by a maximum entropy model, and sentence re-ranking by a deep multimodal semantic model.

However, while significant progress has been reported [25, 23, 6, 7], most systems in the literature are evaluated on academic benchmarks, where the experiments are based on test images collected under a controlled environment with a distribution similar to the training examples. It is unclear how these systems perform on open-domain images.

Furthermore, most image captioning systems only describe generic visual content without identifying key entities. Entities such as celebrities and landmarks are important pieces of our common sense and knowledge. In many situations (e.g., Figure 1), the entities are the key information in an image.
Figure 2: Illustration of our image caption pipeline.

In addition, most of the literature reports results in automatic metrics such as BLEU [18], METEOR [1], and CIDEr [22]. Although these metrics are handy for fast development and tuning, there exists a substantial discrepancy between these metrics and human judgment [5, 14, 4]. Their correlation with human judgment could be even weaker when evaluating captions with entity information integrated.

In this paper, we present a captioning system for open-domain images. We take a compositional approach, starting from one of the state-of-the-art image captioning frameworks [7]. To address the challenges of describing images in the wild, we enriched the visual model by detecting a broader range of visual concepts and by recognizing celebrities and landmarks for caption generation (see examples in Figure 1). Further, in order to handle gracefully those images that are difficult to describe, we built a confidence model that estimates a confidence score for the caption output based on the vision and text features, and we provide a back-off caption for these difficult cases. We also developed an efficient engine that integrates these components and generates a caption within one second end-to-end on a 4-core CPU.

In order to measure the quality of the captions from the human perspective, we carried out a series of human evaluations through crowdsourcing, and we report results based on human judgments. Our experimental results show that the proposed system outperforms a previous state-of-the-art system [7] significantly on both an in-domain dataset (MS COCO [15]) and out-of-domain datasets (Adobe-MIT FiveK [3] and a dataset consisting of randomly sampled images from Instagram1). Notably, we improved the human satisfaction rate by 94.9% relative on the most challenging Instagram dataset.

1 Instagram data: https://gist.github.com/Anonymous

2. Model architecture

Following Fang et al. [7], we decomposed the image captioning system into independent components, which are trained separately and integrated in the main pipeline. The main components include
• a deep residual network-based vision model that detects a broad range of visual concepts,
• a language model for candidate generation and a deep multimodal semantic model for caption ranking,
• an entity recognition model that identifies celebrities and landmarks,
• and a classifier for estimating a confidence score for each output caption.
Figure 2 gives an overview of our image captioning system.

2.1. Vision model using deep residual network

Deep residual networks (ResNets) [11] consist of many stacked “Residual Units”. Each residual unit (Fig. 3) can be expressed in a general form:

y_l = h(x_l) + F(x_l, W_l),
x_{l+1} = f(y_l),

where x_l and x_{l+1} are the input and output of the l-th unit, and F is a residual function. In [11], h(x_l) = x_l is an identity mapping and f is a ReLU [17] function. ResNets that are over 100 layers deep have shown state-of-the-art accuracy on several challenging recognition tasks in the ImageNet [19] and MS COCO [16] competitions. The central idea of ResNets is to learn the additive residual function F with respect to h(x_l), with the key choice of using an identity mapping h(x_l) = x_l. This is realized by attaching an identity skip connection (“shortcut”).
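To make the residual formulation concrete, below is a minimal PyTorch-style sketch of a single residual unit with an identity shortcut and f set to a ReLU. The two-convolution block with batch normalization is an illustrative choice of F, not claimed to be the exact block configuration of our 50-layer network.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """One residual unit: x_{l+1} = ReLU(x_l + F(x_l, W_l))."""

    def __init__(self, channels):
        super().__init__()
        # F(x_l, W_l): two conv-BN stages with a ReLU in between.
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # h(x_l) = x_l is the identity skip connection; f is a ReLU.
        return self.relu(x + self.residual(x))

# Usage: a 256-channel unit applied to a 14x14 feature map.
unit = ResidualUnit(256)
out = unit(torch.randn(1, 256, 14, 14))   # shape preserved: (1, 256, 14, 14)
```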
Training. In order to address the open-domain challenge, we trained two classifiers. The first classifier was trained on the MS COCO training data for 700 visual concepts, and the second one was trained on an image set crawled from commercial image search engines for 1.5K visual objects. Training started from a 50-layer ResNet pre-trained on the ImageNet 1K benchmark. To handle multi-label classification, we use a sigmoid output layer without softmax normalization.
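As a sketch of that multi-label setup, the snippet below swaps the 1000-way ImageNet head of a 50-layer ResNet for one independent sigmoid per visual concept, trained with binary cross-entropy. The concept count, optimizer settings, and the torchvision weight-loading call (which assumes a recent torchvision) are illustrative, not the authors' exact recipe.

```python
import torch
import torch.nn as nn
import torchvision

NUM_CONCEPTS = 1500   # e.g., the 1.5K visual objects mined from image search

# 50-layer ResNet pre-trained on ImageNet-1K, with a new multi-label head.
model = torchvision.models.resnet50(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, NUM_CONCEPTS)

criterion = nn.BCEWithLogitsLoss()   # one sigmoid per concept, no softmax
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def train_step(images, targets):
    """images: (N, 3, H, W); targets: (N, NUM_CONCEPTS) multi-hot labels."""
    logits = model(images)              # raw per-concept scores
    loss = criterion(logits, targets)   # independent binary losses
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```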
Figure 3: A residual unit. Here x_l / x_{l+1} is the input/output feature of the l-th Residual Unit. Weight, BN, and ReLU denote linear convolution, batch normalization [9], and Rectified Linear Unit [17] layers, respectively.

Testing. To make testing efficient, we apply all convolution layers to the input image once to get a feature map (typically non-square), and then apply the average pooling and sigmoid output layers. Not only does our network provide more accurate predictions than VGG [21], which is used in many caption systems [7, 24, 12], it is also an order of magnitude faster: the typical runtime of our ResNet is 200 ms on a desktop CPU (single core only).
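The fully convolutional test-time pass can be sketched as follows: run the convolutional trunk once over the whole image to obtain a (possibly non-square) feature map, average it over space, and apply the per-concept sigmoid head. The code reuses the hypothetical model from the training sketch above; it is a simplified illustration rather than our production inference path.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def predict_concepts(model, image):
    """image: (1, 3, H, W) tensor of arbitrary size (no fixed-size crop)."""
    model.eval()
    # All convolutional layers applied once over the full image.
    trunk = nn.Sequential(*list(model.children())[:-2])   # drop avgpool and fc
    fmap = trunk(image)                  # (1, C, h, w); h and w need not match
    # Global average pooling over the non-square feature map.
    pooled = fmap.mean(dim=(2, 3))       # (1, C)
    # Per-concept sigmoid scores, with no softmax normalization.
    return torch.sigmoid(model.fc(pooled))   # (1, NUM_CONCEPTS)
```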
2.2. Language and semantic ranking model

Unlike many recent works [23, 24, 12] that use an LSTM/GRU (a so-called gated recurrent neural network, or GRNN) for caption generation, we follow [7] and use a maximum entropy language model (MELM) together with a deep multimodal similarity model (DMSM) in our caption pipeline. While the MELM does not perform as well as a GRNN in terms of perplexity, this disadvantage is remedied by the DMSM. Devlin et al. [5] show that while MELM+DMSM gives the same BLEU score as a GRNN, it performs significantly better than the GRNN in terms of human judgment. The results from the MS COCO 2015 captioning challenge2 also show that the MELM+DMSM-based entry [7] gives top performance in the official human judgment, tying with another entry that uses an LSTM.

2 http://mscoco.org/dataset/#captions-leaderboard

Figure 4: Illustration of the deep multimodal semantic model.

In the MELM+DMSM-based framework, the MELM is used together with beam search as a candidate caption generator. Similar to the text-only deep structured semantic model (DSSM) [8, 20], the DMSM, illustrated in Figure 4, consists of a pair of neural networks, one for mapping each input modality to a common semantic space. These two neural networks are trained jointly [7]. In training, the data consists of a set of image/caption pairs, and the loss function minimized during training represents the negative log posterior probability of the caption given the corresponding image. The image model reuses the last pooling layer of the word detection model, as described in Section 2.1, as its feature vector, and stacks one more fully-connected layer with Tanh non-linearity on top of this representation to obtain a final representation of the same size as the last layer of the text model. We learn the parameters of this additional layer during DMSM training. The text model is based on a one-dimensional convolutional neural network similar to [20]. The DMSM similarity score is used as the main signal for ranking the candidate captions, together with other signals including the language model score, the caption length, the number of detected words covered in the caption, etc.
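A compact sketch of the DMSM training objective is given below: image and caption vectors are compared by cosine similarity, and the negative log posterior of the matching caption is computed with a softmax over negative captions. The smoothing factor gamma and the use of in-batch negatives are simplifying assumptions; [7] describes the exact sampling scheme. At ranking time, the same cosine similarity is combined with the language model score, caption length, and word-coverage signals.

```python
import torch
import torch.nn.functional as F

def dmsm_loss(img_vecs, txt_vecs, gamma=10.0):
    """img_vecs, txt_vecs: (N, 1000) global vision/text vectors, where row i
    of each matrix comes from the i-th matching image/caption pair."""
    img = F.normalize(img_vecs, dim=1)
    txt = F.normalize(txt_vecs, dim=1)
    # Cosine similarity between every image and every caption in the batch.
    sim = gamma * img @ txt.t()          # (N, N), scaled for a sharper softmax
    # Negative log posterior of the matching caption given the image,
    # treating the other captions in the batch as negatives.
    targets = torch.arange(img.size(0), device=sim.device)
    return F.cross_entropy(sim, targets)
```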
In our system, the dimension of the global vision vector and of the global text vector is set to 1000. The MELM and the DMSM are both trained on the MS COCO dataset [15]. Similar to [8], character-level word hashing is used to reduce the dimension of the vocabulary.
2.3. Celebrity and landmark recognition

The breakthrough in deep learning makes it possible to recognize visual entities such as celebrities and landmarks and to link the recognition results to a knowledge base such as Freebase [2]. We believe that providing entity-level recognition results in image captions brings valuable information to end users.

The key challenge in developing a good entity recognition model with wide coverage is collecting high-quality training data. To address this problem, we followed and generalized the idea presented in [26], which leverages duplicate image detection and name-list matching to collect celebrity images. In particular, we ground the entity recognition problem in a knowledge base, which brings several advantages. First, each entity in a knowledge base is unique and clearly defined without ambiguity, making it possible to develop a large-scale entity recognition system. Second, each entity normally has multiple properties (e.g., gender and occupation for people, location and longitude/latitude for landmarks), providing rich and valuable information for data collection, cleaning, multi-task learning, and image description.

We started with a text-based approach similar to [26], but using entities that are catalogued in the knowledge base rather than celebrity names for high-precision image and entity matching. To further enlarge the coverage, we also scrape commercial image search engines for more entities and check the consistency of faces in the search results, removing outliers or discarding entities with too many outlier faces. After these two stages, we ended up with a large-scale face image dataset for a large set of celebrities.

Figure 5: Illustration of deep neural network-based large-scale celebrity recognition: a typical convolutional neural network (AlexNet, VGG, ResNet, etc.) maps a face image through convolutional + pooling layers and fully connected layers to an N-class prediction (e.g., “Chloë Grace Moretz”, …).

To recognize a large set of celebrities, we resorted to a deep convolutional neural network (CNN) to learn an extreme classification model, as shown in Figure 5. Training a network for a large set of classes is not a trivial task: it is hard to make the model converge, even after a long run, due to the large number of categories. To address this problem, we started by training a small model using AlexNet [13] for 500 celebrities, each of which has a sufficient number of face images. Then we used this pre-trained model to initialize the full model over the large set of celebrities. The whole training process follows the standard setting described in [13]. After training is finished, we use the final model to predict celebrities in images, setting a high threshold on the final softmax layer output to ensure a high-precision celebrity recognition rate.
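The high-precision prediction rule amounts to thresholding the softmax posterior: report a celebrity only when the top class clears a high confidence bar, and abstain otherwise. A minimal sketch follows; the 0.9 threshold is a placeholder, not the value used in our system.

```python
import torch
import torch.nn.functional as F

def recognize_celebrity(logits, class_names, threshold=0.9):
    """logits: (num_celebrities,) raw scores from the final layer.
    Returns (name, confidence) only when the softmax confidence is high,
    trading recall for precision; otherwise (None, confidence)."""
    probs = F.softmax(logits, dim=0)
    conf, idx = probs.max(dim=0)
    if conf.item() >= threshold:
        return class_names[idx.item()], conf.item()
    return None, conf.item()   # abstain on low-confidence predictions
```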
We applied a similar process for landmark recognition. One key difference is that it is not straightforward to identify a list of landmarks that are visually recognizable, although it is easy to get a list of landmarks or attractions from a knowledge base. This implies that data collection and visual model learning are two closely coupled problems. To address this challenge, we took an iterative approach. That is, we first collected a training set for about 10K landmarks selected from a knowledge base and trained a CNN model for those 10K landmarks. Then we leveraged a validation dataset to evaluate whether each landmark is visually recognizable, and removed from the training set those landmarks with very low prediction accuracy. After several iterations of data cleaning and visual model learning, we ended up with a model for about 5K landmarks.
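The iterative cleaning loop can be summarized as: train on the current landmark set, measure per-landmark validation accuracy, drop the landmarks that are not visually recognizable, and repeat. The sketch below uses hypothetical train_fn and eval_fn helpers and an arbitrary accuracy cutoff; it only captures the control flow, not our actual training infrastructure.

```python
def clean_landmark_set(landmarks, train_fn, eval_fn,
                       min_accuracy=0.3, num_rounds=3):
    """Iteratively drop landmarks the model cannot recognize visually.

    landmarks: initial list of ~10K landmark ids from the knowledge base
    train_fn:  landmarks -> trained model               (hypothetical helper)
    eval_fn:   (model, landmark) -> validation accuracy (hypothetical helper)
    """
    current = list(landmarks)
    for _ in range(num_rounds):
        model = train_fn(current)
        # Keep only the landmarks that are visually recognizable.
        current = [lm for lm in current if eval_fn(model, lm) >= min_accuracy]
    return current, train_fn(current)   # ended with ~5K landmarks in our case
```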
2.4. Confidence estimation

We developed a logistic regression model to estimate a confidence score for the caption output. The input features include the DMSM's vision and caption vectors, each of size 1000, coupled with the language model score, the length of the caption, the length-normalized language model score, the logarithm of the number of tags covered in the caption, and the DMSM score.

The confidence model is trained on 2.5K image-caption pairs with human labels on the quality (Excellent, Good, Bad, Embarrassing). The images used in the training data are a mix of 750 COCO, 750 MIT, and 950 Instagram images in a held-out set.
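A minimal sketch of this confidence classifier: concatenate the two DMSM vectors with the scalar signals listed above and fit a logistic regression. The scikit-learn estimator and the binarization of the four quality labels into acceptable/not-acceptable are illustrative assumptions rather than the original training setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_features(vision_vec, caption_vec, lm_score, caption_len,
                   num_tags_covered, dmsm_score):
    """vision_vec, caption_vec: 1000-d DMSM vectors; the rest are scalars."""
    scalars = np.array([
        lm_score,
        caption_len,
        lm_score / max(caption_len, 1),     # length-normalized LM score
        np.log(max(num_tags_covered, 1)),   # log of the covered tag count
        dmsm_score,
    ])
    return np.concatenate([vision_vec, caption_vec, scalars])   # 2005-d

def fit_confidence_model(X, y):
    """X: (N, 2005) features; y: (N,) binary labels, e.g. Excellent/Good = 1,
    Bad/Embarrassing = 0 (an assumed binarization of the human ratings)."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    return model   # model.predict_proba(X)[:, 1] is the confidence score
```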
3. Evaluation

We conducted a series of human evaluation experiments through CrowdFlower, a crowdsourcing platform with good quality control3. The human evaluation experiments are set up such that, for each pair of image and generated caption, the caption is rated on a 4-point scale (Excellent, Good, Bad, or Embarrassing) by three different judges. In the evaluation, we specify for the judges that Excellent means the caption contains all of the important details presented in the picture; Good means the caption contains some but not all of the important details presented in the picture, and no errors; Bad means the caption may be misleading (e.g., it contains errors, or misses the gist of the image); and Embarrassing means the caption is totally wrong, or may upset the owner or subject of the image.

3 http://www.crowdflower.com/

In order to evaluate the captioning performance for images in the wild, we created a dataset from Instagram. Specifically, we collected 100 popular Instagram accounts on the web, and for each account we constructed a query with the account name plus “instagram”, e.g., “iamdiddy instagram”, to scrape the top 100 images from Bing image search. We finally obtained a dataset of about 10K images from Instagram with wide coverage of personal photos. About 12.5% of the images in this Instagram set contain entities that are recognizable by our entity recognition model (mostly celebrities). Meanwhile, we also report results on 1000 random samples of the COCO validation set and 1000 random samples of the MIT test set. Since the MELM and the DMSM are both trained on the COCO training set, the results on the COCO test set and the MIT test set represent the performance on in-domain images and out-of-domain images, respectively.

We communicated with the authors of Fang et al. [7], one of the two winners of the MS COCO 2015 Captioning Challenge, to obtain the caption output of our test images from their system. For our system, we evaluated three different settings: Basic, with no confidence thresholding and no entity recognition; Basic+Confi., with confidence thresholding but no entity recognition; and Full, with both confidence thresholding and entity recognition on. For Basic+Confi. and Full, whenever the confidence score is below 0.25, we use templates such as “this image is about ${top visual concept}”, or “a picture of ${entity}” if the entity recognizer fires, instead of the caption generated by the language model. The results are presented in Tables 1, 2, and 3. Since the COCO and MIT images were collected in a way that does not surface entities, we do not report Full in Tables 1 and 2.
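The back-off rule for the Basic+Confi. and Full settings reduces to a few lines of logic; the 0.25 threshold and the template strings follow the text, while the function signature and the example values are illustrative.

```python
CONFIDENCE_THRESHOLD = 0.25

def final_caption(generated_caption, confidence, top_visual_concept, entity=None):
    """Fall back to a template whenever the caption confidence is low."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return generated_caption
    if entity is not None:                 # entity recognizer fired (Full setting)
        return f"a picture of {entity}"
    return f"this image is about {top_visual_concept}"

# Example: a low-confidence caption for an image where the entity model fired.
print(final_caption("a man wearing a suit", 0.12, "person", entity="Ian Somerhalder"))
# -> "a picture of Ian Somerhalder"
```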
System               Excel   Good    Bad     Emb
Fang et al.          40.6%   26.8%   28.8%   3.8%
Ours (Basic)         51.4%   22.0%   23.6%   3.0%
Ours (Basic+Confi.)  51.8%   23.4%   22.5%   2.3%

Table 1: Human evaluation on 1K random samples of the COCO val-test set.

System               Excel   Good    Bad     Emb
Fang et al.          17.8%   18.5%   55.8%   7.9%
Ours (Basic)         23.9%   21.0%   49.0%   6.1%
Ours (Basic+Confi.)  28.2%   27.5%   39.3%   5.0%

Table 2: Human evaluation on 1K random samples of the MIT test set.

System               Excel   Good    Bad     Emb
Fang et al.          12.0%   13.4%   63.0%   11.6%
Ours (Basic)         15.1%   16.4%   60.0%   8.4%
Ours (Basic+Confi.)  23.3%   24.6%   47.0%   5.1%
Ours (Full)          25.4%   24.1%   45.3%   5.2%

Table 3: Human evaluation on the Instagram test set, which contains 1380 random images from the 10K Instagram images that we scraped.

As shown in the results, we have significantly improved the performance over a previous state-of-the-art system in terms of human evaluation. Specifically, the in-domain evaluation results reported in Table 1 show that, compared to the baseline by Fang et al., our Basic system reduces the combined Bad and Embarrassing rate by 6.0 percentage points. Moreover, our system improves the portion of captions rated as Excellent by more than 10 percentage points, mainly thanks to the deep residual network-based vision model, plus refinement of the parameters of the engine and other components. Integrating the confidence classifier into the system reduces the Bad and Embarrassing rates further.

The results on the out-of-domain MIT test set are presented in Table 2. We observed a similar degree of improvement from the new vision model. More interestingly, the confidence classifier helps significantly on this dataset: the Satisfaction rate, a combination of Excellent and Good, is further improved by more than 10 percentage points.

The Instagram data set contains many images that are filtered or handcrafted abstract pictures, which are difficult for the current caption system to process (see examples in Figure 6). In the Instagram domain, the results in Table 3 show that both the baseline and our Basic system perform quite poorly, scoring Satisfaction rates of 25.4% and 31.5%, respectively. However, by integrating the confidence classifier into the system, we improve the Satisfaction rate to 47.9%. The Satisfaction rate is further improved to 49.5% after integrating the entity recognition model, representing a 94.9% relative improvement over the baseline.
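For clarity, the Satisfaction rates and the 94.9% relative-improvement figure follow directly from Table 3; the short calculation below reproduces them.

```python
# Satisfaction = Excellent + Good, computed from Table 3 (Instagram test set).
baseline = 12.0 + 13.4        # Fang et al.          -> 25.4%
basic = 15.1 + 16.4           # Ours (Basic)         -> 31.5%
basic_confi = 23.3 + 24.6     # Ours (Basic+Confi.)  -> 47.9%
full = 25.4 + 24.1            # Ours (Full)          -> 49.5%

relative_improvement = (full - baseline) / baseline
print(f"{relative_improvement:.1%}")   # -> 94.9%
```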
In Figure 6, we show a set of images randomly sampled from the Instagram test set. For each image, we also show the captions generated by the baseline system (above, in green) and by our Full system (below, in blue), respectively.

Figure 6: Qualitative results for images randomly sampled from the Instagram test set, with the Fang et al. [7] caption in green (above) and our system's caption in blue (below) for each image.

We further investigated the distribution of confidence scores in each of the Excellent, Good, Bad, and Embarrassing categories on the Instagram test set using the Basic setting. The means and standard deviations are reported in Table 4. We observed that, in general, the confidence scores align well with the human judgments. Therefore, based on the confidence score, more sophisticated solutions could be developed to handle difficult images and achieve a better user experience.

        Excel   Good    Bad     Emb
mean    0.59    0.51    0.26    0.20
stdev   0.21    0.23    0.21    0.19

Table 4: Mean and standard deviation of the confidence scores in each category, measured on the Instagram test set under the Basic setting.

We also want to point out that integrating the entity in the caption greatly improves the user experience, which might not be fully reflected in the 4-point rating. For example, for the first image in the second row of Figure 6, the baseline gives the caption “a man wearing a suit and tie”, while our system produces “Ian Somerhalder wearing a suit and tie” thanks to the entity recognition model. Although both caption outputs are rated as Excellent, the latter provides much richer information than the baseline.

4. Conclusion

This paper presents a new state-of-the-art image caption system with respect to human evaluation. To encourage reproducibility and facilitate further research, we have deployed our system and made it publicly accessible.

5. Acknowledgments

The authors are grateful to Li Deng, Jacob Devlin, Delong Fu, Ryan Galgon, Jianfeng Gao, Yandong Guo, Ted Hart, Yuxiao Hu, Ece Kamar, Anirudh Koul, Allison Light, Margaret Mitchell, Yelong Shen, Lucy Vanderwende, and Geoffrey Zweig for valuable discussions.

References
[1] A. Agarwal and A. Lavie. Meteor, m-bleu and m-ter: Evaluation metrics for high-correlation with human rankings of machine translation output. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 115–118. Association for Computational Linguistics, 2008.
[2] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM, 2008.
[3] V. Bychkovsky, S. Paris, E. Chan, and F. Durand. Learning photographic global tonal adjustment with a database of input/output image pairs. In The Twenty-Fourth IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[4] C. Callison-Burch and M. Osborne. Re-evaluating the role of BLEU in machine translation research. In EACL, 2006.
[5] J. Devlin, H. Cheng, H. Fang, S. Gupta, L. Deng, X. He, G. Zweig, and M. Mitchell. Language models for image captioning: The quirks and what works. arXiv preprint arXiv:1505.01809, 2015.
[6] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
[7] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1473–1482, 2015.
[8] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In ACM International Conference on Information and Knowledge Management, 2013.
[9] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[10] J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. arXiv preprint arXiv:1511.07571, 2015.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[12] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137, 2015.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[14] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating image descriptions. In Proceedings of the 24th CVPR, 2011.
[15] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014, pages 740–755. Springer, 2014.
[16] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[17] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814, 2010.
[18] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
[19] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[20] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In ACM International Conference on Information and Knowledge Management, 2014.
[21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[22] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.
[23] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
[24] K. Xu, J. Ba, R. Kiros, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.
[25] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. arXiv preprint arXiv:1603.03925, 2016.
[26] X. Zhang, L. Zhang, X.-J. Wang, and H.-Y. Shum. Finding celebrities in billions of web images. IEEE Transactions on Multimedia, 14(4):995–1007, 2012.