HUSE Multi Input Deep Learning Model
Pradyumna Narayana1, Aniket Pednekar1, Abishek Krishnamoorthy∗2, Kazoo Sone1, Sugato Basu1
1 Google, 2 Georgia Institute of Technology
{pradyn,aniketvp,sone,sugato}@google.com, [email protected]
1. A novel model architecture that uses a shared classification layer for image and text modalities to learn universal embeddings.
2. Incorporation of semantic information into a universal embedding space.

3. A novel semantic embedding method that doesn't involve projection onto a semantic embedding space.

4. State-of-the-art retrieval, hierarchical precision, and classification results on the UPMC Food-101 dataset.
2. Related Work
3.3. Semantic Graph

In the semantic graph, two semantically similar classes have a lower edge weight compared to two semantically different classes. This semantic graph is used to regularize the universal embedding space.

More formally, we define the semantic graph as G = (V, E), where V = {v_1, v_2, ..., v_K} represents the set of K classes and E represents the edges between any two classes. Let ψ(·) represent the function that extracts the embedding of a class name. The adjacency matrix A = {A_ij}, i, j = 1, ..., K, of graph G contains non-negative weights associated with each edge, such that

    A_ij = d(ψ(v_i), ψ(v_j)),        (4)

where d is the cosine distance specified in Section 3.1.
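To make Equation 4 concrete, here is a minimal sketch (not the authors' code) of how the adjacency matrix can be computed from pre-computed class-name embeddings. The embedding source (the paper uses the Universal Sentence Encoder, Section 4.1.2) and the array shapes below are illustrative assumptions.

```python
# Sketch: semantic-graph adjacency matrix of Eq. (4) from class-name embeddings psi(v_1..K).
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance d = 1 - cosine similarity between rows of a and b."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_norm @ b_norm.T

def semantic_adjacency(class_name_embeddings: np.ndarray) -> np.ndarray:
    """A[i, j] = d(psi(v_i), psi(v_j)); semantically similar classes get smaller edge weights."""
    return cosine_distance(class_name_embeddings, class_name_embeddings)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    psi = rng.normal(size=(101, 512))   # e.g. 101 class names, 512-d embeddings (placeholder sizes)
    A = semantic_adjacency(psi)
    print(A.shape, round(A[0, 0], 6))   # (101, 101), ~0 on the diagonal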
3.4. Learning Algorithms

To learn HUSE, the network architecture presented in Section 3.2 is trained using the following loss:

    L = α·L_classification + β·L_graph + γ·L_gap        (5)

where L_classification, L_graph, and L_gap are the losses that give the learned embedding space the three properties discussed in Section 3.1, and α, β, and γ are weights that control the influence of these three losses. These three losses are discussed in more detail below.

3.4.2 Semantic Similarity

To make the learned universal embedding space semantically meaningful, where the embeddings of two semantically similar classes are closer than the embeddings of two semantically different classes, we regularize the embedding space using the semantic graph discussed in Section 3.3. This semantic graph regularization enforces the additional constraint that the distance between any two embeddings is equal to the edge weight of their corresponding classes in the semantic graph. As semantically similar classes have a smaller edge weight than semantically different classes, the regularization forces semantically similar classes to be closer to each other.

We use the following graph regularization loss to learn semantic similarity:

    L_graph = (1/N²) Σ_{m=1}^{N} Σ_{n=1}^{N} (d(φ(x_m^i), φ(x_n^j)) − A_ij)²        (9)

where x_m^i denotes the m-th instance, belonging to class i.

However, it is hard to satisfy the constraint that all pairs of embeddings adhere to the graph G. We therefore relax this constraint by adding a margin, so that the regularization is enforced on semantic classes that are closer than the margin, while other embedding pairs are made at least as large as the margin:

    σ_mn^ij = 1 if A_ij < ζ and d(φ(x_m^i), φ(x_n^j)) < ζ, and 0 otherwise,        (10)

where ζ is the margin. After relaxing the constraint, the resulting loss is

    L_graph = (1/N²) Σ_{m=1}^{N} Σ_{n=1}^{N} σ_mn^ij (d(φ(x_m^i), φ(x_n^j)) − A_ij)²        (11)
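A minimal sketch of the margin-relaxed graph regularization of Equations 10 and 11, assuming a batch of universal embeddings with class ids and the adjacency matrix A from Equation 4. The margin value and the batch shapes are placeholders, not values from the paper.

```python
# Sketch (not the authors' implementation) of L_graph with the relaxation of Eqs. (10)-(11).
import numpy as np

def cosine_distance_matrix(x: np.ndarray) -> np.ndarray:
    x_norm = x / np.linalg.norm(x, axis=1, keepdims=True)
    return 1.0 - x_norm @ x_norm.T

def graph_loss(embeddings: np.ndarray, labels: np.ndarray,
               A: np.ndarray, zeta: float = 0.4) -> float:
    """L_graph = (1/N^2) * sum_{m,n} sigma_mn * (d(phi_m, phi_n) - A[y_m, y_n])^2.
    zeta is the margin hyperparameter (0.4 is an arbitrary placeholder)."""
    n = embeddings.shape[0]
    d = cosine_distance_matrix(embeddings)       # pairwise embedding distances
    a = A[np.ix_(labels, labels)]                # edge weight of each pair's classes
    sigma = (a < zeta) & (d < zeta)              # Eq. (10): pairs the constraint applies to
    return float(np.sum(sigma * (d - a) ** 2) / (n * n))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, N, D = 101, 8, 512
    A = cosine_distance_matrix(rng.normal(size=(K, D)))  # stand-in for Eq. (4)
    phi = rng.normal(size=(N, D))                        # batch of universal embeddings
    y = rng.integers(0, K, size=N)
    print(graph_loss(phi, y, A))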
3.4.3 Cross Modal Gap

To reduce the cross-modal gap, the distance between the image and text embeddings corresponding to the same instance should be minimized. The following loss function is used to achieve instance-level similarity:

    L_gap = (1/N) Σ_{n=1}^{N} d(φ_I(p_n), φ_T(q_n))        (12)
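The gap loss of Equation 12 and the combined objective of Equation 5 can be sketched as follows. The loss weights α, β, γ are placeholders, since their values are not reported in this section.

```python
# Sketch (assumed, not from the paper's code) of L_gap (Eq. 12) and the total loss (Eq. 5).
import numpy as np

def paired_cosine_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - np.sum(a_n * b_n, axis=1)

def gap_loss(phi_img: np.ndarray, phi_txt: np.ndarray) -> float:
    """L_gap = (1/N) * sum_n d(phi_I(p_n), phi_T(q_n)), rows paired by instance."""
    return float(np.mean(paired_cosine_distance(phi_img, phi_txt)))

def total_loss(l_classification: float, l_graph: float, l_gap: float,
               alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """L = alpha*L_classification + beta*L_graph + gamma*L_gap (weights are placeholders)."""
    return alpha * l_classification + beta * l_graph + gamma * l_gap

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img, txt = rng.normal(size=(4, 512)), rng.normal(size=(4, 512))
    print(total_loss(l_classification=2.3, l_graph=0.1, l_gap=gap_loss(img, txt)))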
4. Experiments

We test our methodology on the UPMC Food-101 dataset. As HUSE incorporates semantic information derived from class labels when learning universal embeddings, we need a large multimodal classification dataset. The UPMC Food-101 dataset [36] is a very large multimodal classification dataset containing 101 food categories. The dataset has about 100,000 textual recipes and their associated images. Apple pie, baby back ribs, strawberry shortcake, and tuna tartare are some examples of food categories in this dataset.

4.1. Implementation Details

This section discusses the visual and textual features used by HUSE, followed by the methodology used to construct the semantic graph from class labels. Finally, we discuss the network parameters and training process.

4.1.1 Feature Representation

• Visual: We extract pretrained Graph-Regularized Image Semantic Embeddings (Graph-RISE) of size 64 from individual images [15]. Graph-RISE embeddings are trained on 260 million images to discriminate 40 million ultra-fine-grained semantic labels using a large-scale neural graph learning framework.

• Textual: We use BERT embeddings¹ [8] to obtain a representation of the text. Similar to Devlin et al. [8], we concatenate the embeddings from the last four layers for each token and then average all token embeddings for a document. In addition to the BERT encoding, we also calculate TF-IDF for the text features to provide a notion of the importance of each word in relation to the whole corpus, following pre-processing steps similar to [36]. As the number of tokens in each instance of UPMC Food-101 is far greater than the maximum number of tokens (512) that pre-trained BERT supports, we extract the 512 most salient tokens from the text. To extract the salient tokens for a single example, we consider every sentence in the text as a separate document to build a corpus. We then use this corpus as the input to a TF-IDF model [28] and extract the 512 most important tokens.

¹ https://github.com/google-research/bert
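The salient-token selection described above can be sketched as follows, with scikit-learn's TfidfVectorizer standing in for the TF-IDF model of [28]. The sentence splitting and the per-token aggregation by maximum score are assumptions, not the authors' pipeline.

```python
# Sketch: per-example salient-token selection via sentence-level TF-IDF (assumed details).
from sklearn.feature_extraction.text import TfidfVectorizer

def salient_tokens(text: str, max_tokens: int = 512) -> list:
    """Treat each sentence of one example as a document, score tokens with TF-IDF,
    and keep the highest-scoring tokens (512 to fit BERT's input limit)."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]   # naive sentence split
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(sentences)                     # (num_sentences, vocab)
    scores = tfidf.max(axis=0).toarray().ravel()                    # best score per token (assumption)
    vocab = vectorizer.get_feature_names_out()
    ranked = sorted(zip(vocab, scores), key=lambda t: t[1], reverse=True)
    return [tok for tok, _ in ranked[:max_tokens]]

if __name__ == "__main__":
    recipe = "Peel the apples. Slice the apples thinly. Bake the pie until golden."
    print(salient_tokens(recipe, max_tokens=5))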
4.1.2 Semantic Graph

We construct a semantic graph based on the embeddings extracted from the class names, as discussed in Section 3.4.2. As class names often contain more than a single word (e.g., apple pie), we use the Universal Sentence Encoder [5], which provides sentence-level embeddings, to extract embeddings of class names. To build the semantic graph, each class name is treated as a vertex and the cosine distance between the Universal Sentence Encoder embeddings of two class names is treated as the edge weight. This semantic graph is used in the graph loss stated in Equation 9.

4.1.3 Training Process

The image tower consists of 5 hidden layers of 512 hidden units each and the text tower consists of 2 hidden layers of 512 hidden units each. A dropout of 0.15 is used between all hidden layers of both towers. The network is trained using the RMSProp optimizer with a learning rate of 1.6192e-05 and momentum set to 0.9, with random batches of 1024 for 250,000 steps. These hyperparameters are chosen to maximize the image and text classification accuracies on the validation set of the UPMC Food-101 dataset.
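A sketch of the tower and optimizer configuration just described, written in PyTorch purely for illustration (the paper does not name a framework). The text-tower input size and the classifier wiring are assumptions; the 64-d image input matches the Graph-RISE features above.

```python
# Sketch (assumed PyTorch rendering) of the HUSE tower sizes and training hyperparameters.
import torch
from torch import nn

def tower(in_dim: int, num_hidden: int, hidden: int = 512, dropout: float = 0.15) -> nn.Sequential:
    """Stack of fully connected layers with ReLU and dropout between hidden layers."""
    layers, d = [], in_dim
    for _ in range(num_hidden):
        layers += [nn.Linear(d, hidden), nn.ReLU(), nn.Dropout(dropout)]
        d = hidden
    return nn.Sequential(*layers)

image_tower = tower(in_dim=64, num_hidden=5)      # 5 hidden layers of 512 units (Graph-RISE input)
text_tower = tower(in_dim=4096, num_hidden=2)     # 2 hidden layers of 512 units (input size is a placeholder)
shared_classifier = nn.Linear(512, 101)           # single classification layer shared by both towers

params = (list(image_tower.parameters()) + list(text_tower.parameters())
          + list(shared_classifier.parameters()))
optimizer = torch.optim.RMSprop(params, lr=1.6192e-05, momentum=0.9)
# Training would draw random batches of 1024 instances for 250,000 steps (loop not shown).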
4.2. Baselines

To test the effectiveness of HUSE, we compare it to the following state-of-the-art methods for cross-modal retrieval and visual semantic embeddings. As the visual semantic methods are modeled only for image embeddings, we extend them to produce textual embeddings as well. The input feature representation, number of hidden layers, hidden layer size, and other training parameters of the baselines are set similar to the HUSE model.
1. Triplet: This baseline uses the triplet loss with semi-hard online mining to decrease the distance between embeddings corresponding to similar classes, while increasing the distance between embeddings corresponding to different classes [31].

2. CME: This method learns Cross-Modal Embeddings by maximizing the cosine similarity between positive image-text pairs and minimizing it between all non-matching image-text pairs. CME uses an additional classification loss for semantic regularization [29].

3. AdaMine: AdaMine uses a double-triplet scheme to align instance-level and semantic-level embeddings [4].

4. DeViSE*²: DeViSE maps pre-trained image embeddings onto word embeddings of class labels learned from text corpora with a skip-gram language model [11]. We extend DeViSE to support text by mapping pre-trained text embeddings onto word embeddings of class labels. For a fair comparison, we use the class label embeddings used to construct the semantic graph instead of learning word embeddings of class labels with a skip-gram language model as in the original paper.

5. HIE*: Hierarchy-based Image Embeddings maps images to class label embeddings and uses an additional classification loss [3]. We extend this model to support text embeddings by mapping text to class centroids and using an additional classification loss on text. Although the original paper deterministically calculates the class embeddings based on the hierarchy, we use the class label embeddings used to construct the semantic graph for a fair comparison.

6. HUSE-P: This baseline has a similar architecture to HUSE and is meant to disentangle the architectural choices of HUSE from the graph loss. HUSE-P maps the universal embedding space to the class label embedding space by projecting image and text embeddings to class label embeddings, similar to DeViSE* and HIE*. HUSE-P uses the following projection loss instead of the graph loss in Equation 11:

    L_proj = (1/N) Σ_{n=1}^{N} [ d(φ_I(p_n^i), ψ(v_i)) + d(φ_T(q_n^i), ψ(v_i)) ]        (13)

² * indicates that the original model is extended to support text embeddings.

We also compare the classification accuracy of HUSE to the previously published results by Wang et al. [36] and Kiela et al. [16] on the UPMC Food-101 dataset. As the CME, HIE*, and HUSE-P baselines discussed above also have a classification layer, we also compare their classification accuracies to HUSE. In addition, we use separate classification models on the image and text modalities of UPMC Food-101 as another baseline for the classification task.

4.3. Retrieval Task

The universal embedding space learned by HUSE can be used for both in-modal and cross-modal retrieval. This section quantifies the performance of HUSE on two in-modal retrieval tasks (Image2Image and Text2Text) and two cross-modal retrieval tasks (Image2Text and Text2Image) by comparing it to the baselines discussed in Section 4.2. Given a query (image or text), the retrieval task is to retrieve the corresponding image or text from the same class. For evaluation, we consider each item in the test set as a query (for instance, an image), and we rank the other candidates according to the cosine distance between the query embedding and the candidate embeddings. The results are evaluated using the recall percentage at top K (R@K) over all queries in the test set. R@K corresponds to the percentage of queries for which the matching item is ranked among the top K closest results.

Table 1 shows the R@1, R@5, and R@10 results on the 4 retrieval tasks. The results show that HUSE outperforms all the baselines across all retrieval tasks in all measures on the UPMC Food-101 dataset. The more evident gains on cross-modal retrieval tasks compared to in-modal retrieval show that HUSE addresses the media gap more effectively than other methods. Moreover, the significant gains on the R@1 metric compared to the R@5 and R@10 metrics show that HUSE learns instance-level semantics better than the other baselines. The first three entries in Table 1 correspond to methods that don't include semantic information and the other entries correspond to semantic embedding methods. In general, the methods that use semantic information perform better than the methods that do not. This shows that semantic information is important even when class-level retrieval tasks are considered.

HUSE greatly outperforms all the baselines that don't use semantic information in all measures. The gains are even more significant on cross-modal retrieval tasks, and this can be attributed to the shared classification layer and instance loss used by HUSE, which force the embeddings from both modalities to be more aligned. CME is the best performing model among the baselines that don't use semantic information, and its architecture is comparatively more similar to HUSE, as both models have a classification loss and an instance loss. However, CME has separate classification layers for image and text, and its instance-level loss minimizes the distance between embeddings corresponding to an instance while maximizing the distance between embeddings of different instances, even if they belong to the same class.
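The semantic baselines above (DeViSE*, HIE*, HUSE-P) all project image and text embeddings onto fixed class-label embeddings. A minimal sketch of the projection loss of Equation 13, with assumed array shapes (not the authors' implementation):

```python
# Sketch of L_proj (Eq. 13): embeddings of instance n from class i are pulled toward psi(v_i).
import numpy as np

def cosine_distance_rows(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - np.sum(a_n * b_n, axis=1)

def projection_loss(phi_img: np.ndarray, phi_txt: np.ndarray,
                    labels: np.ndarray, class_embeddings: np.ndarray) -> float:
    """L_proj = (1/N) * sum_n [ d(phi_I(p_n), psi(v_i)) + d(phi_T(q_n), psi(v_i)) ]."""
    targets = class_embeddings[labels]            # psi(v_i) for each instance's class i
    return float(np.mean(cosine_distance_rows(phi_img, targets)
                         + cosine_distance_rows(phi_txt, targets)))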
Image2Image Image2Text Text2Image Text2Text
R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
Triplet 0.469 0.685 0.746 0.282 0.541 0.632 0.171 0.410 0.539 0.484 0.705 0.783
CME 0.660 0.790 0.821 0.430 0.552 0.592 0.090 0.313 0.458 0.802 0.880 0.901
AdaMine 0.191 0.387 0.499 0.042 0.160 0.266 0.038 0.134 0.221 0.215 0.350 0.447
DeViSE* 0.656 0.793 0.830 0.537 0.698 0.748 0.220 0.438 0.539 0.543 0.710 0.775
HIE* 0.649 0.792 0.832 0.590 0.712 0.751 0.430 0.721 0.800 0.786 0.870 0.897
HUSE-P 0.668 0.794 0.828 0.328 0.579 0.666 0.689 0.826 0.853 0.830 0.894 0.913
HUSE 0.685 0.806 0.835 0.702 0.763 0.778 0.803 0.889 0.903 0.875 0.917 0.927
Table 1: Retrieval performance of various methods on the four retrieval tasks of the UPMC Food-101 dataset. The first three rows (Triplet, CME, AdaMine) correspond to methods without semantic information and the remaining rows correspond to methods that incorporate semantic information. The best value in every column is achieved by HUSE.
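As a concrete illustration of the R@K protocol used for Table 1 (each test item as a query, candidates ranked by cosine distance, a hit when any of the top K results shares the query's class), here is a minimal sketch, not the authors' evaluation script:

```python
# Sketch: R@K for in-modal or cross-modal retrieval over class-labeled embeddings.
import numpy as np

def recall_at_k(query_emb, query_labels, cand_emb, cand_labels, k=1, same_modality=False):
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    dist = 1.0 - q @ c.T                         # cosine distance, queries x candidates
    if same_modality:                            # for in-modal retrieval, exclude self-matches
        np.fill_diagonal(dist, np.inf)
    topk = np.argsort(dist, axis=1)[:, :k]       # indices of the k closest candidates
    hits = (cand_labels[topk] == query_labels[:, None]).any(axis=1)
    return float(hits.mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img, txt = rng.normal(size=(500, 512)), rng.normal(size=(500, 512))
    y = rng.integers(0, 101, size=500)
    print("Image2Text R@5:", recall_at_k(img, y, txt, y, k=5))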
CME's instance-level loss is therefore different from the instance loss used by HUSE, which only minimizes the distance between embeddings corresponding to an instance. AdaMine is the worst performing baseline, and this can be attributed to the double-triplet loss the method uses. The instance-level triplet loss pushes the embeddings corresponding to different instances away from each other although they belong to the same class, whereas the class-level triplet loss tries to bring them closer. The conflict between these two losses makes AdaMine hard to optimize.

The results also show that HUSE significantly outperforms the baselines that use semantic information, although the performance difference is smaller than for the methods that don't use semantic information. HUSE still outshines these methods on cross-modal retrieval tasks. The semantic embedding baselines use a fixed class label embedding space and learn mappings to project image and text into that space. HUSE, on the other hand, has the flexibility of learning an embedding space that is completely different from the class label embedding space but has the same semantic distances as the class label embedding space. This flexibility allows HUSE to learn a better universal embedding space, resulting in better retrieval performance.

4.4. Semantic Quality

As HUSE incorporates semantic information into the embedding space, this section evaluates the quality of the semantic embedding space learned by HUSE. To measure this, we employ a hierarchical precision@K (HP@K) metric similar to [11] that measures the accuracy of model predictions with respect to a given semantic hierarchy. We create a taxonomy for the UPMC Food-101 classes based on the WordNet ontology [10] and use it to calculate HP@K by generating a set of classes from the semantic hierarchy for each k and computing the fraction of the model's k predictions that overlap with the class set. The HP@K values of all the examples in the test set are averaged and reported in Table 2.

HUSE outperforms all baselines by a significant margin on the HP@K metric, showing that HUSE learns a better semantic embedding space than the other baselines. Even when the performance of the best method in each retrieval task is compared to HUSE, the performance improvements show the superiority of HUSE in integrating semantic information into the embedding space. Moreover, these gains are even more prominent in cross-modal retrieval tasks (0.055 to 0.251) than in in-modal retrieval tasks (0.016 to 0.061). This performance improvement can be attributed to the fact that HUSE doesn't confine the universal embedding space to the class label embedding space. Among the three baselines that use semantic information, HUSE-P incorporates more semantic information than the other two methods, as measured by HP@K. The only difference between HUSE and HUSE-P is that HUSE-P uses a projection loss similar to DeViSE* and HIE*. As all three of these baselines map image and text embeddings to the class label space, the architectural choices of using a shared classification layer and an instance loss effectively reduce the media gap, resulting in better performance.

4.5. Classification Task

As HUSE is trained with a classification objective, it is natural to apply the model to the classification task. For an image-text pair, HUSE returns separate classification scores for image and text. These softmax scores are fused together using simple weighted averaging. Table 3 reports the image, text, and fusion classification accuracies of HUSE and the other baselines, along with previously reported results on the UPMC Food-101 dataset.

HUSE achieves an accuracy of 92.3% on the UPMC Food-101 dataset, outperforming the previous state of the art by 1.5%. Moreover, the previous state-of-the-art model used a complex gated attention method for fusing the image and text channels [16]. We, on the other hand, use simple weighted averaging to fuse the softmax scores from the image and text channels.
Image to Image Image to Text Text to Image Text to Text
HP@2 HP@5 HP@10 HP@2 HP@5 HP@10 HP@2 HP@5 HP@10 HP@2 HP@5 HP@10
Triplet 0.530 0.592 0.674 0.383 0.493 0.604 0.261 0.386 0.524 0.520 0.581 0.669
CME 0.031 0.086 0.167 0.033 0.077 0.162 0.034 0.079 0.166 0.031 0.085 0.167
AdaMine 0.164 0.252 0.376 0.096 0.214 0.350 0.084 0.179 0.309 0.235 0.296 0.397
DeViSE* 0.667 0.709 0.759 0.683 0.742 0.799 0.157 0.233 0.339 0.350 0.412 0.503
HIE* 0.682 0.718 0.768 0.664 0.721 0.783 0.522 0.607 0.698 0.800 0.820 0.854
HUSE-P 0.689 0.722 0.768 0.656 0.729 0.776 0.573 0.632 0.763 0.825 0.836 0.871
HUSE 0.705 0.740 0.787 0.741 0.784 0.831 0.824 0.848 0.878 0.884 0.897 0.916
Table 2: Hierarchical Precision of various methods on four retrieval tasks of UPMC Food-101.
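The HP@K numbers in Table 2 can be computed with a sketch like the following, assuming the per-depth class sets derived from the WordNet-based taxonomy are available as plain Python sets. The toy hierarchy in the example is illustrative only, not the paper's taxonomy.

```python
# Sketch (assumed implementation) of hierarchical precision@K as described in Section 4.4.
import numpy as np

def hp_at_k(topk_predictions: np.ndarray, true_labels: np.ndarray, hierarchy_sets: dict) -> float:
    """topk_predictions: (N, k) predicted class ids; hierarchy_sets[k][label] is the set of
    classes considered correct at depth k for that ground-truth label."""
    k = topk_predictions.shape[1]
    fractions = [
        len(set(preds.tolist()) & hierarchy_sets[k][int(label)]) / k
        for preds, label in zip(topk_predictions, true_labels)
    ]
    return float(np.mean(fractions))

if __name__ == "__main__":
    # Toy example: at k=2, label 0 accepts {0, 1} (e.g. two sibling dessert classes).
    hierarchy = {2: {0: {0, 1}, 1: {0, 1}, 2: {2, 3}, 3: {2, 3}}}
    preds = np.array([[0, 1], [1, 3], [2, 0], [3, 2]])
    labels = np.array([0, 1, 2, 3])
    print(hp_at_k(preds, labels, hierarchy))   # 0.75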
Method              Image   Text    Fusion
Wang et al. [36]    40.2    82.0    85.1
Kiela et al. [16]   56.7    88.0    90.8
Separate Models     72.4    87.2    91.9
CME                 72.4    78.8    88.1
HIE*                73.5    80.2    88.5
HUSE-P              73.1    83.9    89.6
HUSE                73.8    87.3    92.3

Table 3: Classification accuracy (%) on the UPMC Food-101 dataset.
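The weighted-average fusion behind the Fusion column of Table 3 reduces to a few lines; the weight value below is a placeholder, since the fusion weight is a hyperparameter.

```python
# Sketch: weighted averaging of per-modality softmax scores, as used for the Fusion column.
import numpy as np

def fuse_predictions(image_softmax: np.ndarray, text_softmax: np.ndarray, w: float = 0.5) -> np.ndarray:
    """Return fused class predictions from image and text softmax scores of shape (N, K)."""
    fused = w * image_softmax + (1.0 - w) * text_softmax
    return fused.argmax(axis=1)

if __name__ == "__main__":
    img = np.array([[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]])
    txt = np.array([[0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
    print(fuse_predictions(img, txt))   # [1 2] with equal weights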
At the individual channel level, HUSE's image accuracy outperformed the previous best by a large margin (17.1% gain) while performing slightly worse (0.7% loss) on the text channel.

However, the state-of-the-art classification results on this dataset can't be entirely attributed to HUSE. This is because we use different image (Graph-RISE [15]) and text (TF-IDF+BERT [8]) embeddings compared to the previous state of the art. To disentangle the contributions of the embeddings and the HUSE architecture to the classification accuracies, we trained "separate" image classification and text classification models using the same hyperparameters as HUSE. These models are simple classification models without any semantic regularization and have hidden layers and hidden dimensions similar to the image and text towers of HUSE. We see that these separate models achieve a fusion accuracy of 91.9%, outperforming the previous best by 1.1%. The HUSE architecture further improves this score by another 0.4%. The majority of the gains HUSE achieves on the image channel can be attributed to the better image embeddings, yet HUSE improves on them further by another 1.4%. On the text channel, the performance of HUSE and the separate classification model is similar. Unlike the separate classification models, HUSE imposes additional constraints on the image and text channels by making them share a single classification layer, and on the embedding space by including semantic information. However, these constraints did not regress the classification accuracies of HUSE, but improved them.

Table 3 also reports the classification accuracy of HUSE-P, where we replace the graph regularization loss with the projection loss to map the image and text embeddings onto the class label embeddings. We see that the performance of HUSE-P is inferior to HUSE. More interestingly, we see a performance regression on the text and fusion accuracies compared to the baseline of separate models. These results show that mapping the universal embedding space to the class embedding space degrades the resulting embeddings, whereas constraining the universal embedding space to have the same semantic distances as the class embedding space improves the performance.

Based on these results, we can say that constraining the embedding space for cross-modal retrieval decreases the classification performance compared to unconstrained classification models. Adding semantic information to the embedding space boosts the classification performance. Instead of mapping the universal embedding space to the class label embedding space, allowing the universal embedding space to have the same semantic distances as the class label embedding space significantly improves the classification performance beyond the unconstrained classification models.

5. Conclusion

We proposed a novel architecture, HUSE, to learn a universal embedding space that incorporates semantic information. Unlike previous methods that map image and text embeddings to a fixed class label embedding space, HUSE learns a new universal embedding space that still has the same semantic distances as the class label embedding space. These less constrained universal embeddings outperformed several other baselines on multiple retrieval tasks. Moreover, the embedding space learned by HUSE has more semantic information than the other baselines, as measured by the HP@K metric. The shared classification layer used by HUSE for both image and text embeddings and the instance loss reduced the media gap and resulted in superior cross-modal performance. Moreover, HUSE also achieved a state-of-the-art classification accuracy of 92.3% on the UPMC Food-101 dataset, outperforming the previous best by 1.5%.
References

[1] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In International Conference on Machine Learning, pages 1247-1255, 2013.
[2] John Arevalo, Thamar Solorio, Manuel Montes-y-Gómez, and Fabio A. González. Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992, 2017.
[3] Björn Barz and Joachim Denzler. Hierarchy-based image embeddings for semantic image retrieval. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 638-647. IEEE, 2019.
[4] Micael Carvalho, Rémi Cadène, David Picard, Laure Soulier, Nicolas Thome, and Matthieu Cord. Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 35-44. ACM, 2018.
[5] Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. Universal sentence encoder. arXiv preprint arXiv:1803.11175, 2018.
[6] Ju Yong Chang and Kyoung Mu Lee. Large margin learning of hierarchical semantic similarity for image classification. Computer Vision and Image Understanding, 132:3-11, 2015.
[7] Jia Deng, Alexander C. Berg, and Li Fei-Fei. Hierarchical semantic indexing for large scale image retrieval. In CVPR 2011, pages 785-792. IEEE, 2011.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[9] Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, and Sanja Fidler. VSE++: Improved visual-semantic embeddings. arXiv preprint arXiv:1707.05612, 2017.
[10] C. Fellbaum. WordNet: Wiley Online Library. The Encyclopedia of Applied Linguistics, 1998.
[11] Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121-2129, 2013.
[12] Yunchao Gong, Qifa Ke, Michael Isard, and Svetlana Lazebnik. A multi-view embedding space for modeling internet images, tags, and their semantics. International Journal of Computer Vision, 106(2):210-233, 2014.
[13] Jiuxiang Gu, Jianfei Cai, Shafiq R. Joty, Li Niu, and Gang Wang. Look, imagine and match: Improving textual-visual cross-modal retrieval with generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7181-7189, 2018.
[14] Yan Huang, Qi Wu, Chunfeng Song, and Liang Wang. Learning semantic concepts and order for image and sentence matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6163-6171, 2018.
[15] Da-Cheng Juan, Chun-Ta Lu, Zhen Li, Futang Peng, Aleksei Timofeev, Yi-Ting Chen, Yaxi Gao, Tom Duerig, Andrew Tomkins, and Sujith Ravi. Graph-RISE: Graph-Regularized Image Semantic Embedding. 2019.
[16] Douwe Kiela, Edouard Grave, Armand Joulin, and Tomas Mikolov. Efficient large-scale multi-modal classification. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[17] Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
[18] Dong Li, Hsin-Ying Lee, Jia-Bin Huang, Shengjin Wang, and Ming-Hsuan Yang. Learning structured semantic embeddings for visual recognition. arXiv preprint arXiv:1706.01237, 2017.
[19] Pradyumna Narayana. Improving Gesture Recognition Through Spatial Focus of Attention. PhD thesis, Colorado State University, 2018.
[20] Pradyumna Narayana, J. Ross Beveridge, and Bruce A. Draper. Interacting hidden Markov models for video understanding. International Journal of Pattern Recognition and Artificial Intelligence, 32(11):1855020, 2018.
[21] Pradyumna Narayana, J. Ross Beveridge, and Bruce A. Draper. Analyzing multi-channel networks for gesture recognition. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1-8. IEEE, 2019.
[22] Pradyumna Narayana, J. Ross Beveridge, and Bruce A. Draper. Continuous gesture recognition through selective temporal fusion. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1-8. IEEE, 2019.
[23] Pradyumna Narayana, Ross Beveridge, and Bruce A. Draper. Gesture recognition: Focus on the hands. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5235-5244, 2018.
[24] Shah Nawaz, Alessandro Calefati, Muhammad Kamran Janjua, Muhammad Umer Anwaar, and Ignazio Gallo. Learning fused representations for large-scale multimodal classification. IEEE Sensors Letters, 3(1):1-4, 2018.
[25] Yuxin Peng, Xin Huang, and Yunzhen Zhao. An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges. IEEE Transactions on Circuits and Systems for Video Technology, 28(9):2372-2385, 2017.
[26] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, 2014.
[27] Nikhil Rasiwasia, Jose Costa Pereira, Emanuele Coviello, Gabriel Doyle, Gert R. G. Lanckriet, Roger Levy, and Nuno Vasconcelos. A new approach to cross-modal multimedia retrieval. In Proceedings of the 18th ACM International Conference on Multimedia, pages 251-260. ACM, 2010.
[28] Radim Řehůřek and Petr Sojka. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45-50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.
[29] Amaia Salvador, Nicholas Hynes, Yusuf Aytar, Javier Marin, Ferda Ofli, Ingmar Weber, and Antonio Torralba. Learning cross-modal embeddings for cooking recipes and food images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3020-3028, 2017.
[30] Nikolaos Sarafianos, Xiang Xu, and Ioannis A. Kakadiaris. Adversarial representation learning for text-to-image matching. arXiv preprint arXiv:1908.10534, 2019.
[31] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815-823, 2015.
[32] Yale Song and Mohammad Soleymani. Polysemous visual-semantic embedding for cross-modal retrieval. arXiv preprint arXiv:1906.04402, 2019.
[33] Nakul Verma, Dhruv Mahajan, Sundararajan Sellamanickam, and Vinod Nair. Learning hierarchical similarity metrics. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2280-2287. IEEE, 2012.
[34] Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, and Heng Tao Shen. Adversarial cross-modal retrieval. In Proceedings of the 25th ACM International Conference on Multimedia, pages 154-162. ACM, 2017.
[35] Liwei Wang, Yin Li, and Svetlana Lazebnik. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5005-5013, 2016.
[36] Xin Wang, Devinder Kumar, Nicolas Thome, Matthieu Cord, and Frederic Precioso. Recipe recognition with large multimodal food dataset. In 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pages 1-6. IEEE, 2015.
[37] Jennifer Williams, Ramona Comanescu, Oana Radu, and Leimin Tian. DNN multimodal fusion techniques for predicting video sentiment. In Proceedings of Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML), pages 64-72, 2018.
[38] Fei Yan and Krystian Mikolajczyk. Deep correlation for matching images and text. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3441-3450, 2015.
[39] Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Jagadeesh, Dennis DeCoste, Wei Di, and Yizhou Yu. HD-CNN: Hierarchical deep convolutional neural networks for large scale visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 2740-2748, 2015.
[40] Keren Ye and Adriana Kovashka. ADVISE: Symbolism and external knowledge for decoding advertisements. In Proceedings of the European Conference on Computer Vision (ECCV), pages 837-855, 2018.
[41] Bin Zhao, Fei Li, and Eric P. Xing. Large-scale category structure aware image categorization. In Advances in Neural Information Processing Systems, pages 1251-1259, 2011.
[42] Liangli Zhen, Peng Hu, Xu Wang, and Dezhong Peng. Deep supervised cross-modal retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10394-10403, 2019.