
In Factuality: Efficient Integration of Relevant Facts for Visual Question Answering

Peter Vickers*, Nikolaos Aletras*, Emilio Monti+ and Loïc Barrault*
* Department of Computer Science, University of Sheffield
+ Amazon United Kingdom
{pgjvickers1, n.aletras, l.barrault}@sheffield.ac.uk
[email protected]

Abstract

Visual Question Answering (VQA) methods aim at leveraging visual input to answer questions that may require complex reasoning over entities. Current models are trained on labelled data that may be insufficient to learn complex knowledge representations. In this paper, we propose a new method to enhance the reasoning capabilities of a multi-modal pretrained model (Vision+Language BERT) by integrating facts extracted from an external knowledge base. Evaluation on the KVQA dataset benchmark demonstrates that our method outperforms competitive baselines by 19%, achieving new state-of-the-art results. We also perform an extensive analysis highlighting the limitations of our best performing model through an ablation study.

1 Introduction

Visual Question Answering (VQA) is a popular multi-modal task of answering a question about an image. It tracks both the inter-modal interactions and the reasoning capabilities of models (Wang et al., 2017; Marino et al., 2019). Recent studies have tested compositional reasoning (Johnson et al., 2016; Hudson and Manning, 2019) and the integration of external knowledge (Wang et al., 2017, 2016; Shah et al., 2019; Marino et al., 2019) for VQA. In this paper, we address Knowledge-aware VQA (KVQA) (Shah et al., 2019)1, defined as a VQA task where it is not reasonable to expect a model without access to a knowledge base to be able to answer the questions in the test set.

1 For data, examples, and licence information, please see https://malllabiisc.github.io/resources/kvqa/

In a uni-modal textual context, both synthetic dataset (Kassner et al., 2020) and task-driven (Ding et al., 2020) studies of neural models have shown significant competence at symbolic reasoning. This is encouraging, as neural pretrained Language Models such as BERT (Devlin et al., 2019) achieve state-of-the-art results in a wide range of natural language inference tasks and benchmarks such as Natural Language Inference (Bowman et al., 2015). Rajani et al. (2019) use pretraining on a domain-specific dataset to improve CommonsenseQA by 10% absolute accuracy. Tamborrino et al. (2020) develop an improved training objective to improve COPA by 10% absolute accuracy.

Bouraoui et al. (2020) find that BERT is capable of relational induction, whilst Broscheit (2019) and Petroni et al. (2020) find that BERT stores non-trivial world knowledge.

Previous work has argued that restriction to a uni-modal context may itself impair reasoning performance (Barsalou, 2008; Li et al., 2020). In a bi-modal Vision+Language (V+L) context, datasets such as CLEVR and GQA allow for the evaluation of both model reasoning and language grounding. Within this setting, Ding et al. (2020) and Lu et al. (2020) show that appropriate neural models trained on large quantities of data can exhibit accurate reasoning.

In this paper, we propose a new method of applying a massively pretrained V+L BERT model (Chen et al., 2020) to the KVQA task (Shah et al., 2019). Our method is able to learn a set of reasoning types (confirming findings in Ding et al. (2020)) but can increase performance even more by incorporating external factual information. KVQA answers require attending to a knowledge base, allowing us to quantify the contribution of both explicit and implicit knowledge extracted from supervised training data. We also quantify the degree to which corpus bias makes certain question types harder, and outline how future datasets may be better balanced.

Our contributions are as follows:

• We perform factual integration into a V+L BERT-based model architecture for VQA, leading to a 19.1% accuracy improvement over previous baselines on KVQA.

• We evaluate our model's reasoning capabilities through an ablation study, proposing explanations for poor performance on certain question types as well as highlighting our model's strong preference for text and facts over the image modality.

• We conduct a bias study of the KVQA dataset, revealing both strengths and potential improvements for future VQA datasets.

2 Related Work

VQA tasks explicitly encourage grounded reasoning (Antol et al., 2015), with emphasis on a variety of sub-domains, such as commonsense (Zellers et al., 2019), compositionality and grounding (Suhr et al., 2020), factual reasoning (Wang et al., 2017) or external knowledge reasoning (Wang et al., 2016; Marino et al., 2019; Shah et al., 2019).

State-of-the-art systems for external knowledge VQA are based on Memory Networks (MemNet; Weston et al., 2014). In Shah et al. (2019), the facts are extracted from the Knowledge Graph (KG) by considering the visual entities (from the image) and, where available, textual entities (from the Wikipedia caption). They are then embedded using a Bi-LSTM encoder and fed into the memory. After the question is embedded in a similar way, the resulting representation is used to query the memory by soft attention. Several stacked memory layers are used to better model multi-hop facts.

Wang et al. (2016, 2017) introduce two datasets, KB-VQA and FVQA respectively, and address the task with systems that perform searches in a visual knowledge graph formed from the image and a KB. The question is first mapped to a query of the form 〈visual object, relationship, answer source〉, which is then used to extract the supporting facts from the KB. They report improved results when compared to systems using LSTM, SVM and hierarchical co-attention (Lu et al., 2016).

In Marino et al. (2019), the OK-VQA dataset is presented with some baseline results obtained with MUTAN (Ben-younes et al., 2017), a multimodal tensor-based Tucker decomposition which models interactions between visual (from a CNN) and textual (from an RNN) representations. Those systems exhibit rather low performance compared to those obtained on standard VQA, demonstrating that the corpus requires external knowledge to be solved correctly.

Recent work has introduced methods to incorporate visual information to create Vision+Language BERT models through joint multimodal embeddings (Chen et al., 2020; Su et al., 2019; Lu et al., 2019). First, image and text are embedded into the same space, and then Transformer networks are applied as in the standard BERT model (Devlin et al., 2019).

Our work is most similar to that of Shah et al. (2019), since the same preprocessing pipeline is used. However, our system does not use a memory network, and instead relies on a BERT-based model (UNITER, see Section 3) to model the relationship between question, facts, and image with self-attention layers.

3 Methodology

To answer KVQA with neural models, we first take the V+L BERT model UNITER (Chen et al., 2020) with the highest score on the commonsense VQA task, VCR (Zellers et al., 2019).

In order to allow UNITER to accept external KG facts, we cast these facts to a textual form 'Entity1 Relation Entity2'. To keep the input fact count small, we perform a conditional search of the KG. The KVQA task consists in finding a*:

    a* = argmax_{a ∈ A} p(a | q, i, K) ≈ argmax_{a ∈ A} p(a | q, i, k_{i,q})    (1)

where a* is the correct answer out of the candidate set A, and q, i, and K are a question, image and knowledge base, respectively. As shown, we may reduce the KG through a conditional search to find the relevant subset of facts k_{i,q}.

To define the subset k_{i,q}, we follow Shah et al. (2019) in extracting all facts from the knowledge base that are up to two hops from any entities detected by the textual entity linking or the face detection.

Figure 1: Our Model

Our model, as presented in Figure 1, consists of two stages: preprocessing, which implements relevant fact extraction, and reasoning, which selects an answer from the question, facts, and image features.

3.1 Preprocessing Stage

For preprocessing and fact acquisition, we broadly reproduce the fact and feature extraction process used in Shah et al. (2019). We perform object detection with the Faster R-CNN network (Ren et al., 2017). A seven-dimensional normalised size and location vector is concatenated with the Faster R-CNN features.
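The exact layout of the seven-dimensional vector is not specified in the text; a common choice in V+L BERT pipelines, and the one assumed in this sketch, is the normalised box corners, width, height, and relative area, concatenated to the Faster R-CNN region feature.

```python
import numpy as np

def box_position_features(box, image_w, image_h):
    """Assumed 7-d layout: [x1, y1, x2, y2, w, h, area], all normalised to [0, 1]."""
    x1, y1, x2, y2 = box
    w = (x2 - x1) / image_w
    h = (y2 - y1) / image_h
    return np.array([x1 / image_w, y1 / image_h,
                     x2 / image_w, y2 / image_h,
                     w, h, w * h], dtype=np.float32)

def region_feature(rcnn_feature, box, image_w, image_h):
    """Concatenate the Faster R-CNN feature with the 7-d size and location vector."""
    return np.concatenate([rcnn_feature, box_position_features(box, image_w, image_h)])

# Example: a 2048-d visual feature for one detected box in a 640x480 image becomes 2055-d.
feat = region_feature(np.zeros(2048, dtype=np.float32), (64, 48, 320, 240), 640, 480)
print(feat.shape)  # (2055,)
```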

For person detection, we use MTCNN (Zhang et al., 2016) and FaceNet (Schroff et al., 2015) models, pretrained on the MS-Celeb-1M (Guo et al., 2016) dataset, to generate 128-dimensional embeddings. We predict names by nearest-neighbour comparison with the KVQA reference dataset. We treat name identification as a multi-class classification problem, achieving a Micro-F1 of 0.539. Since this is lower than reported in Shah et al. (2019), we follow them in applying a textual entity linker (van Hulst et al., 2020) over the supplied image descriptions. This setup achieves a per-image Micro-F1 of 0.686.
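The nearest-neighbour name prediction can be read as a simple 1-NN classifier over the 128-dimensional face embeddings; the sketch below assumes the KVQA reference faces are already embedded, and the function name is ours.

```python
import numpy as np

def predict_names(face_embeddings, reference_embeddings, reference_names):
    """Assign each detected face the name of its nearest reference embedding (L2 distance)."""
    names = []
    for emb in face_embeddings:  # one 128-d FaceNet embedding per detected face
        dists = np.linalg.norm(reference_embeddings - emb, axis=1)
        names.append(reference_names[int(np.argmin(dists))])
    return names

# Example with two reference identities and one detected face.
refs = np.random.randn(2, 128).astype(np.float32)
print(predict_names([refs[1] + 0.01], refs, ["Angela Merkel", "Barack Obama"]))  # ['Barack Obama']
```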
Normalised image location facts are generated from these detections, such as 'Barack Obama at 42 78', which would indicate that the centre of the bounding box for Barack Obama is at normalised (0-100) position x=42, y=78 of the image. We use the names of identified entities to query Shah et al.'s (2019) reduced Wikidata graph (Vrandečić and Krötzsch, 2014) up to two hops. The extracted facts are finally cast to the form 'subject relation object'.
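A location fact such as 'Barack Obama at 42 78' can be derived directly from the detected bounding box; the sketch below shows the intended computation (box centre scaled to a 0-100 grid), with the function name being ours rather than the paper's.

```python
def location_fact(name, box, image_w, image_h):
    """Format a detected entity's box centre as a textual fact, e.g. 'Barack Obama at 42 78'.

    box is (x1, y1, x2, y2) in pixels; the centre is normalised to a 0-100 grid.
    """
    x1, y1, x2, y2 = box
    cx = round(100 * (x1 + x2) / (2 * image_w))
    cy = round(100 * (y1 + y2) / (2 * image_h))
    return f"{name} at {cx} {cy}"

print(location_fact("Barack Obama", (220, 300, 320, 450), 640, 480))  # 'Barack Obama at 42 78'
```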
3.2 Reasoning Stage

The neural model we use, UNITER, is pretrained on MS COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2016), Conceptual Captions (Sharma et al., 2018), and SBU Captions (Ordonez et al., 2011). It is a multi-task system that is trained on performing Masked Language Modeling, Image-Text Matching, and Masked Region Modeling (Chen et al., 2020).

4 Experimental Setup

We select the KVQA dataset for two reasons: to our knowledge, it is the largest external knowledge dataset (with 183k questions), and its questions are annotated with their reasoning types. We use accuracy as the evaluation metric and provide results over both the entire dataset and also for each question type as provided in the KVQA dataset.

The baseline systems for KVQA are those presented in Shah et al. (2019) and discussed in Section 2. The first baseline is a stacked BLSTM encoder, operating over question and facts. This system has an overall accuracy of 48.0%. The second is the MemNet architecture, which has the previously highest performing baseline accuracy at 50.2%.

We use the UNITER BASE pretrained model available at the ChenRocks GitHub repository2 with custom classification layers (MLP + softmax output layer). For task training, we merge retrieved facts with the question, dividing each statement with the '[SEP]' token, following research that indicates that this token induces partitioning and pipelining of information across attention layers (Clark et al., 2019). The textual input stream is tokenised with the HuggingFace 'bert-base-uncased' tokeniser (Wolf et al., 2020).

2 https://github.com/ChenRocks/UNITER
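A minimal sketch of the textual input construction described above: the question and each retrieved fact are joined with '[SEP]' and tokenised with the HuggingFace 'bert-base-uncased' tokeniser. The truncation length and the use of '[SEP]' follow the text; the padding strategy, function name, and example strings are assumptions.

```python
from transformers import BertTokenizer

MAX_LEN = 412  # maximum WordPiece sequence length reported above

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def build_text_input(question, facts):
    """Merge the question with retrieved facts, dividing each statement with '[SEP]'."""
    text = " [SEP] ".join([question] + facts)
    return tokenizer(
        text,
        max_length=MAX_LEN,
        truncation=True,
        padding="max_length",
        return_tensors="pt",
    )

encoding = build_text_input(
    "Who is to the left of Barack Obama?",
    ["Barack Obama at 42 78", "Barack Obama spouse Michelle Obama"],
)
print(encoding["input_ids"].shape)  # torch.Size([1, 412])
```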

We set the maximum WordPiece sequence length to 412, the maximum visual object count to 100, and the learning rate to 8 × 10^-5, and use AdamW (Loshchilov and Hutter, 2017) as the optimizer. Once preprocessing is completed, we train the UNITER model with the cross-entropy objective function for 80,000 iterations, which we empirically found to guarantee convergence.
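The fine-tuning loop implied by this configuration can be sketched as follows, assuming model is a UNITER-style encoder with a classification head and loader yields batches containing the tokenised text, visual features, and an answer label; the optimiser, learning rate, loss, and iteration count mirror the values above, while the batch keys and model signature are assumptions.

```python
import torch
from torch.optim import AdamW

def finetune(model, loader, num_iterations=80_000, lr=8e-5, device="cuda"):
    """Train the answer classifier with a cross-entropy objective for a fixed number of iterations."""
    model.to(device).train()
    optimizer = AdamW(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    step = 0
    while step < num_iterations:
        for batch in loader:
            logits = model(
                input_ids=batch["input_ids"].to(device),        # assumed batch key
                visual_feats=batch["visual_feats"].to(device),  # assumed batch key
            )
            loss = criterion(logits, batch["answer"].to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= num_iterations:
                break
    return model
```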
5 Results

Table 1 shows the results of our system (UNITER), using a question label break-down similar to Shah et al. (2019). Overall, we observe that our system outperforms the previous baseline MemNet setting (see 'World+WikiCap+ORG' in Shah et al. (2019)) with an absolute improvement of 19%.

Question Type        MemNet   UNITER   Entropy (Base 2)
1-Hop                61.0     65.7     7.8
1-Hop Counting       -        78.0     1.4
1-Hop Subtraction    -        28.6     4.3
Boolean              75.1     94.6     1.1
Comparison           50.5     90.4     2.1
Counting             49.5     79.4     2.3
Intersection         72.5     79.4     1.2
Multi-Entity         43.5     77.1     3.3
Multi-Hop            53.2     87.9     3.7
Multi-Relation       45.2     75.2     7.1
Spatial              48.1     21.2     11.5
Subtraction          40.5     34.4     6.0
Overall              50.2     69.3     7.6

Table 1: Results in terms of % accuracy of the considered systems, broken down into question type, along with the base-2 entropy of each question type's answer distribution (last column).

Our results show that UNITER is learning to perform reasoning more accurately than MemNet in all but two cases. In the question types involving multiple entities ('Multi-Entity', 'Multi-Hop', 'Multi-Relation'), the increase is the greatest, suggesting that UNITER is able to robustly learn these reasoning types. We speculate that the stacked self-attention layers in BERT are able to better attend to the many involved entities than MemNet.

We now discuss the performance of our model on its weakest categories, namely 'Subtraction' and 'Spatial'. The poor performance on 'Subtraction' questions confirms previous results that BERT-like models require specialised pretraining for numerical reasoning tasks (Geva et al., 2020). In the case of our model specifically, we note the lack of numerical reasoning tasks in UNITER's pretraining regime. 'Spatial' is the model's least accurate question type (21.4%) and the biggest absolute decrease from MemNet (-26.7%). This question type requires two-hop reasoning where the second hop is a numerical operation of the form argmin_y(x_i − y_i). Both of these have been shown to be problematic for BERT (Kassner et al., 2020; Geva et al., 2020).
6 Analysis

UNITER performs well at the reasoning tasks in general, with the most surprising result being that it apparently does better at multi-hop reasoning than one-hop. We believe that this can be explained by the unbalanced distribution of answer types in the dataset perturbing the results (see Table 1). We discuss this in Section 6.1.

In order to better understand the reasoning capability of our model and the impact of each input modality, we perform an inference time ablation study, presented in Table 2.

Question Type        Q+F+I   Q+F    Q+I    F+I    Q      F      I
1-Hop                65.7    65.7   32.4   3.9    32.4   3.8    4.5
1-Hop Counting       78.0    78.0   30.3   0.0    30.3   0.0    0.0
1-Hop Subtraction    28.9    28.6   28.8   0.8    30.3   0.6    6.5
Boolean              94.6    94.6   55.2   1.3    55.2   1.0    10.5
Comparison           90.4    90.4   38.7   1.0    38.7   0.9    10.7
Counting             79.4    79.4   66.1   0.6    65.9   0.4    1.4
Intersection         79.4    79.4   61.0   0.4    60.6   0.3    0.0
Multi-Entity         77.1    77.1   41.3   0.8    41.2   0.7    6.4
Multi-Hop            87.9    87.9   29.0   0.8    28.9   0.8    0.0
Multi-Relation       75.2    75.2   25.1   3.0    25.0   3.0    2.5
Spatial              21.2    21.2   0.0    13.0   0.0    13.0   0.0
Subtraction          34.4    34.4   1.3    1.0    0.9    0.7    0.0
Overall              69.3    69.3   31.6   3.1    31.5   3.0    3.6

Table 2: Ablation study of information. Q=Question, I=Image, F=Facts. Image refers to the image feature stream. Results are expressed as % accuracy by question type.

Ablation of image features (column 'Q+F') does not change the performance, suggesting that the model is not attending to image features. To confirm this hypothesis, we performed an experiment with adversarial images, obtaining very similar results for each question type and the same overall score (69.30%). We explain this behaviour by the fact that the preprocessing pipeline extracts all the required information as explicit facts, which the model prefers over the more ambiguous visual features. We leave a deeper analysis for further work.

An interesting case is the 'Spatial' questions, where facts alone are able to correctly answer 13% of the questions. This is likely the result of the answers to this question type being entities present in the facts. Again, we observe that the model is not able to learn this information from the visual features.
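Both the inference-time ablation and the adversarial checks can be expressed as simple transformations of the evaluation batch, as in the hedged sketch below; the batch keys, the use of zero vectors for blanked features, and the within-batch permutation used to approximate random reassignment are our assumptions, not code from the paper.

```python
import torch

def ablate_modality(batch, modality):
    """Blank one input stream at inference time (e.g. zero the image features for 'Q+F')."""
    ablated = dict(batch)
    key = "visual_feats" if modality == "image" else "fact_input_ids"  # assumed keys
    ablated[key] = torch.zeros_like(batch[key])
    return ablated

def adversarial_swap(batch, modality):
    """Replace a modality with inputs drawn from other examples (here: a permutation of the batch)."""
    swapped = dict(batch)
    key = "visual_feats" if modality == "image" else "fact_input_ids"
    perm = torch.randperm(batch[key].size(0))
    swapped[key] = batch[key][perm]  # each example now sees another example's features
    return swapped
```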

6.1 Bias Studies

We briefly discuss corpus bias, a well-known concern in VQA (Goyal et al., 2019). We consider question difficulty across three parameters: reasoning difficulty, task design, and corpus bias. Certain question types are inherently more complex, as discussed in Section 5. Additionally, question types may have different numbers of answer classes, effectively weakening any priors models might form (see the Entropy column in Table 1). Finally, an unbalanced dataset may cause certain reasoning types to be underrepresented, making them harder for models to learn. 'Spatial' and 'Subtraction' questions are among the least represented in the training dataset, which increases their difficulty for the model.
data split We also perform adversarial checks, where ran-
dom images or facts from the data split are pre-
sented at inference time. These align closely with
6.1 Bias Studies the ablation study, with adversarial images (Col-
We briefly discuss the corpus bias, a well-known umn ‘I’ of Table 3) performing within 0.1% of
concern in VQA (Goyal et al., 2019). We con- blanked images (Column ‘Q+F’ of Table 3) and
sider question difficulty across three parameters: adversarial facts (Column ’F’ of Table 3) perform-
reasoning difficulty, task design, and corpus bias. ing within 1% of blanked facts (Column ‘Q+I’ of
Certain question types are inherently more com- Table 3). These results confirm the importance of
plex, as discussed in Section 5. Additionally, the factual data and the unimportance of raw image
task may have different numbers of answer classes features to a model trained on the full data.
per task, effectively weakening any priors mod-
els might form (see Entropy column in Table 1). 7 Conclusion and Future Work
Finally, an unbalanced dataset may cause certain
reasoning types to be underrepresented, making it We evaluated our model and found that it improves
harder for models to learn for them. ‘Spatial’ and on the previous state of the art by a substantial
‘Substraction’ questions are among the least repre- margin (19.1%). An ablation study revealed the
sented in the training dataset, which increase their specific strengths and weaknesses of our model on
difficulty for the model. certain question categories when evaluated on the
Unseen answer classes are also an issue. For KVQA dataset. We show that the UNITER model
‘Spatial’ questions, only 54.2% of the test answers is not actually using the visual input.
(output classes) are actually seen during training, In the future, we seek to create a large exter-
placing an upper bound on accuracy. We find nal knowledge dataset designed following KVQA
98.4% of ‘Spatial’ questions the model answered with more entities besides persons to encourage
correctly and 95.7% of ‘Spatial’ question the model grounded reasoning, and better calibration of an-
answered incorrectly were supplied with adequate swer types. We will also consider pretraining our
facts by the preprocessing pipeline. model on closely related tasks. This will help to
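Coverage figures such as the 54.2% above can be computed directly from the train and test splits; a minimal sketch, assuming each split is a list of (question_type, answer) pairs:

```python
def seen_answer_fraction(train_samples, test_samples, question_type):
    """Fraction of test answers of a given question type whose answer class
    also appears somewhere in the training split."""
    train_answers = {a for t, a in train_samples if t == question_type}
    test_answers = [a for t, a in test_samples if t == question_type]
    if not test_answers:
        return 0.0
    return sum(1 for a in test_answers if a in train_answers) / len(test_answers)

# The paper reports 54.2% for 'Spatial' questions on the KVQA splits.
```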
Training time ablation and adversarial experiments. To further probe the task, we perform a training time ablation with first facts, and then facts and images, removed (see Table 3). In this way, we seek to exhibit the capability of our model to leverage the available modalities and to compensate for the missing ones.

                     Train Ablation      Adversarial Modality*
Question Type        Q+I      Q          I        F
1-Hop                47.09    38.5       65.9     31.3
1-Hop Counting       66.1     61.5       75.2     50.5
1-Hop Subtraction    29.4     29.7       28.1     26.2
Boolean              83.9     67.3       94.1     57.5
Comparison           83.4     60.3       90.6     47.8
Counting             75.4     75.2       78.9     70.2
Intersection         67.6     67.9       76.8     61.2
Multi-Entity         69.4     57.2       76.4     47.6
Multi-Hop            56.5     50.2       87.9     38.4
Multi-Relation       47.3     38.9       75.2     28.3
Spatial              3.3      1.2        21.1     0.0
Subtraction          2.1      2.6        39.2     1.6
Overall              47.0     40.8       69.3     32.8

Table 3: Further ablation and adversarial studies. *Adversarial Modality indicates that the sample from that modality was randomly assigned from the entire data split.

Through comparing the training time and inference time ablations, we can better understand the importance of a modality to solving the task.

Through comparing train and inference ablation of facts (the 'Q+I' column of Table 3 and of Table 2), we observe that when facts are unavailable at train time, the model attends to images to obtain 47.0% accuracy, which is 15.4% more than the 31.6% obtained by the corresponding inference time ablation. This indicates that the visual modality can provide useful information for this task.

We observe a similar trend in the fact and image ablation setting (the 'Q' column of Table 3 and of Table 2): the model is able to better leverage questions to make accurate predictions when additional modalities are never available.

We also perform adversarial checks, where random images or facts from the data split are presented at inference time. These align closely with the ablation study, with adversarial images (column 'I' of Table 3) performing within 0.1% of blanked images (column 'Q+F' of Table 2) and adversarial facts (column 'F' of Table 3) performing within 1% of blanked facts (column 'Q+I' of Table 2). These results confirm the importance of factual data and the unimportance of raw image features to a model trained on the full data.

7 Conclusion and Future Work

We evaluated our model and found that it improves on the previous state of the art by a substantial margin (19.1%). An ablation study revealed the specific strengths and weaknesses of our model on certain question categories when evaluated on the KVQA dataset. We show that the UNITER model is not actually using the visual input.

In the future, we seek to create a large external knowledge dataset designed following KVQA, with more entities besides persons to encourage grounded reasoning, and better calibration of answer types. We will also consider pretraining our model on closely related tasks. This will help to form a model capable of learning robust reasoning with a high degree of spatial specificity and entity discrimination.

Acknowledgements

Peter Vickers is supported by the Centre for Doctoral Training in Speech and Language Technologies (SLT) and their Applications funded by the UK Research and Innovation grant EP/S023062/1.

Ethical Statement

This work is based on the open-source KVQA dataset, an English multimodal dataset, and the Wikidata knowledge base (also in English). No English-specific preprocessing was used for this research and the UNITER model is language agnostic, which tends to suggest that this could generalize to other languages. We will make our code publicly available to ensure the reproducibility of our experiments in the following repository.

References

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, volume 2015 Inter, pages 2425–2433.

Lawrence W Barsalou. 2008. Grounded cognition. Annual Review of Psychology, 59:617–645.

Hedi Ben-younes, Rémi Cadene, Matthieu Cord, and Nicolas Thome. 2017. MUTAN: Multimodal Tucker Fusion for Visual Question Answering. Proceedings of the IEEE International Conference on Computer Vision, 2017-Octob:2631–2639.

Zied Bouraoui, Jose Camacho-Collados, and Steven Schockaert. 2020. Inducing Relational Knowledge from BERT. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7456–7463.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Conference Proceedings - EMNLP 2015: Conference on Empirical Methods in Natural Language Processing, pages 632–642. Association for Computational Linguistics (ACL).

Samuel Broscheit. 2019. Investigating entity knowledge in BERT with simple neural end-to-end entity linking. In CoNLL 2019 - 23rd Conference on Computational Natural Language Learning, Proceedings of the Conference, pages 677–685. Association for Computational Linguistics.

Yen Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: UNiversal Image-TExt Representation Learning. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 12375 LNCS:104–120.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. 2019. What Does BERT Look at? An Analysis of BERT's Attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Jacob Devlin, Ming Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, volume 1, pages 4171–4186.

David Ding, Felix Hill, Adam Santoro, and Matt Botvinick. 2020. Object-based attention for spatio-temporal reasoning: Outperforming neuro-symbolic models with flexible distributed architectures.

Mor Geva, Ankit Gupta, and Jonathan Berant. 2020. Injecting Numerical Reasoning Skills into Language Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 946–958, Online. Association for Computational Linguistics.

Yash Goyal, Tejas Khot, Aishwarya Agrawal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2019. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. International Journal of Computer Vision, 127(4):398–414.

Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. 2016. MS-celeb-1M: A dataset and benchmark for large-scale face recognition. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 9907 LNCS, pages 87–102.

Drew A Hudson and Christopher D Manning. 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2019-June, pages 6693–6702.

Johannes M. van Hulst, Faegheh Hasibi, Koen Dercksen, Krisztian Balog, and Arjen P. de Vries. 2020. REL: An Entity Linker Standing on the Shoulders of Giants. SIGIR 2020 - Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2197–2200.

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. 2016. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017-Janua:1988–1997.

Nora Kassner, Benno Krojer, and Hinrich Schütze. 2020. Are Pretrained Language Models Symbolic Reasoners over Knowledge? In Proceedings of the 24th Conference on Computational Natural Language Learning, pages 552–564, Online. Association for Computational Linguistics.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, Michael S Bernstein, and Li Fei-Fei. 2016. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision, 123(1):32–73.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2020. What Does BERT with Vision Look At? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5265–5275, Online. Association for Computational Linguistics.

Tsung Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), PART 5, pages 740–755. Springer Verlag.

Ilya Loshchilov and Frank Hutter. 2017. Fixing Weight Decay Regularization in Adam. CoRR, abs/1711.0.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Advances in Neural Information Processing Systems, volume 32, pages 13–23. Curran Associates, Inc.

Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 2020. 12-in-1: Multi-Task Vision and Language Representation Learning. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems, volume 29, pages 289–297. Curran Associates, Inc.

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. 2019. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019-June:3190–3199.

Vicente Ordonez, Girish Kulkarni, and Tamara Berg. 2011. Im2Text: Describing Images Using 1 Million Captioned Photographs. In Advances in Neural Information Processing Systems, volume 24, pages 1143–1151. Curran Associates, Inc.

Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2020. Language models as knowledge bases? In EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, pages 2463–2473. Association for Computational Linguistics.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain Yourself! Leveraging Language Models for Commonsense Reasoning. ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pages 4932–4942.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2017. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 07-12-June, pages 815–823.

Sanket Shah, Anand Mishra, Naganand Yadati, and Partha Pratim Talukdar. 2019. KVQA: Knowledge-Aware Visual Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):8876–8884.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers), volume 1, pages 2556–2565. Association for Computational Linguistics (ACL).

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. arXiv.

Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2020. A corpus for reasoning about natural language grounded in photographs. In ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pages 6418–6428. Association for Computational Linguistics (ACL).

Alexandre Tamborrino, Nicola Pellicanò, Baptiste Pannier, Pascal Voitot, and Louise Naudin. 2020. Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3878–3887, Online. Association for Computational Linguistics.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78–85.

Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton Van Den Hengel. 2017. Explicit knowledge-based reasoning for visual question answering. In IJCAI International Joint Conference on Artificial Intelligence, volume 0, pages 1290–1296. International Joint Conferences on Artificial Intelligence.

Peng Wang, Qi Wu, Chunhua Shen, Anton van den Hengel, and Anthony Dick. 2016. FVQA: Fact-based Visual Question Answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(10):2413–2427.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory Networks. 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. HuggingFace's Transformers: State-of-the-art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2019-June, pages 6713–6724.

Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. 2016. Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks. IEEE Signal Processing Letters, 23(10):1499–1503.
