
In Factuality: Efficient Integration of Relevant Facts for Visual Question Answering

Peter Vickers*, Nikolaos Aletras*, Emilio Monti+ and Loïc Barrault*
* Department of Computer Science, University of Sheffield
+ Amazon United Kingdom
{pgjvickers1, n.aletras, l.barrault}@sheffield.ac.uk
[email protected]

Abstract

Visual Question Answering (VQA) methods aim at leveraging visual input to answer questions that may require complex reasoning over entities. Current models are trained on labelled data that may be insufficient to learn complex knowledge representations. In this paper, we propose a new method to enhance the reasoning capabilities of a multi-modal pretrained model (Vision+Language BERT) by integrating facts extracted from an external knowledge base. Evaluation on the KVQA dataset benchmark demonstrates that our method outperforms competitive baselines by 19%, achieving new state-of-the-art results. We also perform an extensive analysis highlighting the limitations of our best performing model through an ablation study.

1 Introduction

Visual Question Answering (VQA) is a popular multi-modal task of answering a question about an image. It tracks both the inter-modal interactions and the reasoning capabilities of models (Wang et al., 2017; Marino et al., 2019). Recent studies have tested compositional reasoning (Johnson et al., 2016; Hudson and Manning, 2019) and the integration of external knowledge (Wang et al., 2017, 2016; Shah et al., 2019; Marino et al., 2019) for VQA. In this paper, we address Knowledge-aware VQA (KVQA) (Shah et al., 2019)1, defined as a VQA task where it is not reasonable to expect a model without access to a knowledge base to be able to answer the questions in the test set.

1 For data, examples, and licence information, please see https://malllabiisc.github.io/resources/kvqa/

In a uni-modal textual context, both synthetic dataset (Kassner et al., 2020) and task-driven (Ding et al., 2020) studies of neural models have shown significant competence at symbolic reasoning. This is encouraging, as neural pretrained Language Models such as BERT (Devlin et al., 2019) achieve state-of-the-art results in a wide range of natural language inference tasks and benchmarks such as Natural Language Inference (Bowman et al., 2015). Rajani et al. (2019) use pretraining on a domain-specific dataset to improve CommonsenseQA by 10% absolute accuracy. Tamborrino et al. (2020) develop an improved training objective to improve COPA by 10% absolute accuracy.

Bouraoui et al. (2020) find that BERT is capable of relational induction, whilst Broscheit (2019) and Petroni et al. (2020) find that BERT stores non-trivial world knowledge.

Previous work has argued that restriction to a uni-modal context may itself impair reasoning performance (Barsalou, 2008; Li et al., 2020). In a bi-modal Vision+Language (V+L) context, datasets such as CLEVR and GQA allow for the evaluation of both model reasoning and language grounding. Within this setting, Ding et al. (2020) and Lu et al. (2020) show that appropriate neural models trained on large quantities of data can exhibit accurate reasoning.

In this paper, we propose a new method of applying a massively pretrained V+L BERT model (Chen et al., 2020) to the KVQA task (Shah et al., 2019). Our method is able to learn a set of reasoning types (confirming findings in Ding et al. (2020)) but can increase performance even more by incorporating external factual information. KVQA answers require attending to a knowledge base, allowing us to quantify the contribution of both explicit and implicit knowledge extracted from supervised training data. We also quantify the degree to which corpus bias makes certain question types harder, and outline how future datasets may be better balanced.

Our contributions are as follows:

• We perform factual integration into a V+L BERT-based model architecture for VQA, leading to a 19.1% accuracy improvement over previous baselines on KVQA.

• We evaluate our model's reasoning capabilities through an ablation study, proposing explanations for poor performance on certain question types as well as highlighting our model's strong preference for text and facts over the image modality.

• We conduct a bias study of the KVQA dataset, revealing both strengths and potential improvements for future VQA datasets.

2 Related Work

VQA tasks explicitly encourage grounded reasoning (Antol et al., 2015), with emphasis on a variety of sub-domains, such as commonsense (Zellers et al., 2019), compositionality and grounding (Suhr et al., 2020), factual reasoning (Wang et al., 2017) or external knowledge reasoning (Wang et al., 2016; Marino et al., 2019; Shah et al., 2019).

State-of-the-art systems for external knowledge VQA are based on Memory Networks (MemNet; Weston et al., 2014). In Shah et al. (2019), the facts are extracted from the Knowledge Graph (KG) by considering the visual entities (from the image) and, where available, textual entities (from the Wikipedia caption). They are then embedded using a Bi-LSTM encoder and fed into the memory. After the question is embedded in a similar way, the resulting representation is used to query the memory by soft attention. Several stacked memory layers are used to better model multi-hop facts.

Wang et al. (2016, 2017) introduce two datasets, KB-VQA and FVQA respectively, and address the task with systems that perform searches in a visual knowledge graph formed from the image and a KB. The question is first mapped to a query of the form 〈visual object, relationship, answer source〉, which is then used to extract the supporting facts from the KB. They report improved results when compared to systems using LSTM, SVM and hierarchical co-attention (Lu et al., 2016).

In Marino et al. (2019), the OK-VQA dataset is presented with some baseline results obtained with MUTAN (Ben-younes et al., 2017), a multimodal tensor-based Tucker decomposition which models interactions between visual (from a CNN) and textual (from an RNN) representations. Those systems exhibit rather low performance compared to those obtained on standard VQA, demonstrating that the corpus requires external knowledge to be solved correctly.

Recent work has introduced methods to incorporate visual information to create Vision+Language BERT models through joint multimodal embeddings (Chen et al., 2020; Su et al., 2019; Lu et al., 2019). First, image and text are embedded into the same space, and then Transformer networks are applied as in the standard BERT model (Devlin et al., 2019).

Our work is most similar to that of Shah et al. (2019), since the same preprocessing pipeline is used. However, our system does not use a memory network, and instead relies on a BERT-based model (UNITER, see Section 3) to model the relationship between question, facts, and image with self-attention layers.

3 Methodology

To answer KVQA with neural models, we first take the V+L BERT model UNITER (Chen et al., 2020) with the highest score on the commonsense VQA task, VCR (Zellers et al., 2019).

In order to allow UNITER to accept external KG facts, we cast these facts to a textual form 'Entity1 Relation Entity2'. To keep the input fact count small, we perform a conditional search of the KG. The KVQA task consists in finding a*:

    a* = argmax_{a ∈ A} p(a | q, i, K) ≈ argmax_{a ∈ A} p(a | q, i, k_{i,q})    (1)

where a* is the correct answer out of the candidate set A, and q, i, and K are a question, image and knowledge base, respectively. As shown, we may reduce the KG through a conditional search to find the relevant subset of facts k_{i,q}.

To define the subset k_{i,q}, we follow Shah et al. (2019) in extracting all facts from the knowledge base that are up to two hops from any entities detected by the textual entity linking or the face detection.

Figure 1: Our Model

Our model, as presented in Figure 1, consists of two stages: preprocessing, which implements relevant fact extraction, and reasoning, which selects an answer from the question, facts, and image features.

3.1 Preprocessing Stage

For preprocessing and fact acquisition, we broadly reproduce the fact and feature extraction process used in Shah et al. (2019). We perform object detection with the Faster R-CNN network (Ren et al., 2017). A seven-dimensional normalised size and location vector is concatenated with the Faster R-CNN features.
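The exact layout of the seven-dimensional vector is not specified in the text; a common choice in V+L BERT pipelines, and the one assumed in this sketch, is the normalised box corners, width, height, and relative area, concatenated to the Faster R-CNN region feature.

```python
import numpy as np

def box_position_features(box, image_w, image_h):
    """Assumed 7-d layout: [x1, y1, x2, y2, w, h, area], all normalised to [0, 1]."""
    x1, y1, x2, y2 = box
    w = (x2 - x1) / image_w
    h = (y2 - y1) / image_h
    return np.array([x1 / image_w, y1 / image_h,
                     x2 / image_w, y2 / image_h,
                     w, h, w * h], dtype=np.float32)

def region_feature(rcnn_feature, box, image_w, image_h):
    """Concatenate the Faster R-CNN feature with the 7-d size and location vector."""
    return np.concatenate([rcnn_feature, box_position_features(box, image_w, image_h)])

# Example: a 2048-d visual feature for one detected box in a 640x480 image becomes 2055-d.
feat = region_feature(np.zeros(2048, dtype=np.float32), (64, 48, 320, 240), 640, 480)
print(feat.shape)  # (2055,)
```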

For person detection, we use MTCNN (Zhang et al., 2016) and FaceNet (Schroff et al., 2015) models, pretrained on the MS-Celeb-1M (Guo et al., 2016) dataset, to generate 128-dimensional embeddings. We predict names by nearest-neighbour comparison with the KVQA reference dataset. We treat name identification as a multi-class classification problem, achieving a Micro-F1 of 0.539. Since this is lower than reported in Shah et al. (2019), we follow them in applying a textual entity linker (van Hulst et al., 2020) over the supplied image descriptions. This setup achieves a per-image Micro-F1 of 0.686.
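The nearest-neighbour name prediction can be read as a simple 1-NN classifier over the 128-dimensional face embeddings; the sketch below assumes the KVQA reference faces are already embedded, and the function name is ours.

```python
import numpy as np

def predict_names(face_embeddings, reference_embeddings, reference_names):
    """Assign each detected face the name of its nearest reference embedding (L2 distance)."""
    names = []
    for emb in face_embeddings:  # one 128-d FaceNet embedding per detected face
        dists = np.linalg.norm(reference_embeddings - emb, axis=1)
        names.append(reference_names[int(np.argmin(dists))])
    return names

# Example with two reference identities and one detected face.
refs = np.random.randn(2, 128).astype(np.float32)
print(predict_names([refs[1] + 0.01], refs, ["Angela Merkel", "Barack Obama"]))  # ['Barack Obama']
```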
Normalised image location facts are generated from these detections, such as 'Barack Obama at 42 78', which would indicate that the centre of the bounding box for Barack Obama is at normalised (0-100) position x=42, y=78 of the image. We use the names of identified entities to query Shah et al.'s (2019) reduced Wikidata graph (Vrandečić and Krötzsch, 2014) up to two hops. The extracted facts are finally cast to the form 'subject relation object'.
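A location fact such as 'Barack Obama at 42 78' can be derived directly from the detected bounding box; the sketch below shows the intended computation (box centre scaled to a 0-100 grid), with the function name being ours rather than the paper's.

```python
def location_fact(name, box, image_w, image_h):
    """Format a detected entity's box centre as a textual fact, e.g. 'Barack Obama at 42 78'.

    box is (x1, y1, x2, y2) in pixels; the centre is normalised to a 0-100 grid.
    """
    x1, y1, x2, y2 = box
    cx = round(100 * (x1 + x2) / (2 * image_w))
    cy = round(100 * (y1 + y2) / (2 * image_h))
    return f"{name} at {cx} {cy}"

print(location_fact("Barack Obama", (220, 300, 320, 450), 640, 480))  # 'Barack Obama at 42 78'
```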
3.2 Reasoning Stage

The neural model we use, UNITER, is pretrained on MS COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2016), Conceptual Captions (Sharma et al., 2018), and SBU Captions (Ordonez et al., 2011). It is a multi-task system that is trained on performing Masked Language Modeling, Image-Text Matching, and Masked Region Modeling (Chen et al., 2020).

4 Experimental Setup

We select the KVQA dataset for two reasons: to our knowledge, it is the largest external knowledge dataset (with 183k questions), and its questions are annotated with their reasoning types. We use accuracy as the evaluation metric and provide results over both the entire dataset and also for each question type as provided in the KVQA dataset.

The baseline systems for KVQA are those presented in Shah et al. (2019) and discussed in Section 2. The first baseline is a stacked BLSTM encoder, operating over question and facts. This system has an overall accuracy of 48.0%. The second is the MemNet architecture, which has the previously highest performing baseline accuracy at 50.2%.

We use the UNITER BASE pretrained model available at the ChenRocks GitHub repository2 with custom classification layers (MLP + softmax output layer). For task training, we merge retrieved facts with the question, dividing each statement with the '[SEP]' token, following research that indicates that this token induces partitioning and pipelining of information across attention layers (Clark et al., 2019). The textual input stream is tokenised with the HuggingFace 'bert-base-uncased' tokeniser (Wolf et al., 2020).

2 https://github.com/ChenRocks/UNITER
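A minimal sketch of the textual input construction described above: the question and each retrieved fact are joined with '[SEP]' and tokenised with the HuggingFace 'bert-base-uncased' tokeniser. The truncation length and the use of '[SEP]' follow the text; the padding strategy, function name, and example strings are assumptions.

```python
from transformers import BertTokenizer

MAX_LEN = 412  # maximum WordPiece sequence length reported above

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def build_text_input(question, facts):
    """Merge the question with retrieved facts, dividing each statement with '[SEP]'."""
    text = " [SEP] ".join([question] + facts)
    return tokenizer(
        text,
        max_length=MAX_LEN,
        truncation=True,
        padding="max_length",
        return_tensors="pt",
    )

encoding = build_text_input(
    "Who is to the left of Barack Obama?",
    ["Barack Obama at 42 78", "Barack Obama spouse Michelle Obama"],
)
print(encoding["input_ids"].shape)  # torch.Size([1, 412])
```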

We set the maximum WordPiece sequence length to 412, the maximum visual object count to 100, and the learning rate to 8 × 10^-5, and use AdamW (Loshchilov and Hutter, 2017) as the optimizer. Once preprocessing is completed, we train the UNITER model with the cross-entropy objective function for 80,000 iterations, which we empirically found to guarantee convergence.
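The fine-tuning loop implied by this configuration can be sketched as follows, assuming model is a UNITER-style encoder with a classification head and loader yields batches containing the tokenised text, visual features, and an answer label; the optimiser, learning rate, loss, and iteration count mirror the values above, while the batch keys and model signature are assumptions.

```python
import torch
from torch.optim import AdamW

def finetune(model, loader, num_iterations=80_000, lr=8e-5, device="cuda"):
    """Train the answer classifier with a cross-entropy objective for a fixed number of iterations."""
    model.to(device).train()
    optimizer = AdamW(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    step = 0
    while step < num_iterations:
        for batch in loader:
            logits = model(
                input_ids=batch["input_ids"].to(device),        # assumed batch key
                visual_feats=batch["visual_feats"].to(device),  # assumed batch key
            )
            loss = criterion(logits, batch["answer"].to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= num_iterations:
                break
    return model
```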
5 Results

Table 1 shows the results of our system (UNITER), using a question label break-down similar to Shah et al. (2019). Overall, we observe that our system outperforms the previous baseline MemNet setting (see 'World+WikiCap+ORG' in Shah et al. (2019)) with an absolute improvement of 19%.

Question Type        MemNet   UNITER   Entropy (Base 2)
1-Hop                61.0     65.7     7.8
1-Hop Counting       -        78.0     1.4
1-Hop Subtraction    -        28.6     4.3
Boolean              75.1     94.6     1.1
Comparison           50.5     90.4     2.1
Counting             49.5     79.4     2.3
Intersection         72.5     79.4     1.2
Multi-Entity         43.5     77.1     3.3
Multi-Hop            53.2     87.9     3.7
Multi-Relation       45.2     75.2     7.1
Spatial              48.1     21.2     11.5
Subtraction          40.5     34.4     6.0
Overall              50.2     69.3     7.6

Table 1: Results in terms of % accuracy of the considered systems, broken down into question type, along with the base-2 entropy of each question type's answer distribution (last column).

Our results show that UNITER is learning to perform reasoning more accurately than MemNet in all but two cases. In the question types involving multiple entities ('Multi-Entity', 'Multi-Hop', 'Multi-Relation'), the increase is the greatest, suggesting that UNITER is able to robustly learn these reasoning types. We speculate that the stacked self-attention layers in BERT are able to better attend to the many involved entities than MemNet.

We now discuss the performance of our model on its weakest categories, namely 'Subtraction' and 'Spatial'. The poor performance on 'Subtraction' questions confirms previous results that BERT-like models require specialised pretraining for numerical reasoning tasks (Geva et al., 2020). In the case of our model specifically, we note the lack of numerical reasoning tasks in UNITER's pretraining regime. 'Spatial' is the model's least accurate question type (21.4%) and the biggest absolute decrease from MemNet (-26.7%). This question type requires two-hop reasoning where the second hop is a numerical operation of the form argmin_y(x_i − y_i). Both of these have been shown to be problematic for BERT (Kassner et al., 2020; Geva et al., 2020).
6 Analysis

UNITER performs well at the reasoning tasks in general, with the most surprising result being that it apparently does better at multi-hop reasoning than one-hop. We believe that this can be explained by the unbalanced distribution of answer types in the dataset perturbing the results (see Table 1). We discuss this in Section 6.1.

In order to better understand the reasoning capability of our model and the impact of each input modality, we perform an inference time ablation study, presented in Table 2.

Question Type        Q+F+I   Q+F    Q+I    F+I    Q      F      I
1-Hop                65.7    65.7   32.4   3.9    32.4   3.8    4.5
1-Hop Counting       78.0    78.0   30.3   0.0    30.3   0.0    0.0
1-Hop Subtraction    28.9    28.6   28.8   0.8    30.3   0.6    6.5
Boolean              94.6    94.6   55.2   1.3    55.2   1.0    10.5
Comparison           90.4    90.4   38.7   1.0    38.7   0.9    10.7
Counting             79.4    79.4   66.1   0.6    65.9   0.4    1.4
Intersection         79.4    79.4   61.0   0.4    60.6   0.3    0.0
Multi-Entity         77.1    77.1   41.3   0.8    41.2   0.7    6.4
Multi-Hop            87.9    87.9   29.0   0.8    28.9   0.8    0.0
Multi-Relation       75.2    75.2   25.1   3.0    25.0   3.0    2.5
Spatial              21.2    21.2   0.0    13.0   0.0    13.0   0.0
Subtraction          34.4    34.4   1.3    1.0    0.9    0.7    0.0
Overall              69.3    69.3   31.6   3.1    31.5   3.0    3.6

Table 2: Ablation study of information. Q=Question, I=Image, F=Facts. Image refers to the image feature stream. Results are expressed as % accuracy by question type.

Ablation of image features (column 'Q+F') does not change the performance, suggesting that the model is not attending to image features. To confirm this hypothesis, we performed an experiment with adversarial images, obtaining very similar results for each question type and the same overall score (69.30%). We explain this behaviour by the fact that the preprocessing pipeline extracts all the required information as explicit facts, which the model prefers over the more ambiguous visual features. We leave a deeper analysis for further work.

An interesting case is the 'Spatial' questions, where facts alone are able to correctly answer 13% of the questions. This is likely the result of the answers to this question type being entities present in the facts. Again, we observe that the model is not able to learn this information from the visual features.
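Both the inference-time ablation and the adversarial checks can be expressed as simple transformations of the evaluation batch, as in the hedged sketch below; the batch keys, the use of zero vectors for blanked features, and the within-batch permutation used to approximate random reassignment are our assumptions, not code from the paper.

```python
import torch

def ablate_modality(batch, modality):
    """Blank one input stream at inference time (e.g. zero the image features for 'Q+F')."""
    ablated = dict(batch)
    key = "visual_feats" if modality == "image" else "fact_input_ids"  # assumed keys
    ablated[key] = torch.zeros_like(batch[key])
    return ablated

def adversarial_swap(batch, modality):
    """Replace a modality with inputs drawn from other examples (here: a permutation of the batch)."""
    swapped = dict(batch)
    key = "visual_feats" if modality == "image" else "fact_input_ids"
    perm = torch.randperm(batch[key].size(0))
    swapped[key] = batch[key][perm]  # each example now sees another example's features
    return swapped
```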

6.1 Bias Studies

We briefly discuss corpus bias, a well-known concern in VQA (Goyal et al., 2019). We consider question difficulty across three parameters: reasoning difficulty, task design, and corpus bias. Certain question types are inherently more complex, as discussed in Section 5. Additionally, question types may have different numbers of answer classes, effectively weakening any priors models might form (see the Entropy column in Table 1). Finally, an unbalanced dataset may cause certain reasoning types to be underrepresented, making them harder for models to learn. 'Spatial' and 'Subtraction' questions are among the least represented in the training dataset, which increases their difficulty for the model.
data split We also perform adversarial checks, where ran-
dom images or facts from the data split are pre-
sented at inference time. These align closely with
6.1 Bias Studies the ablation study, with adversarial images (Col-
We briefly discuss the corpus bias, a well-known umn ‘I’ of Table 3) performing within 0.1% of
concern in VQA (Goyal et al., 2019). We con- blanked images (Column ‘Q+F’ of Table 3) and
sider question difficulty across three parameters: adversarial facts (Column ’F’ of Table 3) perform-
reasoning difficulty, task design, and corpus bias. ing within 1% of blanked facts (Column ‘Q+I’ of
Certain question types are inherently more com- Table 3). These results confirm the importance of
plex, as discussed in Section 5. Additionally, the factual data and the unimportance of raw image
task may have different numbers of answer classes features to a model trained on the full data.
per task, effectively weakening any priors mod-
els might form (see Entropy column in Table 1). 7 Conclusion and Future Work
Finally, an unbalanced dataset may cause certain
reasoning types to be underrepresented, making it We evaluated our model and found that it improves
harder for models to learn for them. ‘Spatial’ and on the previous state of the art by a substantial
‘Substraction’ questions are among the least repre- margin (19.1%). An ablation study revealed the
sented in the training dataset, which increase their specific strengths and weaknesses of our model on
difficulty for the model. certain question categories when evaluated on the
Unseen answer classes are also an issue. For KVQA dataset. We show that the UNITER model
‘Spatial’ questions, only 54.2% of the test answers is not actually using the visual input.
(output classes) are actually seen during training, In the future, we seek to create a large exter-
placing an upper bound on accuracy. We find nal knowledge dataset designed following KVQA
98.4% of ‘Spatial’ questions the model answered with more entities besides persons to encourage
correctly and 95.7% of ‘Spatial’ question the model grounded reasoning, and better calibration of an-
answered incorrectly were supplied with adequate swer types. We will also consider pretraining our
facts by the preprocessing pipeline. model on closely related tasks. This will help to
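Coverage figures such as the 54.2% above can be computed directly from the train and test splits; a minimal sketch, assuming each split is a list of (question_type, answer) pairs:

```python
def seen_answer_fraction(train_samples, test_samples, question_type):
    """Fraction of test answers of a given question type whose answer class
    also appears somewhere in the training split."""
    train_answers = {a for t, a in train_samples if t == question_type}
    test_answers = [a for t, a in test_samples if t == question_type]
    if not test_answers:
        return 0.0
    return sum(1 for a in test_answers if a in train_answers) / len(test_answers)

# The paper reports 54.2% for 'Spatial' questions on the KVQA splits.
```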
Training time ablation and adversarial experiments. To further probe the task, we perform a training time ablation with first facts, and then facts and images, removed (see Table 3). In this way, we seek to exhibit the capability of our model to leverage the available modalities and to compensate for the missing ones.

                     Train Ablation      Adversarial Modality*
Question Type        Q+I      Q          I        F
1-Hop                47.09    38.5       65.9     31.3
1-Hop Counting       66.1     61.5       75.2     50.5
1-Hop Subtraction    29.4     29.7       28.1     26.2
Boolean              83.9     67.3       94.1     57.5
Comparison           83.4     60.3       90.6     47.8
Counting             75.4     75.2       78.9     70.2
Intersection         67.6     67.9       76.8     61.2
Multi-Entity         69.4     57.2       76.4     47.6
Multi-Hop            56.5     50.2       87.9     38.4
Multi-Relation       47.3     38.9       75.2     28.3
Spatial              3.3      1.2        21.1     0.0
Subtraction          2.1      2.6        39.2     1.6
Overall              47.0     40.8       69.3     32.8

Table 3: Further ablation and adversarial studies. *Adversarial Modality indicates that the sample from that modality was randomly assigned from the entire data split.

Through comparing the training time and inference time ablations, we can better understand the importance of a modality to solving the task.

Through comparing train and inference ablation of facts (the 'Q+I' column of Table 3 and of Table 2), we observe that when facts are unavailable at train time, the model attends to images to obtain 47.0% accuracy, which is 15.4% more than the 31.6% obtained by the corresponding inference time ablation. This indicates that the visual modality can provide useful information for this task.

We observe a similar trend in the fact and image ablation setting (the 'Q' column of Table 3 and of Table 2): the model is able to better leverage questions to make accurate predictions when additional modalities are never available.

We also perform adversarial checks, where random images or facts from the data split are presented at inference time. These align closely with the ablation study, with adversarial images (column 'I' of Table 3) performing within 0.1% of blanked images (column 'Q+F' of Table 2) and adversarial facts (column 'F' of Table 3) performing within 1% of blanked facts (column 'Q+I' of Table 2). These results confirm the importance of factual data and the unimportance of raw image features to a model trained on the full data.

7 Conclusion and Future Work

We evaluated our model and found that it improves on the previous state of the art by a substantial margin (19.1%). An ablation study revealed the specific strengths and weaknesses of our model on certain question categories when evaluated on the KVQA dataset. We show that the UNITER model is not actually using the visual input.

In the future, we seek to create a large external knowledge dataset designed following KVQA, with more entities besides persons to encourage grounded reasoning, and better calibration of answer types. We will also consider pretraining our model on closely related tasks. This will help to form a model capable of learning robust reasoning with a high degree of spatial specificity and entity discrimination.

Acknowledgements

Peter Vickers is supported by the Centre for Doctoral Training in Speech and Language Technologies (SLT) and their Applications funded by the UK Research and Innovation grant EP/S023062/1.

Ethical Statement

This work is based on the open-source KVQA dataset, an English multimodal dataset, and the Wikidata knowledge base (also in English). No English-specific preprocessing was used for this research and the UNITER model is language agnostic, which tends to suggest that this could generalize to other languages. We will make our code publicly available to ensure the reproducibility of our experiments in the following repository.

References

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, volume 2015 Inter, pages 2425–2433.

Lawrence W Barsalou. 2008. Grounded cognition. Annual Review of Psychology, 59:617–645.

Hedi Ben-younes, Rémi Cadene, Matthieu Cord, and Nicolas Thome. 2017. MUTAN: Multimodal Tucker Fusion for Visual Question Answering. Proceedings of the IEEE International Conference on Computer Vision, 2017-Octob:2631–2639.

Zied Bouraoui, Jose Camacho-Collados, and Steven Schockaert. 2020. Inducing Relational Knowledge from BERT. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7456–7463.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Conference Proceedings - EMNLP 2015: Conference on Empirical Methods in Natural Language Processing, pages 632–642. Association for Computational Linguistics (ACL).

Samuel Broscheit. 2019. Investigating entity knowledge in BERT with simple neural end-to-end entity linking. In CoNLL 2019 - 23rd Conference on Computational Natural Language Learning, Proceedings of the Conference, pages 677–685. Association for Computational Linguistics.

Yen Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. UNITER: UNiversal Image-TExt Representation Learning. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 12375 LNCS:104–120.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D Manning. 2019. What Does BERT Look at? An Analysis of BERT's Attention. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 276–286, Florence, Italy. Association for Computational Linguistics.

Jacob Devlin, Ming Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference, volume 1, pages 4171–4186.

David Ding, Felix Hill, Adam Santoro, and Matt Botvinick. 2020. Object-based attention for spatio-temporal reasoning: Outperforming neuro-symbolic models with flexible distributed architectures.

Mor Geva, Ankit Gupta, and Jonathan Berant. 2020. Injecting Numerical Reasoning Skills into Language Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 946–958, Online. Association for Computational Linguistics.

Yash Goyal, Tejas Khot, Aishwarya Agrawal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2019. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. International Journal of Computer Vision, 127(4):398–414.

Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. 2016. MS-celeb-1M: A dataset and benchmark for large-scale face recognition. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 9907 LNCS, pages 87–102.

Drew A Hudson and Christopher D Manning. 2019. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2019-June, pages 6693–6702.

Johannes M. van Hulst, Faegheh Hasibi, Koen Dercksen, Krisztian Balog, and Arjen P. de Vries. 2020. REL: An Entity Linker Standing on the Shoulders of Giants. SIGIR 2020 - Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2197–2200.

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. 2016. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017-Janua:1988–1997.

Nora Kassner, Benno Krojer, and Hinrich Schütze. 2020. Are Pretrained Language Models Symbolic Reasoners over Knowledge? In Proceedings of the 24th Conference on Computational Natural Language Learning, pages 552–564, Online. Association for Computational Linguistics.

Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, Michael S Bernstein, and Li Fei-Fei. 2016. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision, 123(1):32–73.

Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2020. What Does BERT with Vision Look At? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5265–5275, Online. Association for Computational Linguistics.

Tsung Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), PART 5, pages 740–755. Springer Verlag.

Ilya Loshchilov and Frank Hutter. 2017. Fixing Weight Decay Regularization in Adam. CoRR, abs/1711.0.

Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Advances in Neural Information Processing Systems, volume 32, pages 13–23. Curran Associates, Inc.

Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 2020. 12-in-1: Multi-Task Vision and Language Representation Learning. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. 2016. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems, volume 29, pages 289–297. Curran Associates, Inc.

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. 2019. OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019-June:3190–3199.

Vicente Ordonez, Girish Kulkarni, and Tamara Berg. 2011. Im2Text: Describing Images Using 1 Million Captioned Photographs. In Advances in Neural Information Processing Systems, volume 24, pages 1143–1151. Curran Associates, Inc.

Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. 2020. Language models as knowledge bases? In EMNLP-IJCNLP 2019 - 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, pages 2463–2473. Association for Computational Linguistics.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain Yourself! Leveraging Language Models for Commonsense Reasoning. ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pages 4932–4942.

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2017. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149.

Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 07-12-June, pages 815–823.

Sanket Shah, Anand Mishra, Naganand Yadati, and Partha Pratim Talukdar. 2019. KVQA: Knowledge-Aware Visual Question Answering. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):8876–8884.

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers), volume 1, pages 2556–2565. Association for Computational Linguistics (ACL).

Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. arXiv.

Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2020. A corpus for reasoning about natural language grounded in photographs. In ACL 2019 - 57th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, pages 6418–6428. Association for Computational Linguistics (ACL).

Alexandre Tamborrino, Nicola Pellicanò, Baptiste Pannier, Pascal Voitot, and Louise Naudin. 2020. Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3878–3887, Online. Association for Computational Linguistics.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78–85.

Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton Van Den Hengel. 2017. Explicit knowledge-based reasoning for visual question answering. In IJCAI International Joint Conference on Artificial Intelligence, volume 0, pages 1290–1296. International Joint Conferences on Artificial Intelligence.

Peng Wang, Qi Wu, Chunhua Shen, Anton van den Hengel, and Anthony Dick. 2016. FVQA: Fact-based Visual Question Answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(10):2413–2427.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory Networks. 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. HuggingFace's Transformers: State-of-the-art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition to cognition: Visual commonsense reasoning. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2019-June, pages 6713–6724.

Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. 2016. Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks. IEEE Signal Processing Letters, 23(10):1499–1503.
