A Simple Baseline For Knowledge-Based Visual Question Answering
Recent approaches to Knowledge-Based Visual Question Answering (KB-VQA) have highlighted the benefit of incorporating both explicit (through external databases) and implicit (through LLMs) knowledge to answer questions requiring external knowledge effectively. A common limitation of such approaches is that they consist of relatively complicated pipelines and often heavily rely on accessing the GPT-3 API. Our main contribution in this paper is to propose a much simpler and readily reproducible pipeline which, in a nutshell, is based on efficient in-context learning by prompting LLaMA (1 and 2) using question-informative captions as contextual information. Contrary to recent approaches, our method is training-free, does not require access to external databases or APIs, and yet achieves state-of-the-art accuracy on the OK-VQA and A-OK-VQA datasets. Finally, we perform several ablation studies to understand important aspects of our method. Our code is publicly available at https://github.com/alexandrosXe/A-Simple-Baseline-For-Knowledge-Based-VQA

1 Introduction

Knowledge-based VQA (KB-VQA) is a recently introduced VQA task (Wang et al., 2017, 2018; Marino et al., 2019; Shah et al., 2019) where the image alone is not sufficient to answer the given question, but effective utilization of external knowledge resources is additionally required. To solve such a task, a model would need not only strong visual perception but also reasoning capabilities, while also being able to effectively incorporate world knowledge from external KBs (e.g. Wikipedia) and LLMs. Systems capable of answering general and diverse questions about the visual world find a wide range of applications: from personal assistants to aids for the visually impaired and robotics.¹

Existing approaches, however, typically require complicated pipelines. Firstly, a KB (e.g. Wikidata) covering world knowledge needs to be maintained and used for knowledge retrieval, which is time-consuming and very sensitive to noise. Secondly, powerful LLMs such as GPT-3 (Brown et al., 2020) or OPT-175B (Zhang et al., 2022) are leveraged due to the huge amount of implicit knowledge stored in their parameters and their powerful reasoning capabilities through few-shot in-context learning. However, the computational or even actual monetary cost (e.g. the cost of API access) associated with accessing such models renders them unaffordable for many researchers. Thirdly, it is crucial to train a fusion mechanism that can effectively reason by combining the retrieved explicit and implicit knowledge.

Main contributions: We present a simple yet powerful pipeline for KB-VQA which bypasses the need for most of the components of the above-mentioned systems. Specifically, the proposed system is simply based on few-shot prompting of LLaMA-13B (Touvron et al., 2023a,b). The key component of our method is the implementation of effective in-context learning using question-informative captions as contextual information which, as we show, results in large accuracy boosts.

The proposed system features several advantages: (1) it is entirely training-free, requiring only a few examples for in-context learning; (2) it is based on the open-source LLaMA-13B (Touvron et al., 2023a,b), which is considerably smaller than the widely-used GPT-3; (3) it is straightforward to reproduce; and (4) it achieves state-of-the-art (SOTA) accuracy on the widely-used OK-VQA (Marino et al., 2019) and A-OK-VQA (Schwenk et al., 2022) datasets.

¹ https://www.adelaide.edu.au/aiml/our-research/machine-learning/vqa-vision-and-language
2 Related Work on KB-VQA

Methods Without LLMs: Several methods have been proposed, including KRISP (Marino et al., 2021), which uses a multi-modal pretrained BERT (Devlin et al., 2019), MAVEx (Wu et al., 2022), which proposes to validate promising answer candidates based on answer-specific knowledge retrieval, and DPR, which uses pseudo-relevance labels integrated with answer generation for end-to-end training. Typically, these systems are not as competitive as the ones based on LLMs.

Methods based on LLMs: PICa (Yang et al., 2022) is the first method to adopt GPT-3 for solving the KB-VQA task in a few-shot manner by just providing a few in-context VQA examples. Gui et al. (2022) proposed to use both implicit (i.e. GPT-3) and explicit (i.e. KBs) knowledge based on CLIP retrieval (Radford et al., 2021), which are combined by a novel fusion module called KAT (based on T5 or Bart). Lin et al. (2022) proposed to integrate local visual features and positional information (bounding box coordinates), together with retrieved external and implicit knowledge (using GPT-3), into a transformer-based question-answering model. Hu et al. (2023) proposed PromptCap, a novel task-aware captioning model that uses a natural language prompt to control the generation of the visual content, which can be used in conjunction with GPT-3 in-context learning. Img2Prompt (Guo et al., 2023) is a zero-shot VQA method that generates image-relevant exemplar prompts for the LLM. Their key insight is that synthetic question-answer pairs can be generated using image captioning and question-generation techniques as in-context exemplars from the provided image. Prophet (Shao et al., 2023) proposes to prompt GPT-3 with answer heuristics (answer candidates and answer-aware examples) that are encoded into the prompts to enable GPT-3 to better comprehend the task, thus enhancing its capacity.

3 Methodology

While explicit knowledge retrieval focuses on semantic matching between an image and knowledge entries, it lacks implicit commonsense knowledge (e.g. lemons are sour) which can be found in LLMs (Gui et al., 2022). LLMs are critical in extracting implicit knowledge due to the vast amount of implicit information embedded in their parameters, and their powerful reasoning capacity through few-shot in-context learning. Different from previous work (Yang et al., 2022; Gui et al., 2022; Lin et al., 2022), we leverage the open-source LLM LLaMA-13B (Touvron et al., 2023a,b) instead of GPT-3 as an implicit language knowledge base and treat VQA as an open-ended text generation task.

Our method builds upon the pipeline of PICa, the pioneering work that utilizes GPT-3 for few-shot in-context learning in order to address the KB-VQA task. GPT-3 is a decoder-only autoregressive LLM of 175B parameters, trained on a diverse range of data sources, including Common Crawl, webtexts, books, and Wikipedia (Brown et al., 2020). During inference, in-context few-shot learning involves formulating a novel downstream task as a text sequence generation task using the frozen GPT-3 model. When provided with a testing input x, the target y is predicted based on a formatted prompt p(h, C, E, c, x). In this prompt, h represents a prompt head or instruction that describes the task, while E = {e_1, e_2, ..., e_n} represents a set of n in-context examples (shots), where each e_i = (x_i, y_i) is an input-target pair of the task, with x_i and y_i being the input and target, respectively. These pairs are constructed manually or sampled from the training set. C = {c_1, c_2, ..., c_n} represents a set of generic image captions describing each x_i, since images cannot be directly input to GPT-3. The caption for the test input is denoted as c. The target y is a text sequence of L tokens, y = (y^1, y^2, ..., y^L). At each decoding step t, the next token is selected as

    ŷ^t = argmax_{y^t} p_LLM(y^t | p, ŷ^{<t})    (1)
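To make the prompt p(h, C, E, c, x) of Eq. (1) concrete, the following minimal Python sketch assembles such a prompt as plain text before it is passed to a frozen LLM. The instruction wording and the Context/Question/Answer labels are illustrative assumptions on our part, not the exact template used by PICa or by this paper.

    # Sketch of assembling a PICa-style prompt p(h, C, E, c, x).
    # The instruction text and field labels below are illustrative assumptions.

    def build_prompt(head, examples, test_captions, test_question):
        """examples: list of (captions, question, answer) triples."""
        parts = [head, ""]
        for captions, question, answer in examples:
            parts.append(f"Context: {', '.join(captions)}")
            parts.append(f"Question: {question}")
            parts.append(f"Answer: {answer}")
            parts.append("")
        # The test input follows the same format, with the answer left open
        # so that the frozen LLM completes it by open-ended text generation.
        parts.append(f"Context: {', '.join(test_captions)}")
        parts.append(f"Question: {test_question}")
        parts.append("Answer:")
        return "\n".join(parts)

    if __name__ == "__main__":
        head = "Please answer the question according to the context."
        shots = [(["a lemon cut in half on a wooden table"],
                  "What does this fruit taste like?", "sour")]
        print(build_prompt(head, shots,
                           ["a red stop sign at a street corner"],
                           "What must a driver do at this sign?"))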
In order to utilize any LLM for the knowledge-based VQA task, the crucial step is to design suitable prompts. When given a question q_i and an image v_i as inputs, the VQA task's objective is to predict the corresponding answer a_i. However, since LLMs do not inherently comprehend images, it becomes necessary to convert the image into a caption c_i using a pre-existing captioning model. While SOTA pretrained captioning models have demonstrated impressive performance, they are primarily optimized to generate generic image captions. Unfortunately, these captions often fail to capture all the specific details required to accurately answer a given question about the image. In this work, instead of generic captions, we generate question-guided informative image captions using the Plug-and-Play VQA (PNPVQA) framework (Tiong et al., 2022), which identifies the image patches most related to the question with a saliency map-based interpretability technique and generates captions from these patches only.
Figure 1: Inference-time of our method for n-shot VQA. The input prompt to LLaMA consists of a prompt head h (blue box), n in-context examples ({c_i, x_i, y_i}_{i=1}^n) (red boxes), and the VQA input {c, x} (green box). The answer y is produced in an open-ended text generation manner. In this example we use two question-informative captions per example (separated by commas).

Method | Knowledge Resources | Acc (%)
KRISP | Wikipedia + ConceptNet | 38.35
MAVEx | Wikipedia + ConceptNet + Google Images | 39.4
Unified-IO (2.8B) | Multimodal Pretraining | 54
Flamingo (80B) | Multimodal Pretraining | 57.8
PICa-Full | Frozen GPT-3 (175B) | 48.0
KAT_base (single) | Wikidata + Frozen GPT-3 (175B) | 50.58
KAT_large (single) | Wikidata + Frozen GPT-3 (175B) | 53.09
KAT_large (ensemble) | Wikidata + Frozen GPT-3 (175B) | 54.41
REVIVE_large (single) | Wikidata + Frozen GPT-3 (175B) | 56.6
REVIVE_large (ensemble) | Wikidata + Frozen GPT-3 (175B) | 58.0
Prophet | Frozen GPT-3 (175B) | 61.1
Ours | Frozen LLaMA (13B) | 58.69
Ours + MCAN | Frozen LLaMA (13B) | 60.02
Ours | Frozen LLaMA 2 (13B) | 59.07
Ours + MCAN | Frozen LLaMA 2 (13B) | 61.2

Table 1: Comparison with other methods on the OK-VQA dataset: Our method with 9 question-informative captions achieves state-of-the-art performance.

For each image-question pair, we first generate 50 question-guided informative image captions from the image v_i using PNPVQA. We then employ BLIP's (Li et al., 2022) text encoder to encode all the image captions and BLIP's image encoder to encode the image v_i. We rank the image captions per image v_i according to their cosine similarity with the image v_i and keep the top-m most similar captions c_i per example. After extracting the top-m most similar captions per image v_i, we construct a carefully designed text prompt consisting of a general instruction sentence, the captions C, the question, the test input's captions c, and a set of context-question-answer triplets (shots) taken from the training dataset that are semantically most similar to the current image-question pair (see Fig. 1). This text prompt is then passed to a frozen LLaMA-13B model, and in-context few-shot learning is performed in order to obtain its output as a promising answer candidate for the current image-question pair.
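As a rough sketch of the caption-ranking step described above, the snippet below assumes that PNPVQA has already produced the 50 candidate captions for an image and that BLIP caption and image embeddings are available as precomputed arrays (the encoders themselves are not shown); only the cosine-similarity ranking and top-m selection are illustrated. The default m = 9 reflects the best-performing setting reported later, but is an adjustable choice here.

    import numpy as np

    def top_m_captions(captions, caption_embs, image_emb, m=9):
        """Keep the m captions most similar to the image.

        captions:     the 50 question-guided captions produced by PNPVQA
        caption_embs: (50, d) array of BLIP text-encoder embeddings of the captions
        image_emb:    (d,) BLIP image-encoder embedding of the same image
        """
        caption_embs = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
        image_emb = image_emb / np.linalg.norm(image_emb)
        sims = caption_embs @ image_emb          # cosine similarity per caption
        keep = np.argsort(-sims)[:m]             # indices of the top-m captions
        return [captions[i] for i in keep]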
3.1 Selecting Informing Examples For Few-Shot In-Context Learning

As Yang et al. (2022) notes, feeding more in-context examples to GPT-3 yields better few-shot performance. However, the maximum input length of the model constrains the maximum number of examples n in the prompt. To make better use of the available examples we: (i) improve the example quality by careful in-context example selection (Liu et al., 2022; Gui et al., 2022; Shao et al., 2023), and (ii) use more examples via multi-query ensemble.

In-context Example Selection tries to search for the best examples for each inference-time input x among all available examples (Yang et al., 2022). We consider in-context examples that have similar question features to x. More specifically, given an inference-time question, we use BLIP's text encoder to obtain its textual feature and compute its cosine similarity with the questions in all available in-context examples. We then average the question text similarity with the image visual similarity to guide the example selection, similarly to Yang et al. (2022). We select the top-n questions with the highest similarity and use the corresponding examples as the in-context examples.
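A minimal sketch of this selection rule is shown below. It assumes that BLIP features for the test question and image, and for all training questions and images, have been precomputed; the encoders themselves are omitted, and the function simply averages the two cosine similarities and returns the indices of the top-n examples.

    import numpy as np

    def select_shots(test_q_emb, test_img_emb, train_q_embs, train_img_embs, n=10):
        """Return indices of the n training examples most similar to the test sample.

        Similarity is the average of (i) cosine similarity between the BLIP text
        features of the test question and each training question, and (ii) cosine
        similarity between the corresponding BLIP image features.
        """
        def normalize(x):
            return x / np.linalg.norm(x, axis=-1, keepdims=True)

        q_sim = normalize(train_q_embs) @ normalize(test_q_emb)
        v_sim = normalize(train_img_embs) @ normalize(test_img_emb)
        avg_sim = (q_sim + v_sim) / 2.0
        return np.argsort(-avg_sim)[:n]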
Multi-query ensemble: Given an inference-time example x, we use k × n in-context examples to generate k prompts. This way, we prompt LLaMA-13B k times and obtain k answer predictions instead of just one, similarly to Yang et al. (2022), where k is the number of queries to ensemble. Finally, among the k answer predictions, we select the one with the most occurrences (majority vote).
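The multi-query ensemble reduces to a majority vote over k independent generations. The compact sketch below abstracts the prompt construction and the LLaMA call behind hypothetical callables (build_prompt_fn and answer_fn are stand-ins for the components sketched earlier, not functions from the paper's codebase).

    from collections import Counter

    def multi_query_answer(shot_groups, build_prompt_fn, answer_fn):
        """Majority vote over k prompts built from k groups of n shots each.

        shot_groups:     list of k lists of in-context examples (k * n shots in total)
        build_prompt_fn: maps one group of shots to a full text prompt
        answer_fn:       maps a prompt to the answer string generated by the LLM
        """
        predictions = [answer_fn(build_prompt_fn(shots)) for shots in shot_groups]
        return Counter(predictions).most_common(1)[0][0]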
4 Experimental Results

Comparative results on OK-VQA: Table 1 summarizes the results of various methods on OK-VQA, including our best method (last row) which uses 9 question-informative captions and 5 query ensembles. When using LLaMA, our approach outperforms all methods and achieves comparable results with Prophet, especially when using the same shot selection strategy based on MCAN (Yu et al., 2019). Moreover, it performs better than Unified-IO and the 80B Flamingo, which have been pre-trained with multimodal objectives. When compared to methods that rely on GPT-3 for implicit knowledge extraction, our approach outperforms PICa-Full, which only uses generic image captions, by 12.02%, while outperforming the SOTA supervised methods KAT and REVIVE by 5.61% and 2.02% respectively. Finally, when using LLaMA 2 and the MCAN-based shot selection strategy, our method achieves a state-of-the-art accuracy of 61.2%.

Method | DA Val | DA Test | MC Val | MC Test
ClipCap | 30.9 | 25.9 | 56.9 | 51.4
ViLBERT | 30.6 | 25.9 | 49.1 | 41.5
LXMERT | 30.7 | 25.9 | 51.4 | 41.6
KRISP | 33.7 | 27.1 | 51.9 | 42.2
GPV-2 | 48.6 | 40.7 | 60.3 | 53.7
Unified-IO | - | 45.2 | - | -
Prophet | 58.2 | 55.7 | 59.3 | 57.3
Ours (LLaMA) | 54.4 | 53.8 | - | -
Ours + MCAN (LLaMA) | 57.4 | 55.0 | - | -
Ours (LLaMA 2) | 57.1 | 55.4 | - | -
Ours + MCAN (LLaMA 2) | 58.6 | 57.5 | - | -

Table 2: Comparison with other methods on the A-OK-VQA dataset: Our method with 9 question-informative captions achieves state-of-the-art performance in the direct answer (DA) setting. Note that our method does not support multiple-choice (MC).

Comparative results on A-OK-VQA: Table 2 summarizes the results of various methods on A-OK-VQA, including our best method (last row) which uses 9 question-informative captions and 5 query ensembles. We compare our method to the strong baselines in (Schwenk et al., 2022) and the current state-of-the-art method Prophet (Shao et al., 2023). When employing LLaMA, our approach surpasses all other methods in the DA setting and achieves comparable results to Prophet, particularly when employing the same shot selection strategy based on MCAN. Finally, with LLaMA 2 and MCAN our method attains state-of-the-art performance on both the validation and test sets, achieving 58.6% and 57.5% accuracy respectively, demonstrating the effectiveness and robust generalization of our proposed method.

5 Ablation Studies

We conduct several ablations on OK-VQA to better understand the key components of our method.

Captions | n | k | Acc (%)
Generic | 14 | 5 | 43.35
Question-informative | 14 | 5 | 57.56

Table 3: Generic vs. question-informative captions.

Effect of question-informative captions: Table 3 shows the performance of our method when using generic captions vs. question-informative captions for in-context learning, which is the key component of our system. Following Yang et al. (2022) and Shao et al. (2023), we leverage OSCAR+ (Zhang et al., 2021) as the captioning model. The results suggest that using question-informative captions results in huge accuracy boosts (43.35% vs. 57.56%).

Shot Selection Strategy | Captions | m | n | k | Acc (%)
Random | Question-informative | 1 | 14 | 5 | 53.19
Avg. Question and Image Sim. | Question-informative | 1 | 14 | 5 | 56.50
MCAN latent space | Question-informative | 1 | 14 | 5 | 57.56

Table 4: Accuracy when using different shot selection strategies. The Avg. Question and Image Sim. strategy retrieves shots based on the average cosine similarity between the test sample's question and image and the training examples' question and image. The MCAN latent space strategy retrieves shots that are closest to the test sample in the trained MCAN's latent space.

Effect of shot selection strategy: Table 4 shows that selecting random shots during in-context learning hurts the accuracy, confirming the findings of Yang et al. (2022). Retrieving shots based on the similarity between the test sample and the training examples yields a significant accuracy boost. Prophet's shot selection strategy based on MCAN also seems to be effective, but we note that it is based on pre-training a vanilla VQA model on a different dataset (VQA-v2).

Effect of number of question-informative captions: Fig. 2 (a) shows the accuracy as we increase the number of captions per sample in the prompt during in-context learning. Here we use k = 5, with n = 10 when using 1-10 captions and n = 5 when using more than 10 captions, due to maximum sequence length constraints. More captions provide more information for each example, helping the model make a more accurate prediction based on context. As shown in the figure, the validation accuracy keeps increasing up to 60.02%. When using more than 10 captions the accuracy decreases, but this can also be attributed to the fact that we are also decreasing the number of shots to 5.
Figure 2: (a) Accuracy vs. number of question-informative captions used per shot during few-shot in-context learning. (b) Accuracy vs. number of prompts k used during in-context learning.

Effect of multi-query ensemble: Fig. 2 (b) shows the accuracy as the number of prompts, k, increases. As anticipated, employing multiple prompts of LLaMA instead of just one yields improved accuracy. However, beyond k = 6, the accuracy begins to fluctuate. It is important to note that this fluctuation could be attributed to the retrieval of noisy (irrelevant to the question) context examples as the value of k increases.

Effect of explicit knowledge: We also tried to use KAT's (Gui et al., 2022) KB and trained a T5 (Raffel et al., 2020) in order to integrate explicit knowledge into our model. For each image, we used BLIP to extract explicit knowledge via image-to-text retrieval. We used 40 retrieved passages and LLaMA predictions as explicit and implicit knowledge, respectively. We achieved an accuracy of 58.70%, which shows that our model does not benefit from such an approach.

Effect of size of LLM: We also used a LLaMA-7B model using 9 question-informative captions, n = 10 and k = 5. Reducing the size of the LLM leads to decreased accuracy, but the drop is not large, still obtaining 57.99% accuracy.

6 Conclusions

We proposed a simple yet effective baseline for KB-VQA. Our training-free method is based on in-context few-shot learning of the open-source LLaMA using question-informative captions. We show that this is sufficient to achieve SOTA results on the widely used OK-VQA and A-OK-VQA datasets.

Limitations

It is important to acknowledge that we have not explored the utilization of any other medium-sized LLMs apart from LLaMA, which presents a limitation of our study. Lastly, due to limitations in resources, we were unable to conduct experiments with larger sizes beyond 13B. However, it would indeed be intriguing to observe the performance when employing LLaMA models of sizes such as 30B or 65B.

Ethics Statement

The authors of this paper recognize the importance of responsible AI in both research and development endeavors. We are committed to ensuring that the model we develop is not only accurate but also fair and unbiased. We understand the potentially significant impact of VQA technology on society and, therefore, pledge to maintain transparency by sharing our findings and progress with relevant researchers and stakeholders.

References

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Liangke Gui, Borui Wang, Qiuyuan Huang, Alexander Hauptmann, Yonatan Bisk, and Jianfeng Gao. 2022. KAT: A knowledge augmented transformer for vision-and-language. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 956–968, Seattle, United States. Association for Computational Linguistics.

Jiaxian Guo, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Boyang Li, Dacheng Tao, and Steven C. H. Hoi. 2023. From images to textual prompts: Zero-shot VQA with frozen large language models.
Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, and Jiebo Luo. 2023. PromptCap: Prompt-guided task-aware image captioning.

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation.

Yuanze Lin, Yujia Xie, Dongdong Chen, Yichong Xu, Chenguang Zhu, and Lu Yuan. 2022. REVIVE: Regional visual representation matters in knowledge-based visual question answering.

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2022. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114, Dublin, Ireland and Online. Association for Computational Linguistics.

Kenneth Marino, Xinlei Chen, Devi Parikh, Abhinav Gupta, and Marcus Rohrbach. 2021. KRISP: Integrating implicit and symbolic knowledge for open-domain knowledge-based VQA. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14111–14121.

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. 2019. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1).

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. 2022. A-OKVQA: A benchmark for visual question answering using world knowledge. In Computer Vision – ECCV 2022, pages 146–162, Cham. Springer Nature Switzerland.

Sanket Shah, Anand Mishra, Naganand Yadati, and Partha Pratim Talukdar. 2019. KVQA: Knowledge-aware visual question answering. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):8876–8884.

Zhenwei Shao, Zhou Yu, Meng Wang, and Jun Yu. 2023. Prompting large language models with answer heuristics for knowledge-based visual question answering.

Anthony Meng Huat Tiong, Junnan Li, Boyang Li, Silvio Savarese, and Steven C.H. Hoi. 2022. Plug-and-play VQA: Zero-shot VQA by conjoining large pretrained models with zero training. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 951–967, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. LLaMA: Open and efficient foundation language models.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open foundation and fine-tuned chat models.

Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton van den Hengel. 2017. Explicit knowledge-based reasoning for visual question answering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17, pages 1290–1296. AAAI Press.

Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton van den Hengel. 2018. FVQA: Fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(10):2413–2427.

Jialin Wu, Jiasen Lu, Ashish Sabharwal, and Roozbeh Mottaghi. 2022. Multi-modal answer validation for knowledge-based VQA. Proceedings of the AAAI Conference on Artificial Intelligence, 36(3):2712–2721.

Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. 2022. An empirical study of GPT-3 for few-shot knowledge-based VQA. Proceedings of the AAAI Conference on Artificial Intelligence, 36(3):3081–3089.
Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. 2019. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5579–5588.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: Open pre-trained transformer language models.
A Example Appendix
A.1 Implementation Details
We used the Hugging Face Transformers library² to run the LLaMA models. During generation we used beam search with a beam size of 2 and at most 5 new tokens, keeping the default values for all other parameters of the generate method. We ran our model on a single A100 GPU with 40 GB of VRAM.

² https://huggingface.co/transformers/
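As a minimal sketch of this setup, the snippet below shows a beam-search generation call with the parameters listed above. The checkpoint identifier, half-precision loading, and device placement are our own assumptions rather than the authors' exact configuration.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "meta-llama/Llama-2-13b-hf"   # placeholder checkpoint identifier

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, torch_dtype=torch.float16
    ).to("cuda")
    model.eval()

    def generate_answer(prompt: str) -> str:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            # Beam search with beam size 2 and at most 5 new tokens, as in A.1;
            # every other argument of generate() keeps its default value.
            output = model.generate(**inputs, num_beams=2, max_new_tokens=5)
        new_tokens = output[0][inputs["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()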