A Simple Baseline For Knowledge-Based Visual Question Answering
Recent approaches to Knowledge-Based Visual Question Answering (KB-VQA) have highlighted the benefit of incorporating both explicit (through external databases) and implicit (through LLMs) knowledge to answer questions requiring external knowledge effectively. A common limitation of such approaches is that they consist of relatively complicated pipelines and often heavily rely on accessing the GPT-3 API. Our main contribution in this paper is to propose a much simpler and readily reproducible pipeline which, in a nutshell, is based on efficient in-context learning by prompting LLaMA (1 and 2) using question-informative captions as contextual information. Contrary to recent approaches, our method is training-free, does not require access to external databases or APIs, and yet achieves state-of-the-art accuracy on the OK-VQA and A-OK-VQA datasets. Finally, we perform several ablation studies to understand important aspects of our method. Our code is publicly available at https://github.com/alexandrosXe/A-Simple-Baseline-For-Knowledge-Based-VQA

1 Introduction

Knowledge-based VQA (KB-VQA) is a recently introduced VQA task (Wang et al., 2017, 2018; Marino et al., 2019; Shah et al., 2019) where the image alone is not sufficient to answer the given question, but effective utilization of external knowledge resources is additionally required. To solve such a task, a model would need not only strong visual perception but also reasoning capabilities, while also being able to effectively incorporate world knowledge from external KBs (e.g. Wikipedia) and LLMs. Systems capable of answering general and diverse questions about the visual world find a wide range of applications: from personal assistants to aids for the visually impaired and robotics.¹

Existing approaches, however, typically require complicated pipelines. Firstly, a KB (e.g. Wikidata) covering world knowledge needs to be maintained and used for knowledge retrieval, which is time-consuming and very sensitive to noise. Secondly, powerful LLMs such as GPT-3 (Brown et al., 2020) or OPT-175B (Zhang et al., 2022) are leveraged due to the huge amount of implicit knowledge stored in their parameters and their powerful reasoning capabilities through few-shot in-context learning. However, the computational or even actual monetary cost (e.g. the cost of API access) associated with accessing such models renders them unaffordable for many researchers. Thirdly, it is crucial to train a fusion mechanism that can effectively reason by combining the retrieved explicit and implicit knowledge.

Main contributions: We present a simple yet powerful pipeline for KB-VQA which bypasses the need for most of the components of the above-mentioned systems. Specifically, the proposed system is simply based on few-shot prompting of LLaMA-13B (Touvron et al., 2023a,b). The key component of our method is the implementation of effective in-context learning using question-informative captions as contextual information which, as we show, results in large accuracy boosts.

The proposed system features several advantages: (1) it is entirely training-free, requiring only a few examples for in-context learning; (2) it is based on the open-source LLaMA-13B (Touvron et al., 2023a,b), which is considerably smaller than the widely-used GPT-3; (3) it is straightforward to reproduce; and (4) it achieves state-of-the-art (SOTA) accuracy on the widely-used OK-VQA (Marino et al., 2019) and A-OK-VQA (Schwenk et al., 2022) datasets.

¹ https://www.adelaide.edu.au/aiml/our-research/machine-learning/vqa-vision-and-language
2 Related Work on KB-VQA

Methods Without LLMs: Several methods have been proposed, including KRISP (Marino et al., 2021), which uses a multi-modal pretrained BERT (Devlin et al., 2019), MAVEx (Wu et al., 2022), which proposes to validate promising answer candidates based on answer-specific knowledge retrieval, and DPR, which uses pseudo-relevance labels integrated with answer generation for end-to-end training. Typically, these systems are not as competitive as the ones based on LLMs.

Methods based on LLMs: PICa (Yang et al., 2022) is the first method to adopt GPT-3 for solving the KB-VQA task in a few-shot manner by just providing a few in-context VQA examples. Gui et al. (2022) proposed to use both implicit (i.e. GPT-3) and explicit (i.e. KBs) knowledge based on CLIP retrieval (Radford et al., 2021), which are combined by a novel fusion module called KAT (based on T5 or Bart). Lin et al. (2022) proposed to integrate local visual features and positional information (bounding box coordinates), together with retrieved external and implicit knowledge (using GPT-3), into a transformer-based question-answering model. Hu et al. (2023) proposed PromptCap, a novel task-aware captioning model that uses a natural language prompt to control the generation of the visual content, which can be used in conjunction with GPT-3 in-context learning. Img2Prompt (Guo et al., 2023) is a zero-shot VQA method that generates image-relevant exemplar prompts for the LLM. Their key insight is that synthetic question-answer pairs can be generated using image captioning and question-generation techniques as in-context exemplars from the provided image. Prophet (Shao et al., 2023) proposes to prompt GPT-3 with answer heuristics (answer candidates and answer-aware examples) that are encoded into the prompts to enable GPT-3 to better comprehend the task, thus enhancing its capacity.

3 Methodology

While explicit knowledge retrieval focuses on semantic matching between an image and knowledge entries, it lacks implicit commonsense knowledge (e.g. lemons are sour) which can be found in LLMs (Gui et al., 2022). LLMs are critical in extracting implicit knowledge due to the vast amount of implicit information embedded in their parameters, and their powerful reasoning capacity through few-shot in-context learning. Different from previous work (Yang et al., 2022; Gui et al., 2022; Lin et al., 2022), we leverage the open-source LLM LLaMA-13B (Touvron et al., 2023a,b) instead of GPT-3 as an implicit language knowledge base and treat VQA as an open-ended text generation task.

Our method builds upon the pipeline of PICa, the pioneering work that utilizes GPT-3 for few-shot in-context learning in order to address the KB-VQA task. GPT-3 is a decoder-only autoregressive LLM of 175B parameters, trained on a diverse range of data sources, including Common Crawl, webtexts, books, and Wikipedia (Brown et al., 2020). During inference, in-context few-shot learning involves formulating a novel downstream task as a text sequence generation task using the frozen GPT-3 model. When provided with a testing input x, the target y is predicted based on a formatted prompt p(h, C, E, c, x). In this prompt, h represents a prompt head or instruction that describes the task, while E = {e_1, e_2, ..., e_n} represents a set of n in-context examples (shots), where each e_i = (x_i, y_i) is an input-target pair of the task, with x_i and y_i being the input and target, respectively. These pairs are constructed manually or sampled from the training set. C = {c_1, c_2, ..., c_n} represents a set of generic image captions describing each x_i, since images cannot be directly input to GPT-3. The caption for the test input is denoted as c. The target y is a text sequence of L tokens, y = (y^1, y^2, ..., y^L). At each decoding step t, the next token is selected as

    ŷ^t = argmax_{y^t} p_LLM(y^t | p, ŷ^{<t})    (1)
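To make the prompt p(h, C, E, c, x) of Eq. (1) concrete, the following minimal Python sketch assembles such a prompt as plain text before it is passed to a frozen LLM. The instruction wording and the Context/Question/Answer labels are illustrative assumptions on our part, not the exact template used by PICa or by this paper.

    # Sketch of assembling a PICa-style prompt p(h, C, E, c, x).
    # The instruction text and field labels below are illustrative assumptions.

    def build_prompt(head, examples, test_captions, test_question):
        """examples: list of (captions, question, answer) triples."""
        parts = [head, ""]
        for captions, question, answer in examples:
            parts.append(f"Context: {', '.join(captions)}")
            parts.append(f"Question: {question}")
            parts.append(f"Answer: {answer}")
            parts.append("")
        # The test input follows the same format, with the answer left open
        # so that the frozen LLM completes it by open-ended text generation.
        parts.append(f"Context: {', '.join(test_captions)}")
        parts.append(f"Question: {test_question}")
        parts.append("Answer:")
        return "\n".join(parts)

    if __name__ == "__main__":
        head = "Please answer the question according to the context."
        shots = [(["a lemon cut in half on a wooden table"],
                  "What does this fruit taste like?", "sour")]
        print(build_prompt(head, shots,
                           ["a red stop sign at a street corner"],
                           "What must a driver do at this sign?"))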
In order to utilize any LLM for the knowledge-based VQA task, the crucial step is to design suitable prompts. When given a question q_i and an image v_i as inputs, the VQA task's objective is to predict the corresponding answer a_i. However, since LLMs do not inherently comprehend images, it becomes necessary to convert the image into a caption c_i using a pre-existing captioning model. While SOTA pretrained captioning models have demonstrated impressive performance, they are primarily optimized to generate generic image captions. Unfortunately, these captions often fail to capture all the specific details required to accurately answer a given question about the image. In this work, instead of generic captions, we generate question-guided informative image captions using the Plug-and-Play VQA (PNPVQA) framework (Tiong et al., 2022), which identifies the image patches most related to the question with a saliency map-based interpretability technique and generates captions from these patches only.
Figure 1: Inference-time of our method for n-shot VQA. The input prompt to LLaMA consists of a prompt head h (blue box), n in-context examples ({c_i, x_i, y_i}_{i=1}^n) (red boxes), and the VQA input {c, x} (green box). The answer y is produced in an open-ended text generation manner. In this example we use two question-informative captions per example (separated by commas).

Method | Knowledge Resources | Acc (%)
KRISP | Wikipedia + ConceptNet | 38.35
MAVEx | Wikipedia + ConceptNet + Google Images | 39.4
Unified-IO (2.8B) | Multimodal Pretraining | 54
Flamingo (80B) | Multimodal Pretraining | 57.8
PICa-Full | Frozen GPT-3 (175B) | 48.0
KAT_base (single) | Wikidata + Frozen GPT-3 (175B) | 50.58
KAT_large (single) | Wikidata + Frozen GPT-3 (175B) | 53.09
KAT_large (ensemble) | Wikidata + Frozen GPT-3 (175B) | 54.41
REVIVE_large (single) | Wikidata + Frozen GPT-3 (175B) | 56.6
REVIVE_large (ensemble) | Wikidata + Frozen GPT-3 (175B) | 58.0
Prophet | Frozen GPT-3 (175B) | 61.1
Ours | Frozen LLaMA (13B) | 58.69
Ours + MCAN | Frozen LLaMA (13B) | 60.02
Ours | Frozen LLaMA 2 (13B) | 59.07
Ours + MCAN | Frozen LLaMA 2 (13B) | 61.2

Table 1: Comparison with other methods on the OK-VQA dataset: Our method with 9 question-informative captions achieves state-of-the-art performance.

For each image-question pair, we first generate 50 question-guided informative image captions from the image v_i using PNPVQA. We then employ BLIP's (Li et al., 2022) text encoder to encode all the image captions and BLIP's image encoder to encode the image v_i. We rank the image captions per image v_i according to their cosine similarity with the image v_i and keep the top-m most similar captions c_i per example. After extracting the top-m most similar captions per image v_i, we construct a carefully designed text prompt consisting of a general instruction sentence, the captions C, the question, the test input's captions c, and a set of context-question-answer triplets (shots) taken from the training dataset that are semantically most similar to the current image-question pair (see Fig. 1). This text prompt is then passed to a frozen LLaMA-13B model, and in-context few-shot learning is performed in order to obtain its output as a promising answer candidate for the current image-question pair.
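As a rough sketch of the caption-ranking step described above, the snippet below assumes that PNPVQA has already produced the 50 candidate captions for an image and that BLIP caption and image embeddings are available as precomputed arrays (the encoders themselves are not shown); only the cosine-similarity ranking and top-m selection are illustrated. The default m = 9 reflects the best-performing setting reported later, but is an adjustable choice here.

    import numpy as np

    def top_m_captions(captions, caption_embs, image_emb, m=9):
        """Keep the m captions most similar to the image.

        captions:     the 50 question-guided captions produced by PNPVQA
        caption_embs: (50, d) array of BLIP text-encoder embeddings of the captions
        image_emb:    (d,) BLIP image-encoder embedding of the same image
        """
        caption_embs = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
        image_emb = image_emb / np.linalg.norm(image_emb)
        sims = caption_embs @ image_emb          # cosine similarity per caption
        keep = np.argsort(-sims)[:m]             # indices of the top-m captions
        return [captions[i] for i in keep]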
3.1 Selecting Informing Examples For Few-Shot In-Context Learning

As Yang et al. (2022) notes, feeding more in-context examples to GPT-3 yields better few-shot performance. However, the maximum input length of the model constrains the maximum number of examples n in the prompt. To make better use of the available examples we: (i) improve the example quality by careful in-context example selection (Liu et al., 2022; Gui et al., 2022; Shao et al., 2023), and (ii) use more examples via multi-query ensemble.

In-context Example Selection tries to search for the best examples for each inference-time input x among all available examples (Yang et al., 2022). We consider in-context examples that have similar question features to x. More specifically, given an inference-time question, we use BLIP's text encoder to obtain its textual feature and compute its cosine similarity with the questions in all available in-context examples. We then average the question text similarity with the image visual similarity to guide the example selection, similarly to Yang et al. (2022). We select the top-n questions with the highest similarity and use the corresponding examples as the in-context examples.
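A minimal sketch of this selection rule is shown below. It assumes that BLIP features for the test question and image, and for all training questions and images, have been precomputed; the encoders themselves are omitted, and the function simply averages the two cosine similarities and returns the indices of the top-n examples.

    import numpy as np

    def select_shots(test_q_emb, test_img_emb, train_q_embs, train_img_embs, n=10):
        """Return indices of the n training examples most similar to the test sample.

        Similarity is the average of (i) cosine similarity between the BLIP text
        features of the test question and each training question, and (ii) cosine
        similarity between the corresponding BLIP image features.
        """
        def normalize(x):
            return x / np.linalg.norm(x, axis=-1, keepdims=True)

        q_sim = normalize(train_q_embs) @ normalize(test_q_emb)
        v_sim = normalize(train_img_embs) @ normalize(test_img_emb)
        avg_sim = (q_sim + v_sim) / 2.0
        return np.argsort(-avg_sim)[:n]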
Multi-query ensemble: Given an inference-time example x, we use k × n in-context examples to generate k prompts. This way, we prompt LLaMA-13B k times and obtain k answer predictions instead of just one, similarly to Yang et al. (2022), where k is the number of queries to ensemble. Finally, among the k answer predictions, we select the one with the most occurrences (majority vote).
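The multi-query ensemble reduces to a majority vote over k independent generations. The compact sketch below abstracts the prompt construction and the LLaMA call behind hypothetical callables (build_prompt_fn and answer_fn are stand-ins for the components sketched earlier, not functions from the paper's codebase).

    from collections import Counter

    def multi_query_answer(shot_groups, build_prompt_fn, answer_fn):
        """Majority vote over k prompts built from k groups of n shots each.

        shot_groups:     list of k lists of in-context examples (k * n shots in total)
        build_prompt_fn: maps one group of shots to a full text prompt
        answer_fn:       maps a prompt to the answer string generated by the LLM
        """
        predictions = [answer_fn(build_prompt_fn(shots)) for shots in shot_groups]
        return Counter(predictions).most_common(1)[0][0]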
4 Experimental Results

Comparative results on OK-VQA: Table 1 summarizes the results of various methods on OK-VQA, including our best method (last row) which uses 9 question-informative captions and 5 query ensembles. When using LLaMA, our approach outperforms all methods and achieves comparable results with Prophet, especially when using the same shot selection strategy based on MCAN (Yu et al., 2019). Moreover, it performs better than Unified-IO and the 80B Flamingo, which have been pre-trained with multimodal objectives. When compared to methods that rely on GPT-3 for implicit knowledge extraction, our approach outperforms PICa-Full, which only uses generic image captions, by 12.02%, while outperforming the SOTA supervised methods KAT and REVIVE by 5.61% and 2.02% respectively. Finally, when using LLaMA 2 and the MCAN-based shot selection strategy, our method achieves a state-of-the-art accuracy of 61.2%.

Method | DA Val | DA Test | MC Val | MC Test
ClipCap | 30.9 | 25.9 | 56.9 | 51.4
ViLBERT | 30.6 | 25.9 | 49.1 | 41.5
LXMERT | 30.7 | 25.9 | 51.4 | 41.6
KRISP | 33.7 | 27.1 | 51.9 | 42.2
GPV-2 | 48.6 | 40.7 | 60.3 | 53.7
Unified-IO | - | 45.2 | - | -
Prophet | 58.2 | 55.7 | 59.3 | 57.3
Ours (LLaMA) | 54.4 | 53.8 | - | -
Ours + MCAN (LLaMA) | 57.4 | 55.0 | - | -
Ours (LLaMA 2) | 57.1 | 55.4 | - | -
Ours + MCAN (LLaMA 2) | 58.6 | 57.5 | - | -

Table 2: Comparison with other methods on the A-OK-VQA dataset: Our method with 9 question-informative captions achieves state-of-the-art performance in the direct answer (DA) setting. Note that our method does not support multiple-choice (MC).

Comparative results on A-OK-VQA: Table 2 summarizes the results of various methods on A-OK-VQA, including our best method (last row) which uses 9 question-informative captions and 5 query ensembles. We compare our method to the strong baselines in (Schwenk et al., 2022) and the current state-of-the-art method Prophet (Shao et al., 2023). When employing LLaMA, our approach surpasses all other methods in the DA setting and achieves comparable results to Prophet, particularly when employing the same shot selection strategy based on MCAN. Finally, with LLaMA 2 and MCAN our method attains state-of-the-art performance on both the validation and test sets, achieving 58.6% and 57.5% accuracy respectively, demonstrating the effectiveness and robust generalization of our proposed method.

5 Ablation Studies

We conduct several ablations on OK-VQA to better understand the key components of our method.

Captions | n | k | Acc (%)
Generic | 14 | 5 | 43.35
Question-informative | 14 | 5 | 57.56

Table 3: Generic vs. question-informative captions.

Effect of question-informative captions: Table 3 shows the performance of our method when using generic captions vs. question-informative captions for in-context learning, which is the key component of our system. Following Yang et al. (2022) and Shao et al. (2023), we leverage OSCAR+ (Zhang et al., 2021) as the captioning model. The results suggest that using question-informative captions results in huge accuracy boosts (43.35% vs. 57.56%).

Shot Selection Strategy | Captions | m | n | k | Acc (%)
Random | Question-informative | 1 | 14 | 5 | 53.19
Avg. Question and Image Sim. | Question-informative | 1 | 14 | 5 | 56.50
MCAN latent space | Question-informative | 1 | 14 | 5 | 57.56

Table 4: Accuracy when using different shot selection strategies. The Avg. Question and Image Sim. strategy retrieves shots based on the average cosine similarity between the test sample's question and image and the training examples' question and image. The MCAN latent space strategy retrieves shots that are closest to the test sample in the trained MCAN's latent space.

Effect of shot selection strategy: Table 4 shows that selecting random shots during in-context learning hurts the accuracy, confirming the findings of Yang et al. (2022). Retrieving shots based on the similarity between the test sample and the training examples yields a significant accuracy boost. Prophet's shot selection strategy based on MCAN also seems to be effective, but we note that it is based on pre-training a vanilla VQA model on a different dataset (VQA-v2).

Effect of number of question-informative captions: Fig. 2 (a) shows the accuracy as we increase the number of captions per sample in the prompt during in-context learning. Here we use k = 5, with n = 10 when using 1-10 captions and n = 5 when using more than 10 captions, due to maximum sequence length constraints. More captions provide more information for each example, helping the model make a more accurate prediction based on context. As shown in the figure, the validation accuracy keeps increasing up to 60.02%. When using more than 10 captions the accuracy decreases, but this can also be attributed to the fact that we are also decreasing the number of shots to 5.
Figure 2: (a) Accuracy vs. number of question-informative captions used per shot during few-shot in-context learning. (b) Accuracy vs. number of prompts k used during in-context learning.

Effect of multi-query ensemble: Fig. 2 (b) shows the accuracy as the number of prompts, k, increases. As anticipated, employing multiple prompts of LLaMA instead of just one yields improved accuracy. However, beyond k = 6, the accuracy begins to fluctuate. It is important to note that this fluctuation could be attributed to the retrieval of noisy (irrelevant to the question) context examples as the value of k increases.

Effect of explicit knowledge: We also tried to use KAT's (Gui et al., 2022) KB and trained a T5 (Raffel et al., 2020) in order to integrate explicit knowledge into our model. For each image, we used BLIP to extract explicit knowledge via image-to-text retrieval. We used 40 retrieved passages and LLaMA predictions as explicit and implicit knowledge, respectively. We achieved an accuracy of 58.70%, which shows that our model does not benefit from such an approach.

Effect of size of LLM: We also used a LLaMA-7B model using 9 question-informative captions, n = 10 and k = 5. Reducing the size of the LLM leads to decreased accuracy, but the drop is not large, still obtaining 57.99% accuracy.

6 Conclusions

We proposed a simple yet effective baseline for KB-VQA. Our training-free method is based on in-context few-shot learning of the open-source LLaMA using question-informative captions. We show that this is sufficient to achieve SOTA results on the widely used OK-VQA and A-OK-VQA datasets.

Limitations

It is important to acknowledge that we have not explored the utilization of any other medium-sized LLMs apart from LLaMA, which presents a limitation of our study. Lastly, due to limitations in resources, we were unable to conduct experiments with larger sizes beyond 13B. However, it would indeed be intriguing to observe the performance when employing LLaMA models of sizes such as 30B or 65B.

Ethics Statement

The authors of this paper recognize the importance of responsible AI in both research and development endeavors. We are committed to ensuring that the model we develop is not only accurate but also fair and unbiased. We understand the potentially significant impact of VQA technology on society and, therefore, pledge to maintain transparency by sharing our findings and progress with relevant researchers and stakeholders.

References

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Liangke Gui, Borui Wang, Qiuyuan Huang, Alexander Hauptmann, Yonatan Bisk, and Jianfeng Gao. 2022. KAT: A knowledge augmented transformer for vision-and-language. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 956–968, Seattle, United States. Association for Computational Linguistics.

Jiaxian Guo, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Boyang Li, Dacheng Tao, and Steven C. H. Hoi. 2023. From images to textual prompts: Zero-shot VQA with frozen large language models.
Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A. Smith, and Jiebo Luo. 2023. PromptCap: Prompt-guided task-aware image captioning.

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation.

Yuanze Lin, Yujia Xie, Dongdong Chen, Yichong Xu, Chenguang Zhu, and Lu Yuan. 2022. REVIVE: Regional visual representation matters in knowledge-based visual question answering.

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2022. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114, Dublin, Ireland and Online. Association for Computational Linguistics.

Kenneth Marino, Xinlei Chen, Devi Parikh, Abhinav Gupta, and Marcus Rohrbach. 2021. KRISP: Integrating implicit and symbolic knowledge for open-domain knowledge-based VQA. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14111–14121.

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. 2019. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1).

Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. 2022. A-OKVQA: A benchmark for visual question answering using world knowledge. In Computer Vision – ECCV 2022, pages 146–162, Cham. Springer Nature Switzerland.

Sanket Shah, Anand Mishra, Naganand Yadati, and Partha Pratim Talukdar. 2019. KVQA: Knowledge-aware visual question answering. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):8876–8884.

Zhenwei Shao, Zhou Yu, Meng Wang, and Jun Yu. 2023. Prompting large language models with answer heuristics for knowledge-based visual question answering.

Anthony Meng Huat Tiong, Junnan Li, Boyang Li, Silvio Savarese, and Steven C.H. Hoi. 2022. Plug-and-play VQA: Zero-shot VQA by conjoining large pretrained models with zero training. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 951–967, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. LLaMA: Open and efficient foundation language models.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open foundation and fine-tuned chat models.

Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton van den Hengel. 2017. Explicit knowledge-based reasoning for visual question answering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI'17, pages 1290–1296. AAAI Press.

Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton van den Hengel. 2018. FVQA: Fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(10):2413–2427.

Jialin Wu, Jiasen Lu, Ashish Sabharwal, and Roozbeh Mottaghi. 2022. Multi-modal answer validation for knowledge-based VQA. Proceedings of the AAAI Conference on Artificial Intelligence, 36(3):2712–2721.

Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. 2022. An empirical study of GPT-3 for few-shot knowledge-based VQA. Proceedings of the AAAI Conference on Artificial Intelligence, 36(3):3081–3089.
Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. 2019. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5579–5588.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: Open pre-trained transformer language models.
A Example Appendix
A.1 Implementation Details
We used the Hugging Face Transformers library² to run the LLaMA models. During generation we used beam search with a beam size of 2 and at most 5 new tokens, keeping the default values for all other parameters of the generate method. We ran our model on a single A100 GPU with 40 GB of VRAM.

² https://huggingface.co/transformers/
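As a minimal sketch of this setup, the snippet below shows a beam-search generation call with the parameters listed above. The checkpoint identifier, half-precision loading, and device placement are our own assumptions rather than the authors' exact configuration.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "meta-llama/Llama-2-13b-hf"   # placeholder checkpoint identifier

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, torch_dtype=torch.float16
    ).to("cuda")
    model.eval()

    def generate_answer(prompt: str) -> str:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            # Beam search with beam size 2 and at most 5 new tokens, as in A.1;
            # every other argument of generate() keeps its default value.
            output = model.generate(**inputs, num_beams=2, max_new_tokens=5)
        new_tokens = output[0][inputs["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()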