

Prophet: Prompting Large Language Models with Complementary Answer Heuristics for Knowledge-based Visual Question Answering

Zhou Yu, Member, IEEE, Xuecheng Ouyang, Zhenwei Shao, Meng Wang, Fellow, IEEE, Jun Yu, Senior Member, IEEE

arXiv:2303.01903v3 [cs.CV] 14 Dec 2023

Abstract—Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question.
Early studies retrieve required knowledge from explicit knowledge bases (KBs), which often introduces irrelevant information to the
question, hence restricting the performance of their models. Recent works have resorted to using a powerful large language model
(LLM) as an implicit knowledge engine to acquire the necessary knowledge for answering. Despite the encouraging results achieved by
these methods, we argue that they have not fully activated the capacity of the blind LLM as the provided textual input is insufficient to
depict the required visual information to answer the question. In this paper, we present Prophet—a conceptually simple, flexible, and
general framework designed to prompt LLM with answer heuristics for knowledge-based VQA. Specifically, we first train a vanilla VQA
model on a specific knowledge-based VQA dataset without external knowledge. After that, we extract two types of complementary
answer heuristics from the VQA model: answer candidates and answer-aware examples. Finally, the two types of answer heuristics are
jointly encoded into a formatted prompt to facilitate the LLM’s understanding of both the image and question, thus generating a more
accurate answer. By incorporating the state-of-the-art LLM GPT-3 [1], Prophet significantly outperforms existing state-of-the-art
methods on four challenging knowledge-based VQA datasets. To demonstrate the generality of our approach, we instantiate Prophet
with the combinations of different VQA models (i.e., both discriminative and generative ones) and different LLMs (i.e., both commercial
and open-source ones).

Index Terms—Visual Question Answering (VQA), large language models (LLMs), knowledge-based VQA, multimodal learning.

This work was supported in part by the Zhejiang Provincial Natural Science Foundation of China under Grant LR22F020001, in part by the National Natural Science Foundation of China under Grants 62125201, 62072147, 62020106007 and 61836002, and in part by the Zhejiang Provincial Natural Science Foundation of China under Grant LDT23F02025F02. (Corresponding author: Jun Yu.)
• Z. Yu, Z. Shao, and J. Yu are with the Key Laboratory of Complex Systems Modeling and Simulation, the School of Computer Science, Hangzhou Dianzi University, China. (e-mail: [email protected]; [email protected]; [email protected])
• X. Ouyang is with the HDU-ITMO Joint Institute, Hangzhou Dianzi University, China. (e-mail: [email protected])
• M. Wang is with the School of Computer Science and Information Engineering, Hefei University of Technology, China. (e-mail: [email protected])

1 INTRODUCTION

Recent advances in deep learning have enabled substantial progress in visual question answering (VQA), which requires a machine to answer free-form questions by reasoning about given images. Benefiting from large-scale vision-language pretraining, the state-of-the-art methods have even surpassed human level on several representative benchmarks [2], [3], [4]. Despite the success of these methods, their reasoning abilities are far from satisfactory, especially when external knowledge is required to answer the questions. In this situation, the task of knowledge-based VQA is introduced to validate models' abilities to leverage external knowledge. Early knowledge-based VQA benchmarks additionally provide structured knowledge bases (KBs) and annotate the required knowledge facts for all the questions [5], [6]. More recently, benchmarks emphasizing open-domain knowledge have been established [7], [8], which means KBs are no longer provided and any external knowledge resource can be used for answering. We focus on the task with open-domain knowledge in this paper.

A straightforward solution for knowledge-based VQA is to retrieve knowledge entries from explicit KBs, e.g., Wikipedia and ConceptNet [9]. A KB-augmented VQA model then performs joint reasoning over the retrieved knowledge, image, and question to predict the answer [10], [11], [12], [13], [14]. However, the performance of these retrieval-based approaches is limited for two reasons: (i) the required knowledge may not be successfully retrieved from the KBs; and (ii) even if the required knowledge is retrieved, plenty of irrelevant knowledge is inevitably introduced, which hampers the learning of VQA models.

Apart from those studies using explicit KBs, another line of research resorts to pretrained large language models (LLMs), e.g., GPT-3 [1], as implicit knowledge engines for knowledge acquisition. A pioneering work, PICa, employs the frozen GPT-3 model to answer the question with a formatted prompt as its input [15]. Given a testing image-question pair, PICa first translates the image into a caption using an off-the-shelf captioning model. The question, caption, and a few in-context examples are then integrated into a textual prompt that can induce GPT-3 to predict the answer directly. Thanks to the powerful knowledge reasoning ability of GPT-3, PICa achieves significant performance improvements compared to those retrieval-based methods using explicit KBs.
Fig. 1: Conceptual comparisons of three knowledge-based VQA frameworks using a frozen LLM, e.g., GPT-3 [1]. While PICa [15], KAT [16], and REVIVE [17] directly feed the caption (C) and question (Q) into the LLM as the prompt, we argue that the information they provide for the LLM is insufficient and thus cannot fully activate the LLM's potential. In contrast, our Prophet learns a vanilla VQA model without external knowledge to produce answer heuristics, which endows the LLM with richer and more task-specific information for answer prediction. In contrast to the counterparts that resort to specific VQA models and LLMs, our Prophet is more general in that it can be instantiated with combinations of different VQA models (i.e., discriminative [18] and generative ones [19]) and different LLMs (i.e., commercial [1] and open-source ones [20], [21]).

Inspired by PICa, KAT [16] and REVIVE [17] learn KB-augmented VQA models to exploit both the implicit knowledge from LLMs and the explicit knowledge from KBs for answer prediction. The synergy of the two knowledge resources brings further improvements to their models. Despite the promising results achieved by these methods, they have not fully activated the capability of the LLMs due to the following limitations:

(i) The generated captions cannot cover all the necessary information in the image. Consider the example in Fig. 1: the caption "a group of people walk in a city square" contributes nothing to answering the question "what fruit comes from these trees?". In this situation, the LLM has to make an aimless and biased guess to answer the question.
(ii) LLMs like GPT-3 employ a few-shot learning paradigm that requires a few in-context examples to adapt to new tasks. Therefore, the choice of these examples is critical to model performance. As reported in [15], all its example selection strategies achieve far inferior performance to the oracle strategy that uses the similarity of ground-truth answers.

We ask: Is it possible to endow the LLM with some heuristics to enhance its capacity for knowledge-based VQA?

In this paper, we present Prophet—a conceptually simple yet effective framework designed to prompt LLMs with answer heuristics for knowledge-based VQA. By answer heuristics, we mean some promising answers that are presented in a proper manner in the prompt. Specifically, we introduce two types of complementary answer heuristics, namely answer candidates and answer-aware examples, to overcome the limitations in (i) and (ii), respectively. Given a testing input consisting of an image and a question, the answer candidates refer to a list of promising answers to the testing input, where each answer is associated with a confidence score. The answer-aware examples refer to a list of in-context examples, where each example has a similar answer to the testing input. Interestingly, these two types of answer heuristics can be simultaneously obtained from any vanilla VQA model trained on a specific knowledge-based VQA dataset. A schematic of Prophet is illustrated at the bottom of Fig. 1.

Without bells and whistles, Prophet surpasses previous state-of-the-art single-model results on the challenging OK-VQA and A-OKVQA datasets [7], [8], including the heavily-engineered Flamingo-80B model trained on 1.8B image-text pairs [2]. Moreover, Prophet is friendly to most researchers, as our results can be reproduced using a single GPU and a number of GPT-3 invocations.

A preliminary version of this manuscript was published in [22]. Based on that version, we have made the following contributions to further improve the performance and validate the generality of Prophet: (i) we investigate diverse types of VQA models, including classical discriminative models trained from scratch and the latest generative VQA models pretrained on large-scale corpora; (ii) we expand the used LLM from the commercial GPT-3 model to a wide range of open-source models; (iii) apart from OK-VQA and A-OKVQA, we conduct more experiments on two other knowledge-based VQA datasets, namely ScienceQA [23] and TextVQA [24]. The source code is available at https://github.com/MILVLG/prophet. We hope these studies may serve as a new baseline to inspire future research on knowledge-based VQA and universal vision-language learning.

2 RELATED WORK

Visual Question Answering (VQA). VQA has been of growing interest over the last few years. Recent studies in VQA research can be roughly divided into the following categories: better visual features [25], [26], [27], more powerful model architectures [18], [28], [29], [30], and more effective learning paradigms [31], [32], [33], [34], [35]. Most current state-of-the-art VQA methods employ the Transformer architecture [36]. By incorporating vision-language pretraining on large-scale datasets, they have approached or even surpassed human-level performance on several representative benchmarks [2], [3], [4], [37], [38]. Besides these studies on general-purpose VQA, there is also a growing trend towards exploring more granular VQA tasks with specific reasoning skills, e.g., neural-symbolic reasoning [39], [40] and knowledge utilization [5], [7].

Knowledge-based VQA. The core of this task lies in knowledge acquisition and integration. Early explorations parse the inputs into structured queries and retrieve supporting knowledge from fixed knowledge bases (KBs) to obtain the answers [5], [6]. As the provided knowledge resources are not sufficient to represent general knowledge, subsequent research mainly focuses on acquiring explicit knowledge from multiple open-domain knowledge resources, e.g., ConceptNet [9], Wikipedia [41], and Google Images [12].
[Fig. 2 appears here. It depicts the two-stage pipeline with a worked example; the prompt head shown in the figure reads: "Please answer the question according to the context and answer candidates. Each answer candidate is associated with a confidence score within a bracket. The true answer may not be included in the candidates."]
Fig. 2: Our Prophet framework has two stages: answer heuristics generation and heuristics-enhanced prompting. In the
answer heuristics generation stage, a vanilla VQA model trained on specific knowledge-based VQA dataset is employed
to generate two types of complementary answer heuristics, i.e., answer candidates and answer-aware examples. In the
heuristics-enhanced prompting stage, the answer heuristics, question, and caption are integrated into a formatted prompt
to instruct a frozen LLM (e.g., GPT-3) to predict an answer. As shown in the example, both answer heuristics contribute to
the answer of “helium”.

This retrieved knowledge is integrated with the image-question pair for answer prediction [12], [13], [42]. Motivated by the powerful capacities of LLMs (e.g., GPT-3 [1]) in knowledge reasoning, recent state-of-the-art approaches regard an LLM as an implicit knowledge engine. They either utilize it to predict answers from given questions and extracted visual captions [15] or to extract answer candidates with evidence to improve answer prediction [16], [17]. Nevertheless, they have not fully activated the reasoning capability of LLMs, as the necessary visual information to answer the question is not represented exactly. This motivates us to explore strategies for prompting LLMs with question-aware information (i.e., answer heuristics). Similar to Prophet, a concurrent work, PromptCap, also aims to enhance the input information for LLMs by learning a question-aware captioning model [43]. However, PromptCap needs to use the LLM in both the training and testing phases, which incurs tremendous computational costs as the training set is usually large. In contrast, Prophet is more economical as it only utilizes the LLM in the testing phase.

In-context learning. Unlike the pretrain-then-finetune paradigm for language models like BERT [44], GPT-3 innovatively introduces a few-shot in-context learning paradigm that has become the de facto standard for subsequent LLMs. To adapt to a new task, GPT-3 only needs to concatenate a few examples of the task with the input as the prompt at inference time and requires no parameter updates. This appealing property has inspired research on training multimodal few-shot learners [2]. Empirical studies show that a huge model (e.g., 80B parameters in Flamingo [2]) is required for effective few-shot learning, which is unaffordable for most people to reproduce.

3 THE PROPHET FRAMEWORK

Our Prophet is a conceptually simple two-stage framework. In the answer heuristics generation stage, a vanilla VQA model is learned to generate two types of answer heuristics, i.e., answer candidates and answer-aware examples (detailed in §3.2). In the heuristics-enhanced prompting stage, the answer heuristics, question, and caption are integrated into a formatted prompt to instruct a frozen LLM to predict an answer (detailed in §3.3). An overview of the Prophet framework is depicted in Fig. 2.

3.1 Preliminaries

Before presenting Prophet, we briefly introduce the in-context learning paradigm developed by GPT-3 and its adaptation to knowledge-based VQA by PICa [15].

GPT-3 is an autoregressive language model pretrained on a tremendous dataset. During inference, in-context few-shot learning formulates a new downstream task as a text sequence generation task on the frozen model. Given a testing input x, its target y is predicted conditioned on a formatted prompt p(h, E, x), where h refers to a prompt head (aka instruction) that describes the task and E = {e_1, e_2, ..., e_n} corresponds to n in-context examples. Let the target y = (y^1, y^2, ..., y^L) be a text sequence of L tokens. For notational convenience, we denote [l] as the set of natural numbers from 1 to l and use y^[l] = (y^1, ..., y^l) to represent the sub-sequence containing the first l words of y. At each decoding step l, we have:

    y^l = argmax_{ŷ^l} p_GPT-3(ŷ^l | p, y^[l-1])    (1)

where each in-context example e_i = (x_i, y_i) contains an input-target pair of the task, which is constructed manually or sampled from the training set.

To adapt LLMs like GPT-3 to the knowledge-based VQA task, the key is to design proper prompts. Given a question q and an image v as inputs, the VQA task aims to predict a target answer a.
Since LLMs do not understand images intrinsically, the image needs to be translated into a caption c using an off-the-shelf captioning model. PICa formulates the testing input x with the following template:

    Context: c \n Question: q \n Answer:

where the variables c and q are substituted by the specific testing input, and \n stands for a line break in the template. Accordingly, each in-context example e_i is formulated into a similar template as follows:

    Context: c_i \n Question: q_i \n Answer: a_i

where c_i, q_i, and a_i refer to an image-question-answer triplet collected from the training set. The complete prompt of PICa consists of a fixed prompt head, a few in-context examples, and a testing input. This prompt is fed into a frozen LLM for answer prediction.

Our Prophet inherits the pipeline of PICa. In addition, we introduce answer heuristics into the prompt structure to better activate the reasoning capability of the LLM, which leads to more accurate answers.
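To make the template concrete, the short Python sketch below assembles a PICa-style prompt string from a caption, a question, and a few in-context examples. The helper names, the prompt-head wording, and the toy data are illustrative assumptions, not code from the PICa or Prophet releases.

```python
# Minimal sketch of PICa-style prompt construction (illustrative only).

PROMPT_HEAD = "Please answer the question according to the context.\n\n"  # hypothetical wording

def format_example(caption, question, answer=None):
    # One Context/Question/Answer block; the answer slot stays blank for the testing input.
    block = f"Context: {caption}\nQuestion: {question}\nAnswer:"
    return block + (f" {answer}\n\n" if answer is not None else "")

def build_pica_prompt(test_caption, test_question, examples):
    # examples: list of (caption, question, answer) triplets from the training set
    prompt = PROMPT_HEAD
    for cap, ques, ans in examples:
        prompt += format_example(cap, ques, ans)
    return prompt + format_example(test_caption, test_question)

# Toy usage:
demo = [("a man riding a wave on a surfboard", "what sport is this?", "surfing")]
print(build_pica_prompt("a cat sitting on a laptop", "what animal is shown?", demo))
```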
3.2 Stage-1: Answer Heuristics Generation

We introduce two types of answer heuristics: answer candidates and answer-aware examples. Given a testing input consisting of an image and a question, the answer candidates refer to a list of promising answers to the testing input, where each answer is associated with a confidence score. The answer-aware examples refer to a list of in-context examples, where each example has similar answers to the testing input. Interestingly, these two types of answer heuristics can be obtained simultaneously from any vanilla VQA model trained on a specific knowledge-based VQA task.

As shown in Fig. 3, existing VQA methods can be categorized into discriminative and generative ones based on the ways they obtain answers. This discrepancy leads to different strategies for answer heuristics generation. We elaborate the strategy for each of the two classes of VQA models below.

Fig. 3: Discriminative vs. generative VQA models. Taking an image (V) and a question (Q) as inputs, a typical discriminative VQA model like MCAN [18] performs multi-class classification to predict the most relevant answer (which may contain multiple words) from a predefined answer vocabulary, while a typical generative VQA model like mPLUG [19] iteratively predicts one answer word at a time to constitute the final answer.

3.2.1 Discriminative VQA models

Denote a VQA training dataset as D = {(v_i, q_i, a_i)}_{i=1}^{M}, where v_i, q_i, and a_i refer to the image, question, and answer, respectively. The most frequent answers in the training set form an answer vocabulary V = {w_j}_{j=1}^{S}, where S is the answer vocabulary size. A discriminative VQA model M_disc is learned from D to perform an S-way classification over the answers. Generally, the model M_disc can be separated into two submodels, i.e., a backbone M^B_disc and a prediction head M^H_disc. The backbone M^B_disc acts as an encoder that fuses the multimodal inputs v and q to obtain a fused feature z:

    z = M^B_disc(v, q)    (2)

The prediction head M^H_disc simply adopts a linear layer followed by a sigmoid function to project the fused feature z into a score vector y ∈ R^S over the answer vocabulary:

    y = M^H_disc(z)    (3)

where the j-th element of y represents the confidence score for answer w_j. Based on the above definitions, we explain how to generate the two types of answer heuristics below.

Note that although the learned VQA model M_disc does not incorporate any external knowledge, it can be used for knowledge-based VQA when trained properly. We regard it as a reference model and compare its performance to Prophet in the experiments to show the effectiveness of the LLM for knowledge-based VQA.

Answer candidates. Given a testing input (v, q), we obtain its score vector y for all answers using Eq. (3). Denoting s_j ∈ R+ as the j-th element of y, we obtain the top-K answers with the highest scores as follows:

    I_AC = argTopK_{j ∈ {1,2,...,S}} s_j    (4)

where I_AC denotes the index set of the top-K answer candidates. The answer candidates C are defined as follows:

    C = {(w_j, s_j) | j ∈ I_AC}    (5)

where w_j and s_j are an answer candidate and its confidence score, respectively. To make the formats of the in-context examples and the testing input consistent, for each example e_i we also calculate and provide a set of answer candidates C_i.
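As a concrete illustration of Eqs. (4)-(5), the PyTorch-style sketch below extracts the top-K answer candidates and their confidence scores from a discriminative model. The backbone/head interface and all names are assumptions for demonstration rather than the actual MCAN implementation.

```python
import torch

@torch.no_grad()
def get_answer_candidates(backbone, head, image, question, answer_vocab, k=10):
    # Eq. (2): fuse the multimodal inputs into a single feature vector z.
    z = backbone(image, question)
    # Eq. (3): linear head + sigmoid gives a confidence score for every answer in the vocabulary.
    scores = torch.sigmoid(head(z)).squeeze(0)
    # Eq. (4): indices of the K highest-scoring answers.
    top_scores, top_idx = scores.topk(k)
    # Eq. (5): answer candidates C as (answer, confidence) pairs.
    return [(answer_vocab[j], s) for j, s in zip(top_idx.tolist(), top_scores.tolist())]
```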
Answer-aware examples. Several previous studies have shown that the choice of in-context examples is crucial to GPT-3's few-shot learning performance [15]. Their results motivate us to devise an answer-aware example selection strategy.

Given a testing input (v, q) and any training input (v_i, q_i), we can obtain their corresponding fused features z and z_i from Eq. (2) using the trained model. Since the fused features are linearly projected for answer prediction, we conjecture that these fused features lie in a latent answer space that contains rich semantics of the answers to the given image-question pairs. If z and z_i are close in the latent space, they are more likely to share similar answers and image-question inputs.
We calculate the cosine similarity of the fused features between the testing input and each training input, and then select the top-N nearest neighbors in the latent space as the answer-aware examples:

    I_AE = argTopN_{i ∈ {1,2,...,M}} (z^T z_i) / (||z||_2 ||z_i||_2)    (6)

where I_AE is the index set of the top-N most similar samples in D. The answer-aware examples E are defined as follows:

    E = {(v_i, q_i, a_i) | i ∈ I_AE}    (7)

Note that the fused features of the training inputs can be computed and stored beforehand, allowing efficient answer-aware example selection.
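A minimal PyTorch-style sketch of this selection step (Eqs. (6)-(7)) is given below. It assumes the fused features of all training samples have already been computed and stacked into one tensor, and the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_answer_aware_examples(test_feat, train_feats, train_triplets, n=16):
    # test_feat: (d,) fused feature of the testing input; train_feats: (M, d) cached features.
    sims = F.cosine_similarity(test_feat.unsqueeze(0), train_feats, dim=1)  # Eq. (6)
    top_idx = sims.topk(n).indices.tolist()
    # Eq. (7): the answer-aware examples E are the corresponding (image, question, answer) triplets.
    return [train_triplets[i] for i in top_idx]
```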
3.2.2 Generative VQA models

Recent state-of-the-art VQA models tend to use generative model architectures due to their remarkable scalability and generalizability [19], [31], [37].

Given the same VQA training dataset D = {(v_i, q_i, a_i)}_{i=1}^{M} as above, a generative VQA model M_gen is learned from D to generate answers word-by-word from a predefined word vocabulary V = {w_j}_{j=1}^{S}, where S is the word vocabulary size. Each answer can be represented as a text sequence with a dynamic length of L words:

    w = (w^1, w^2, ..., w^L)    (8)

where w^1 = [BOS] refers to a special start-of-sentence token and w^L = [EOS] refers to an end-of-sentence token.

Similar to the discriminative model, M_gen can also be separated into a backbone M^B_gen and a prediction head M^H_gen. The backbone M^B_gen corresponds to an encoder-decoder or a pure decoder architecture that fuses the multimodal inputs v and q and then generates the latent feature of each answer word in an autoregressive manner:

    z^l = M^B_gen(v, q, w^[l-1])    (9)

where z^l denotes the latent feature of the l-th answer word. On top of the latent feature z^l, the prediction head M^H_gen applies a linear projection (or an MLP) followed by a softmax function to decode it into a score distribution y^l ∈ R^S over the whole word vocabulary:

    y^l = M^H_gen(z^l)    (10)

where the l-th answer word w^l is obtained from y^l by greedily choosing the word with the highest score. Until an [EOS] token is generated, w^l is appended to w^[l-1] to obtain w^[l], which is iteratively fed into the model M_gen to predict the next word.

Answer candidates. Given a testing input (v, q), we can obtain its most relevant answer using the greedy decoding strategy above. However, how to obtain the answer candidates consisting of the top-K answers and their confidence scores is not straightforward. We resort to the beam search algorithm, which is widely used in neural machine translation [45] and visual captioning [46], to address this issue.

Similar to Eq. (5), we denote the top-K answer candidates as a set of tuples as follows:

    C = {(w_1, s_1), (w_2, s_2), ..., (w_K, s_K)}    (11)

where each w_j represents an answer consisting of a sequence of answer words and s_j ∈ R+ denotes its corresponding confidence score calculated over all the answer words. The answer candidate set C is obtained from the generative model M_gen equipped with the beam search strategy. Specifically, we initialize each answer w_j with the same [BOS] token. At each decoding step l, each w_j of length l is first passed through M_gen to obtain its top-K candidate words with the highest scores. After that, an expand-then-reduce strategy is performed to update the K answers: (i) expand step: each w_j is expanded K times to combine with the K candidate words, resulting in K*K new candidate answers of length l+1; (ii) reduce step: among the K*K candidate answers, only the top-K ones with the highest accumulated scores s = Σ_{i=1}^{l} log y^i are retained, which are then regarded as the inputs to the next decoding step.
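The expand-then-reduce procedure is a standard beam search; the sketch below shows one way it could be written. Here `step_fn` is a hypothetical wrapper around the generative model that maps a partial word sequence to log-probabilities over the word vocabulary, so the accumulated score matches s = Σ log y^i.

```python
import torch

@torch.no_grad()
def beam_search_candidates(step_fn, bos_id, eos_id, k=10, max_len=8):
    beams = [([bos_id], 0.0)]          # each beam: (word ids, accumulated log-probability)
    finished = []
    for _ in range(max_len):
        expanded = []
        for seq, score in beams:
            if seq[-1] == eos_id:      # completed answers no longer expand
                finished.append((seq, score))
                continue
            log_probs = step_fn(seq)                 # (vocab_size,) log-probabilities
            top_lp, top_ids = log_probs.topk(k)      # expand step: K continuations per beam
            for lp, wid in zip(top_lp.tolist(), top_ids.tolist()):
                expanded.append((seq + [wid], score + lp))
        if not expanded:
            break
        # reduce step: keep only the K partial answers with the highest accumulated scores
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:k]
    finished.extend(beams)
    # top-K answers with their accumulated log-scores (exponentiate for confidences if needed)
    return sorted(finished, key=lambda b: b[1], reverse=True)[:k]
```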
Answer-aware examples. Similar to the example selection strategy for discriminative models, the answer-aware examples for generative models are also obtained by performing a kNN search in a latent answer space. It is worth noting that the granularity of the latent features differs between the two types of VQA models: each latent feature obtained from a discriminative VQA model refers to an answer entry in the answer vocabulary, while each latent feature obtained from a generative VQA model refers to an answer word.

Given a testing input (v, q) and the i-th training input (v_i, q_i), the latent features for their multi-word answers can be respectively represented as feature groups Z = [z^1, z^2, ..., z^L] ∈ R^{L×d} and Z_i = [z_i^1, z_i^2, ..., z_i^{L_i}] ∈ R^{L_i×d}, where d is the common dimensionality of the latent answer space, and L and L_i refer to the answer lengths of Z and Z_i, respectively. We define a simple score function to average the dot-product similarity of each pair of features z^j ∈ Z and z_i^k ∈ Z_i:

    π_i = (1 / (L · L_i)) Σ_{j=1}^{L} Σ_{k=1}^{L_i} (z^j · z_i^k) / (||z^j||_2 ||z_i^k||_2)    (12)

Using the score function above, we obtain the top-N nearest neighbors of the query input in the training set and then format them as the answer-aware examples E as follows:

    I_AE = argTopN_{i ∈ {1,2,...,M}} π_i,   E = {(v_i, q_i, a_i) | i ∈ I_AE}    (13)

where I_AE is the index set of the top-N nearest neighbors in the training set D.
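The score function in Eq. (12) reduces to an average of pairwise cosine similarities between the two groups of word-level features, as in the short sketch below (names are illustrative).

```python
import torch
import torch.nn.functional as F

def answer_group_similarity(Z, Zi):
    # Z: (L, d) word-level features of the testing answer; Zi: (Li, d) features of a training answer.
    Z = F.normalize(Z, dim=1)
    Zi = F.normalize(Zi, dim=1)
    # Eq. (12): mean of all L * Li pairwise cosine similarities.
    return (Z @ Zi.t()).mean().item()
```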
3.3 Stage-2: Heuristics-enhanced Prompting

After obtaining the answer heuristics (i.e., answer candidates C and answer-aware examples E) from stage-1, we encode them into a heuristics-enhanced prompt to facilitate the few-shot learning capacity of the LLM for knowledge-based VQA.

A prompt consists of a prompt head, a set of in-context examples, and a testing input. The prompt head describes the VQA task in natural language. We refer to the prompt head designed in PICa and supplement it with a new description of the answer candidates. Although we encourage the LLM to generate answers according to the answer candidates, we also allow it to explore broadly and generate answers beyond the candidates. The complete format of our prompt head is shown in Fig. 2.
Our in-context examples are derived from the obtained N answer-aware examples E = {e_1, e_2, ..., e_N}. Based on PICa's template in §3.1, for each example e_i we introduce its answer candidates C_i by adding one extra line as follows:

    Context: c_i \n Question: q_i \n
    Candidates: w_{j1}(s_{j1}), w_{j2}(s_{j2}), ..., w_{jK}(s_{jK}) \n
    Answer: a_i

where j1, j2, ..., jK correspond to the actual indices of the elements in C_i. Each answer candidate w_{jk} is paired with its confidence score s_{jk} within a bracket. The confidence scores additionally indicate the reliability of the corresponding answer candidates, which helps the LLM focus more on the promising candidates and be more tolerant of the less relevant ones. For the testing input, the template is similar to that of the in-context examples, except that the answer slot is left blank for the LLM to fill in.
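Putting the pieces together, the sketch below builds a heuristics-enhanced prompt in the format described above. The prompt head is the one quoted in the Fig. 2 placeholder; the helper names and data layout are assumptions for illustration rather than the released implementation.

```python
PROMPT_HEAD = (
    "Please answer the question according to the context and answer candidates. "
    "Each answer candidate is associated with a confidence score within a bracket. "
    "The true answer may not be included in the candidates.\n\n"
)

def format_block(caption, question, candidates, answer=None):
    # candidates: list of (answer string, confidence score) pairs from stage-1.
    cand_str = ", ".join(f"{w} ({s:.2f})" for w, s in candidates)
    block = f"Context: {caption}\nQuestion: {question}\nCandidates: {cand_str}\nAnswer:"
    return block + (f" {answer}\n\n" if answer is not None else "")

def build_prophet_prompt(test_input, examples):
    # test_input: (caption, question, candidates); examples: (caption, question, candidates, answer).
    prompt = PROMPT_HEAD
    for cap, ques, cands, ans in examples:
        prompt += format_block(cap, ques, cands, ans)
    return prompt + format_block(*test_input)   # answer slot left blank for the LLM
```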
To better exploit the available examples, we use the multi-query ensemble strategy [15]. Specifically, we increase the number of answer-aware examples to N*T to obtain T paralleled prompts, where each prompt still contains N examples. By prompting the LLM T times, we obtain T answer predictions. Majority voting is performed over the T predictions to determine the final answer. The effects of different N and T will be verified in the experiments.
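A sketch of the multi-query ensemble is shown below; it reuses the build_prophet_prompt helper sketched earlier, and `query_llm` stands in for an actual LLM call (e.g., a GPT-3 API request with temperature 0). All names are illustrative.

```python
from collections import Counter

def multi_query_ensemble(test_input, ranked_examples, query_llm, n=16, t=5):
    # ranked_examples: at least N*T answer-aware examples sorted by similarity (best first).
    predictions = []
    for i in range(t):
        examples = ranked_examples[i * n:(i + 1) * n]      # each prompt gets its own N examples
        prompt = build_prophet_prompt(test_input, examples)
        predictions.append(query_llm(prompt).strip().lower())
    # majority voting over the T predictions determines the final answer
    return Counter(predictions).most_common(1)[0][0]
```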
4 EXPERIMENTS

We mainly evaluate the performance of Prophet on two prevalent knowledge-based VQA datasets: OK-VQA [7] and A-OKVQA [8]. We conduct comprehensive ablation experiments to explore the effectiveness of Prophet. Taking the ablation results into account, we perform thorough comparisons of Prophet and state-of-the-art methods. Moreover, we showcase the generalization ability of Prophet on two diverse knowledge-based VQA datasets, ScienceQA [23] and TextVQA [24], which require external science and OCR knowledge, respectively.

4.1 Datasets

OK-VQA is a commonly used knowledge-based VQA dataset [7]. The dataset contains 9K and 5K image-question pairs for training and testing, respectively. All questions are manually filtered to ensure that outside knowledge is required to answer them. Each data sample is annotated with ten open-ended answers. The accuracy computed by the soft scores is used as the evaluation metric [47]. We use version 1.1 of OK-VQA in the experiments.

A-OKVQA is currently the largest knowledge-based VQA dataset [8]. The dataset is split into three subsets: 17K training, 1K validation, and 7K testing. Each question is annotated with ten open-ended answers for direct answer (DA) evaluation. Besides, it provides a multiple choice (MC) evaluation to choose the correct answer from four choices.

ScienceQA is a dataset that consists of about 21K questions over a diverse set of science topics [23]. Out of the 21K questions, only the 'IMG' subset of 10.3K (48.7%) samples has image content, which is used in our experiments. Consequently, the retained dataset consists of 6.2K training, 2.1K validation, and 2.0K testing samples. The questions require high school-level science knowledge to arrive at the correct answer chosen from multiple choices.

TextVQA contains 28K images and 45K questions, where each question requires models to read and reason about the text in the image to give a correct answer [24]. The dataset is split into three subsets of 34.6K training, 5K validation, and 5.7K testing questions. Similar to OK-VQA, each question is annotated with ten open-ended answers by humans, and soft-voting accuracy is used as the evaluation metric. Following the strategy in [48], [49], we supplement the training set with the augmented VQA samples from ST-VQA [50].

4.2 Implementation Details

Default settings on OK-VQA. We use MCAN-large [18] as our default VQA model to generate answer heuristics. To improve the model capability, we modify the original MCAN model by: (i) replacing the original bottom-up-attention region-based features with the grid-based features extracted from CLIP's visual encoder with an RN50×64 backbone [51]; and (ii) replacing the original LSTM network with a pretrained BERT-large model [44].

Similar to [11], we apply the transfer learning paradigm to further enhance the model capability. The model is first pretrained on the VQAv2 dataset [47] and the Visual Genome dataset [52]. To prevent data contamination, we remove from the pretraining dataset those samples whose images are used in the testing split of OK-VQA. After that, the pretrained model is further finetuned on the training split of OK-VQA to obtain our final VQA model. Note that the answer vocabulary of the pretrained model (with 3,129 answers) is quite different from the vocabulary of OK-VQA. To bridge this gap, we merge the answer vocabulary of OK-VQA (similar to [25], we collect the answers that appear more than eight times in the training set of OK-VQA, resulting in 2,794 answers) with the existing vocabulary, resulting in an expanded answer vocabulary with 4,477 answers for model finetuning. This model is trained on a single Nvidia RTX 3090 GPU, which is affordable for most people.

During the prompting stage using LLMs, we follow PICa and use OSCAR+ as the captioning model [26]. Unless otherwise noted, we set the number of answer candidates K=10, the number of in-context examples N=16, and the number of queries T=5 as our default settings. The default version of GPT-3 used in our experiments is text-davinci-002 and the sampling temperature is set to 0.

Settings on other datasets. The settings and strategies for OK-VQA can be directly transferred to A-OKVQA to address its DA task. For the MC task, we follow the strategy in [8] to project the predicted answer to the nearest answer choice. Moreover, we design a Prophet variant for the MC task. It uses a slightly different prompt that adds the multiple choices to the in-context examples and the testing input, and instructs the LLM to choose the correct one from the four choices.
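For reference, the projection onto the nearest answer choice could look like the sketch below. The similarity measure actually used follows [8]; here a generic text encoder (the hypothetical `encode` function) is assumed purely for illustration.

```python
import torch
import torch.nn.functional as F

def project_to_choice(prediction, choices, encode):
    # prediction: open-ended answer string; choices: the four MC options;
    # encode: hypothetical text encoder returning a 1-D feature vector.
    pred_vec = F.normalize(encode(prediction), dim=0)
    sims = torch.stack([
        F.cosine_similarity(pred_vec, F.normalize(encode(c), dim=0), dim=0) for c in choices
    ])
    return choices[int(sims.argmax())]
```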
TABLE 1: Ablation experiments for Prophet. All the reported results are evaluated on the testing set of OK-VQA v1.1. The best result in each table is bolded and the result with the default settings is marked in gray.

(a) Prompting vs. retrieval. Our prompting-based paradigm is more effective than the retrieval-based one in MAVEx [12]. †: our re-implementation.
VQA model, paradigm | stage-1 acc. | accuracy
ViLBERT, retrieval [12] | 35.20 | 40.28 (+5.08)
ViLBERT, prompt† | 35.28 | 44.97 (+9.69)

(b) Capability of VQA models. More powerful VQA models lead to higher accuracies, but obtain slightly less relative improvements from stage-2.
visual features | stage-1 acc. | accuracy
Bottom-Up [25] | 46.83 | 55.34 (+8.51)
VinVL [26] | 47.88 | 56.23 (+8.35)
CLIP-ViT-L/14 [51] | 52.03 | 60.12 (+8.09)
CLIP-RN50×64 [51] | 53.04 | 60.84 (+7.80)

(c) Answer candidates. They are crucial to Prophet and increasing K leads to better performance.
#candidates (K) | hit rate | accuracy
0 | - | 49.63
1 | 53.04 | 56.04
5 | 75.20 | 60.17
10 | 79.83 | 60.84

(d) Example selection strategy. Our answer-aware example selection based on fused features is more effective than the others.
example selection | hit rate | accuracy
(a) rand | 5.31 | 58.66
(b) ques + img [15] | 59.58 | 59.82
(c) fused | 83.63 | 60.84
(d) fused + ques + img | 82.45 | 60.38
(e) answer logits | 79.25 | 60.40

(e) Numbers of examples and queries. Increasing N and T improves model performance at the expense of linearly increasing overheads.
#examples (N) | accuracy (T=1) | accuracy (T=5)
0 | 49.97 | 49.97
1 | 54.89 | 56.75
8 | 57.49 | 59.91
16 | 57.52 | 60.84
20 | 57.91 | 61.10

(f) Prompt contents. The default settings contain the exact necessary information for prompting.
variants | accuracy
(a) default | 60.84
(b) w/o prompt head | 60.54
(c) w/o confidence scores | 55.46
(d) w/o image captions | 58.27
(e) default+tags [15] | 60.51

For ScienceQA, we reuse all the default settings for OK-VQA. If a training sample provides an extra textual hint, we simply append the text to the generated caption as the new context of the corresponding image. For TextVQA, we use the commercial system from Amazon (https://aws.amazon.com/textract/) to extract OCR from images, whose effectiveness has been verified in previous work [49]. The extracted OCR texts are provided in both the in-context examples and the testing input to instruct the LLM.

Settings of other VQA models. In addition to MCAN, we also experiment with one generative VQA model, mPLUG [19], which is first pretrained on a task-agnostic image-text corpus and then finetuned on a specific VQA dataset. Following the aforementioned two-stage transfer learning paradigm for MCAN, the pretrained mPLUG model is first finetuned on the VQAv2 dataset and then further finetuned on the specific knowledge-based VQA dataset.

4.3 Ablation Studies

We conduct ablation experiments for Prophet on OK-VQA using the default settings above. The results shown in Table 1 and Fig. 4 are discussed in detail below.

Prompting vs. retrieval. Prophet uses a prompting-based paradigm to predict the answer based on a set of promising answer candidates. In contrast, a previous work, MAVEx [12], exploits answer candidates but adopts a retrieval-based paradigm that searches knowledge from external KBs to determine the answer. As both Prophet and MAVEx train a VQA model to generate answer candidates (stage-1), we can compare the superiority of the two paradigms (stage-2). In Table 1a, we show the performance of the two paradigms in terms of stage-1 accuracy and final accuracy, respectively. For a fair comparison, we re-implement the VQA model used in MAVEx, i.e., ViLBERT [32], to generate answer heuristics for our Prophet. From the results, we can see that, based on the same VQA model, our Prophet outperforms MAVEx by a large margin (44.97% vs. 40.28%), showing the superiority of our prompting-based paradigm over MAVEx's retrieval-based paradigm in external knowledge acquisition and integration.

Capability of VQA models. In Table 1b, we study how VQA models of different capabilities impact the performance of Prophet. To better control the model capability, we use the same MCAN model trained with four visual features: the region-based Bottom-Up [25] and VinVL [26] features and the grid-based CLIP features from two backbones (ViT-L/14 and RN50×64) [51]. The results show that more powerful VQA models (reflected in the stage-1 accuracies) lead to better performance of Prophet, as they provide answer heuristics of higher quality. Combining the results in Table 1a, we also observe that more powerful VQA models achieve less relative improvement from GPT-3, which can be explained by the intrinsic diminishing-return property. As a by-product, we verify that the visual features are important to the performance of knowledge-based VQA, which is consistent with the observations in [17]. The models with CLIP-based visual features significantly outperform those with region-based features, indicating that CLIP's visual features contain richer visual knowledge due to large-scale pretraining.

In addition to using different visual features for MCAN, we can also replace the whole MCAN model with any generative model pretrained on large-scale multimodal datasets, as mentioned in §3.1. These results will be reported in the main results.

Answer candidates. Table 1c varies the number of answer candidates K from 0 to 10 to explore its effect on Prophet. For each testing sample, if the ground-truth answer is hit by one of the K answer candidates, we accumulate the soft score of that ground-truth answer (in practice, multiple ground-truth answers are provided; if multiple answers are hit simultaneously, we choose the answer with the largest soft score for accumulation). The hit rate is calculated over the testing set by dividing the accumulated score by the number of samples.
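The hit rate described above can be computed with a few lines of code, sketched below under an assumed data layout (each sample stores its K candidate strings and a mapping from ground-truth answers to soft scores).

```python
def candidate_hit_rate(samples):
    # samples: list of dicts with "candidates" (list of K answer strings) and
    # "gt_scores" (dict mapping each ground-truth answer to its soft score).
    total = 0.0
    for s in samples:
        hit_scores = [s["gt_scores"][a] for a in s["candidates"] if a in s["gt_scores"]]
        # if several ground-truth answers are hit, keep the one with the largest soft score
        total += max(hit_scores) if hit_scores else 0.0
    return total / len(samples)
```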
From the results, we can see that: (i) without any answer candidates, Prophet's accuracy drops by 6.4 points (K=0 vs. K=1), showing the importance of answer candidates in Prophet; (ii) with the increase of answer candidates, the hit rate and final accuracy grow accordingly, but they exhibit a tendency to saturate. This is because the quality of the answer candidates eventually saturates as K increases; (iii) when K=1, the final accuracy is even higher than the hit rate (56.04% vs. 53.04%), which implies that GPT-3 has a strong capability to correct wrong answer candidates while keeping the correct ones.

Example selection strategy. To show the effectiveness of our answer-aware example selection strategy, we compare it to other example selection strategies in Table 1d. The compared strategies include: (a) rand: examples that are randomly selected; (b) ques + img: examples that are selected based on the joint similarity of question and image features, which is used in PICa; (c) fused: our default strategy that selects examples based on the similarity of fused features; (d) fused + ques + img: a combination of our default strategy and PICa's strategy; and (e) answer logits: examples that are selected based on the similarity of the answer logits obtained in Eq. (3). Besides the final accuracy, we also report the hit rate of answers within the selected examples for each strategy.

The results show that the accuracy is positively correlated with the hit rate of answers, which verifies our hypothesis that answer-aware examples contribute significantly to the performance of Prophet. Compared with the other strategies, our default strategy (c) achieves the best performance with the highest hit rate. Strategy (d), which integrates other information (ques + img) into (c), leads to worse performance due to the introduction of irrelevant and noisy information. Finally, strategy (e) reports slightly worse performance than (c). We conjecture that this is because the answer logits have lost too much information about the input question and image, which is also useful for GPT-3 to perform knowledge reasoning.

Numbers of examples and queries. Table 1e contains the ablation studies for the numbers of examples and queries. We choose different numbers of examples N ∈ {0, 1, 8, 16, 20} for each query and different numbers of queries T ∈ {1, 5}, respectively. The results show that the performance of Prophet improves with the increase of N and T, which is consistent with the results in PICa. By increasing T from 1 to 5, the entries with larger N enjoy greater performance improvements at the expense of linearly increasing overheads.

Interestingly, the Prophet variant with N=0 delivers worse performance than the VQA model in stage-1 (49.97% vs. 53.04%), even though answer candidates are provided. Meanwhile, when given one example (N=1), the Prophet variant distinctly surpasses the VQA model (56.75% vs. 53.04%). This suggests the necessity of few-shot in-context examples for GPT-3 to activate its capability to adapt to the knowledge-based VQA task.

Prompt contents. In Table 1f, we ablate the prompt contents in the default settings by: (b) removing the prompt head; (c) removing the confidence scores for answer candidates; (d) removing image captions; and (e) adding predicted tags from external models [15].

The results lead to the following observations. First, the confidence scores are of critical importance to the performance of our Prophet. This is because they carry the necessary information for GPT-3 to understand the answer candidates. Second, without image captions, Prophet still works steadily. This reflects the fact that our answer heuristics in prompts already provide sufficient information for Prophet to solve the task. Third, the prompt head is of less importance, indicating that GPT-3 is capable of understanding the task directly from the in-context examples. Finally, introducing extra information like object tags leads to a slight performance drop, which is contrary to the results in PICa. We conjecture that this information has already been encoded in the answer heuristics implicitly.

Fig. 4: Prophet's prediction behaviors in terms of (a) distribution and (b) per-type accuracy. As Prophet takes K answer candidates as inputs, we define three prediction behaviors for Prophet as follows: "keep top-1", "in top 2-K", and "beyond top-K". All the testing samples can be categorized into one of the three classes.

TABLE 2: Prophet's combinatorial prediction behaviors in two stages. Prophet maintains the majority of correct predictions at stage-1, and the accuracy improvement by stage-2 is mainly because the number of wrong-to-correct samples is larger than that of the correct-to-wrong samples.
Stage-1 pred. \ Stage-2 pred. | correct | wrong
correct | 54.4% | 4.2%
wrong | 12.0% | 29.4%

Prediction behaviors in different stages. In Table 1b, we can observe a significant performance improvement of Prophet (stage-2) over its corresponding MCAN model (stage-1). To better understand this improvement, we conduct a statistical analysis of Prophet's prediction behaviors. As Prophet takes K answer candidates from MCAN as inputs, we define three prediction behaviors for Prophet: "keep top-1", "in top 2-K", and "beyond top-K". All the testing samples can be categorized into one of the three classes. The statistical results in Fig. 4 show that: (i) for 68.1% of the testing samples (green slice), Prophet keeps the top-1 predictions of MCAN. These samples achieve a 69% accuracy and are mostly easy samples; (ii) for 21.8% of the testing samples (blue slice), Prophet selects answers from the top 2-K answer candidates. These samples are relatively hard, so that MCAN delivers a 24% accuracy while Prophet has a much higher 40% accuracy; (iii) for the remaining 10.1% of the testing samples (yellow slice), Prophet predicts answers beyond the answer candidates (the probability that Prophet's prediction is a combination of several candidates is negligible). For these most difficult samples, MCAN only delivers a 12% accuracy while Prophet magnificently achieves a 42% accuracy.
TABLE 3: Ablation study of different LLMs. All variants use the default settings and are evaluated on the testing set of OK-VQA. The per-sample average costs of the open-source models are measured by the GPU running time on a server with A100 GPUs, while the costs of the commercial models are measured by money. †: the LLM's maximum token length is insufficient for N=16 examples; for these LLMs, we reduce N to fit their maximum capacity.
LLM (version or size) | per-sample average cost | accuracy
commercial models
GPT-3 (text-davinci-002) | $0.2 | 60.8
GPT-3 (3.5-turbo-instruct) | $0.015 | 58.9
open-source models
LLaMA-1 (7B) [20]† | 2.6s | 51.8
LLaMA-1 (13B) [20]† | 4.6s | 56.1
LLaMA-1 (30B)† | 8.7s | 57.1
LLaMA-1 (65B) [20]† | 16.6s | 58.8
Falcon (7B) [21]† | 2.7s | 50.5
Falcon (40B) [21]† | 14.1s | 57.1
LLaMA-2 (7B) [53] | 2.7s | 56.6
LLaMA-2-Chat (7B) [53] | 2.7s | 54.0
LLaMA-2 (13B) [53] | 4.8s | 57.9
LLaMA-2-Chat (13B) [53] | 4.8s | 56.5
LLaMA-2 (70B) [53] | 18.3s | 59.6
Mistral (7B) [54] | 3.0s | 59.7

TABLE 4: Comparisons to the state-of-the-art methods on the OK-VQA testing set. The compared methods are split into three groups based on their knowledge resources and usages. ∗: accuracy is evaluated on OK-VQA v1.0. †: method needs to query GPT-3 during training.
method | accuracy
methods with external knowledge bases
Mucko [10] | 29.2∗
ConceptBERT [56] | 33.7∗
KRISP [11] | 38.9
Visual Retriever-Reader [42] | 39.2
MAVEx [12] | 40.3
TRiG [13] | 49.4
UnifER [59] | 42.1
methods with multimodal pretraining
Unified-IO (2.8B) [57] | 54.0
Flamingo (80B) [2] | 57.8
PALI (17B) [58] | 64.5
methods with GPT-3 API
PICa [15] | 48.0
KAT† [16] | 53.1
REVIVE† [17] | 56.6
PromptCap (OFA)† [43] | 60.4
Prophet (MCAN) | 61.1
Prophet (mPLUG) | 62.5

As a supplement to the above results, we calculate the distribution of the four situations of the predictions from stage-1 and stage-2 in Table 2. From the results, we can see that: (i) Prophet maintains the majority of the correct predictions by MCAN and only 4.2% of samples are overturned; (ii) the improvement of Prophet is mainly due to the fact that the proportion of wrong-to-correct samples (12.4%) is larger than that of the correct-to-wrong samples (4.2%); (iii) there is still a considerable amount of samples (29.4%) for which both MCAN and Prophet fail to give the correct answer, which leaves sufficient room for future improvement.

Different LLMs. In Table 3, we investigate the effects of different LLMs by replacing the default GPT-3 (text-davinci-002) with the latest commercial and open-source models. From the results, we have the following observations: (i) the capability of the default GPT-3 model significantly outperforms all the compared LLMs, including its accelerated variant (3.5-turbo-instruct) with 0.075× running cost; (ii) for LLMs of the same class but different sizes (e.g., the 7B and 13B LLaMA-1 models [20]), the large-size ones show better performance than the small-size ones at the expense of near-linearly increasing running time; (iii) the chat-oriented variants like LLaMA-2-Chat [53], which are additionally trained by instruction tuning and human feedback [55], deliver inferior performance to their non-chat counterparts. This can be explained by the alignment tax introduced when aligning the model with human behaviors; (iv) with only 7B model parameters, the latest LLM Mistral [54] reports near GPT-3-level performance, revealing the potential of open-source LLMs in the near future.

4.4 Main Results

For the comparisons below, we use all the default settings except the number of examples N. We set N=20 for OK-VQA and A-OKVQA and respectively set N=7 and N=16 for ScienceQA and TextVQA, as they need extra hint and OCR tokens. By instantiating Prophet with two VQA models, we obtain Prophet (MCAN) and Prophet (mPLUG).

Comparative results on OK-VQA. Table 4 contains the comparisons of our Prophet and existing state-of-the-art methods on OK-VQA. The table is split into three sections. The first section lists the retrieval-based methods leveraging external KBs [10], [11], [12], [13], [42], [56]. The second section contains the methods that are directly pretrained on a large-scale multimodal corpus [2], [57], [58]. The last section shows the methods that incorporate the large language model GPT-3, which is publicly available via an online API [15], [16], [17], [43]. Our Prophet belongs to the last section. It outperforms all the compared methods by a distinct margin. Prophet is 13.1 points higher than PICa [15] when both methods use GPT-3 as the only knowledge resource. This confirms our hypothesis that the capacity of GPT-3 has not been fully activated in previous studies. Compared to KAT [16] and REVIVE [17], which utilize GPT-3 and other external KBs together in sophisticated systems, our Prophet is much simpler and more effective. Moreover, KAT, REVIVE, and PromptCap need to use GPT-3 to process all the training samples for their model training, which significantly increases the costs. In contrast, our Prophet only uses GPT-3 at inference time, which is more economical. Compared to the Flamingo-80B equipped with 32 in-context examples [2], Prophet (MCAN) delivers a significant performance improvement.
A SUBMISSION TO IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 10

DA MC method accuracy method accuracy


method
val test val test MCAN [18] 51.2 LoRRA [24] 27.6
ClipCap [8] 30.9 25.9 56.9 51.4 GPT-3 [1] 65.7 M4C [65] 40.5
ViLBERT [8] 30.6 25.9 49.1 41.5 Chameleon [61] 77.6 PromptCap [43] 51.9
LXMERT [8] 30.7 25.9 51.4 41.6 InstructBLIP [62] 79.5 TAG [66] 53.7
KRISP [8] 33.7 27.1 51.9 42.2 LLaMA-Adapter [63] 80.3 TAP [48] 54.0
GPV-2 [8] 48.6 40.7 60.3 53.7 MM-CoT [64] 82.9 Flamingo-80B [2] 54.1
Unified-IO [57] - 45.2 - - Human Average [23] 87.5 mPLUG-doc [67] 57.6
PromptCap (OFA) [43] 56.3 59.6 73.2 73.1 LLaVa [60] 88.0 LaTr [49] 59.6
Prophet (MCAN) 58.2 55.7 76.4 73.6 Prophet (mPLUG) 88.2 Prophet (mPLUG) 61.3
Prophet (mPLUG) 64.7 58.5 76.6 75.1
(a) ScienceQA (IMG) (b) TextVQA
TABLE 5: Comparisons to the state-of-the-art methods on TABLE 6: Comparisons to the state-of-the-art methods on
A-OKVQA. DA and MC refer to the direct-answer and the testing set of ScienceQA and TextVQA, respectively.
multiple-choice tasks, respectively. For the MC task, we
devise a Prophet variant with a slightly different prompt.
performance improvement. Despite the fact that Prophet (MCAN) has a clear performance gap compared to PALI-17B [58], Prophet is more resource-efficient from the perspective of reproducibility⁶. Finally, by replacing MCAN with the pretrained generative model mPLUG, our method exhibits a 1.4-point further improvement, showing the substantial contribution of a powerful VQA model to Prophet.

Comparative results on A-OKVQA. Table 5 contains the comparative results on the challenging A-OKVQA dataset. The results on the DA task show that the Prophet (MCAN) model significantly outperforms most existing approaches, reflecting the effectiveness and generalization ability of our method. Compared to the current state-of-the-art method PromptCap [43], which also combines a pretrained VQA model (OFA [37]) with GPT-3, Prophet (MCAN) exhibits similar performance while using a weaker VQA model.

For the MC task, we introduce a Prophet variant that slightly modifies the prompt used in the original Prophet. In particular, we add the multiple-choice information to both the in-context examples and the testing input to instruct GPT-3 to choose the correct answer from the four choices. Prophet (MCAN) surpasses all the counterparts on the MC task, showing the flexibility and scalability of Prophet. Moreover, Prophet (mPLUG) steadily outperforms Prophet (MCAN), again emphasizing the significance of a powerful VQA model to Prophet.

Results on ScienceQA and TextVQA. To verify the generalization ability of Prophet, we conduct experiments on two additional knowledge-based VQA datasets, ScienceQA (IMG) and TextVQA, which require different types of knowledge (i.e., scientific knowledge and OCR knowledge) than OK-VQA and A-OKVQA. Table 6 shows the comparative results of Prophet and existing state-of-the-art methods on the respective datasets. As we have witnessed steady improvements of mPLUG over MCAN, we only report the results for Prophet (mPLUG) on these two datasets. Specifically, Prophet surpasses all the counterparts on ScienceQA (IMG), including the average human performance [23] and the latest LLaVa model trained with visual instruction tuning [60]. On TextVQA, Prophet outperforms the published state-of-the-art methods, including those with text-aware or layout-aware pretraining on large-scale scene-text image datasets [48], [49].

Fig. 5: We show two typical samples consisting of the testing inputs (left) and their in-context examples (right). The predicted answers of Prophet have a high probability of appearing in the answer candidates and answer-aware examples, showing the effectiveness of answer heuristics in enhancing the LLM's ability to predict the correct answer.

4.5 Qualitative Analysis

In Fig. 5, we illustrate two typical samples consisting of the testing inputs and their in-context examples to explain how the answer heuristics work. The results show that the synergy of the answer candidates and the answer-aware examples facilitates the generation of high-quality answers. In the first sample, the candidate answer 'lace', despite its low confidence score, is finally selected by the LLM as it frequently appears in the in-context examples. In the second sample, we see that Prophet can make a correct prediction beyond the answer candidates when the proper answer heuristic (the word 'leash') is provided in the in-context examples.

6. Flamingo-80B is trained on 1,536 TPUv4 chips for 15 days and PALI is trained on 1,024 TPUv4 chips for 7 days, costs that are unaffordable for most researchers. In contrast, Prophet (MCAN) uses one RTX 3090 to train a VQA model for 4 days plus a certain number of GPT-3 invocations.

5 BROADER IMPACT

From a multimodal LLM (MLLM) point of view, Prophet is a loosely-coupled MLLM consisting of a vision-language (VL) model and a frozen LLM, aiming to endow the VL model with knowledge reasoning ability.

Compared with tightly-coupled MLLMs (e.g., Flamingo [2] and LLaVa [60]), which jointly optimize the VL model and the LLM in an end-to-end manner, Prophet is more flexible in that it can support any open-source or commercial LLM.

Moreover, Prophet can also be regarded as a learning-to-prompt paradigm that learns an external model to generate prompts that help the pretrained LLM (or MLLM) better comprehend the target task, thus facilitating its capability. From this point of view, recent studies like VoxPoser [68] and SoM-Prompting [69] share a similar idea with our work. We believe this paradigm can be widely applied to a variety of LLM-related tasks.
6 CONCLUSION

In this paper, we present Prophet—a conceptually simple framework which uses LLMs as the knowledge engine for knowledge-based VQA. To better activate the few-shot learning capacity of LLMs, we introduce a novel paradigm to prompt LLMs with two types of complementary answer heuristics. Extensive ablations, comparative experiments, and comprehensive analyses on four diverse knowledge-based VQA datasets show the superiority of Prophet over all existing state-of-the-art methods. Notably, Prophet can be instantiated with varied combinations of a wide range of VQA models and LLMs, showing its flexibility, scalability, and generalizability. We hope that our work can inspire future research on knowledge-based VQA and universal multimodal learning in the era of LLMs.

APPENDIX A
MORE IMPLEMENTATION DETAILS

A.1 The Default VQA Model

Our default VQA model is carefully designed in terms of model architecture and training strategy. The following table shows the improvements (OK-VQA accuracy) of our default MCAN model over the counterparts trained from scratch. More details are provided next.

from scratch, original model [18]  from scratch, improved model  transfer learning, improved model
31.5  35.6  53.0

Improved model architecture. We introduce an improved variant of MCAN [18] based on its open-sourced MCAN-large implementation. Our modifications to the model architecture include: (i) we replace the original bottom-up-attention features with the grid-based features extracted from CLIP's visual encoder with an RN50×64 backbone [51]; (ii) we introduce the RoPE mechanism [70] into each image self-attention layer of MCAN to supplement the grid-based features with positional information; and (iii) we replace the original LSTM network with a pre-trained BERT-large model [44] as the text encoder before MCAN. Table 7 shows the accuracies of different model variants on the testing set of OK-VQA. By progressively adding these modifications to the original MCAN model, our improved MCAN model reports a 53.0% accuracy, which is on par with current state-of-the-art methods like KAT [16].

case  OK-VQA accuracy
original MCAN  43.6
+ CLIP visual feats  49.6
+ RoPE mechanism  50.3
+ BERT as the text encoder  53.0

TABLE 7: Ablations for model architectures. '+' denotes that each modification is applied to the previous variant.
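To make modification (ii) above concrete, the following is a minimal PyTorch sketch of rotary position embeddings applied to the queries and keys of an image self-attention layer. It uses a simplified 1-D indexing over the flattened grid; the function names and this indexing scheme are our own illustrative assumptions rather than the released implementation.

```python
import torch

def apply_rope(x, base=10000):
    # x: (batch, heads, num_grids, head_dim); head_dim must be even.
    b, h, n, d = x.shape
    half = d // 2
    # Position-dependent rotation frequencies, as in RoPE [70].
    freq = base ** (-torch.arange(half, dtype=torch.float32, device=x.device) / half)
    pos = torch.arange(n, dtype=torch.float32, device=x.device)
    angle = torch.outer(pos, freq)                 # (num_grids, half)
    sin, cos = angle.sin(), angle.cos()            # broadcast over batch and heads
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate every (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def image_self_attention(q, k, v):
    # q, k, v: (batch, heads, num_grids, head_dim) projected from CLIP grid features.
    q, k = apply_rope(q), apply_rope(k)            # positions enter only via the rotation
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return scores.softmax(dim=-1) @ v
```

Because positions enter only through the rotation of the queries and keys, the attention logits depend on relative offsets between grid locations, which supplements the otherwise order-agnostic grid features; a 2-D variant over the H×W grid can be obtained by rotating separate halves of the head dimension with row and column indices.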
Training recipe. We first pretrain the model on the augmented train+val+vg dataset from VQAv2 [47] and Visual Genome [52], excluding the samples whose images appear in the testing split of OK-VQA to avoid data contamination. The settings for the pretraining stage are identical to the original implementation of MCAN. After that, the model is finetuned on the downstream OK-VQA and A-OKVQA datasets, respectively. For finetuning, the commonly used strategy is to replace the last linear layer (i.e., the classification layer) with a new layer adapted to the answer vocabulary of the downstream dataset. However, the answer vocabularies of the pretraining and finetuning datasets partially overlap. To maximally utilize the pretrained parameters in the last layer, we inherit the parameters of existing answers and append new parameters for the new answers. After that, we freeze all the pretrained parameters and only update the new parameters for one epoch as a warm-up, and then train all model parameters for the remaining training epochs.

training strategy  OK-VQA accuracy
(a) train from scratch  35.6
(b) pretrain, w/o finetune  41.1
(c) w/ finetune, replace last layer  47.7
(d) w/ finetune, append new answers  53.0

TABLE 8: Ablations for training strategies. All variants use the improved model architecture in the last row of Table 7.

Table 8 shows the effects of different training strategies. Even without finetuning, the pretrained model (b) is superior to the model trained from scratch (a), implying the importance of pretraining. Moreover, our new finetuning strategy (d) leads to significantly better performance than the commonly used strategy (c), showing the effectiveness of inheriting model parameters for existing answers.
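The answer-vocabulary transfer behind strategy (d) can be sketched as follows; the function and variable names are our own, and using the default random initialization for unseen answers is an assumption of this sketch.

```python
import torch
import torch.nn as nn

def transfer_classifier(old_head: nn.Linear, old_vocab, new_vocab):
    """Build the finetuning classification layer over new_vocab, copying the
    weight/bias rows of old_head for answers already present in old_vocab and
    leaving the rows of unseen answers randomly initialized."""
    new_head = nn.Linear(old_head.in_features, len(new_vocab))
    old_index = {ans: i for i, ans in enumerate(old_vocab)}
    new_rows = []  # rows of answers that do not exist in the pretraining vocabulary
    with torch.no_grad():
        for j, ans in enumerate(new_vocab):
            if ans in old_index:
                new_head.weight[j].copy_(old_head.weight[old_index[ans]])
                new_head.bias[j].copy_(old_head.bias[old_index[ans]])
            else:
                new_rows.append(j)
    return new_head, new_rows

def warmup_grad_mask(new_head: nn.Linear, new_rows):
    """During the one-epoch warm-up, the backbone is frozen via requires_grad_(False)
    and the inherited rows are kept fixed by zeroing their gradients after backward(),
    so that only the newly appended rows are updated."""
    keep = torch.zeros(new_head.out_features, dtype=torch.bool, device=new_head.weight.device)
    keep[new_rows] = True
    new_head.weight.grad[~keep] = 0.0
    new_head.bias.grad[~keep] = 0.0
```

After the warm-up epoch, all parameters are unfrozen and trained jointly for the remaining epochs, as described above.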
A.2 Prompt Formats

We show an exemplar prompt for the standard Prophet in Table 9 and an exemplar prompt for the variant designed for the MC task of A-OKVQA in Table 10. The exemplar prompts for ScienceQA and TextVQA are illustrated in Tables 11 and 12, respectively.

APPENDIX B
MORE QUALITATIVE AND QUANTITATIVE ANALYSES

We provide more in-depth analyses of Prophet's performance on the testing set of OK-VQA. All results are obtained using the default settings.

We show the per-category accuracies of MCAN (stage-1) and Prophet (stage-2) in Table 13. Prophet outperforms MCAN on all categories, indicating the generality of the knowledge in GPT-3. The improvement on the "Science and Technology" category is not as large as on the other categories, which can be explained by the fact that the required knowledge for this category is more specialized and professional. These questions are also challenging for humans.
Please answer the question according to the context and the answer candidates. Each answer candidate is associated with a
confidence score within a bracket. The true answer may not be included in the candidates.
===
Context: The motorcycle racers are getting ready for a race.
===
Question: What sport are these guys doing?
===
Candidates: motorcross(0.94), motocross(0.79), bike(0.35), dirt bike(0.28), motorcycle(0.03),
bmx(0.03), cycling(0.02), motorbike(0.02), race(0.02), bicycle(0.02)
===
Answer: motorcross
===
Context: a black motorcycle parked in a parking lot.
===
Question: What sport can you use this for?
===
Candidates: race(0.53), motorcycle(0.41), motocross(0.19), bike(0.17), motorcross(0.15),
cycling(0.11), dirt bike(0.10), ride(0.08), bicycling(0.01), bicycle(0.01)
===
Answer:

TABLE 9: An exemplar prompt for the standard Prophet. We show one in-context example here due to space limitations.
Following the implementations in PICa [15] and KAT [16], we use the special symbol '===' to separate adjacent lines.
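To make this format concrete, below is a minimal sketch of how a testing input and its in-context examples could be serialized into such a prompt. The function and argument names are our own assumptions; the instruction string and the optional OCR and Choices lines should follow Tables 9-12 verbatim.

```python
def format_entry(context, question, candidates, answer=None, choices=None, ocr=None):
    """Serialize one in-context example (answer given) or the testing input
    (answer omitted) in the line-per-field format of Tables 9-12."""
    lines = [f"Context: {context}"]
    if ocr is not None:                 # TextVQA variant (Table 12)
        lines.append("OCR: " + ", ".join(ocr) + ".")
    lines.append(f"Question: {question}")
    lines.append("Candidates: " + ", ".join(f"{a}({p:.2f})" for a, p in candidates))
    if choices is not None:             # multiple-choice variant (Table 10)
        lines.append("Choices: " + ", ".join(f"({c}) {t}" for c, t in choices))
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n===\n".join(lines)

def build_prompt(instruction, examples, test_input):
    # In-context examples carry their answers; the testing input ends with a bare "Answer:".
    blocks = [format_entry(**e) for e in examples] + [format_entry(**test_input)]
    return instruction + "\n===\n" + "\n===\n".join(blocks)
```

The same builder covers the variants in Tables 10-12 by passing the optional choices or ocr fields.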

Please choose the correct answer in the choices according to the context, the question and the answer candidates. Each answer
candidate is associated with a confidence score within a bracket. The true answer may not be included in the candidates.
===
Context: A young man riding a skateboard on a sidewalk.
===
Question: What part of his body will be most harmed by the item in his mouth?
===
Candidates: skateboard(0.02), nothing(0.02), table(0.01), leg(0.01), helmet(0.00), knees(0.00),
skateboarding(0.00), head(0.00), teeth(0.00), falling(0.00)
===
Choices: (A) back, (B) lungs, (C) feet, (D) eyes
===
Answer: (B)
===
Context: a young boy kneeling on a skateboard on the street.
===
Question: What did this lad likely injure here?
===
Candidates: skateboard(0.18), shoes(0.02), shoe(0.02), skateboarding(0.01), street(0.01),
flowers(0.01), skating(0.01), boy(0.01), head(0.00), skateboarder(0.00)
===
Choices: (A) knee, (B) elbow, (C) rear, (D) board
===
Answer:

TABLE 10: An exemplar prompt for the Prophet variant on the MC task of A-OKVQA. Compared to the standard prompt
in Table 9, we add one extra line of choices for the example and testing input, and change the output format to adapt to
the multiple-choice task. All the differences are marked in red.
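Assuming the format_entry/build_prompt sketch given after Table 9, the only changes for this variant are the extra choices field and the letter-style answers of the in-context examples. A hypothetical call for the testing input above could look like this, where mc_instruction and mc_examples stand for the instruction text and the formatted examples:

```python
test_input = dict(
    context="a young boy kneeling on a skateboard on the street.",
    question="What did this lad likely injure here?",
    candidates=[("skateboard", 0.18), ("shoes", 0.02), ("shoe", 0.02),
                ("skateboarding", 0.01), ("street", 0.01)],
    choices=[("A", "knee"), ("B", "elbow"), ("C", "rear"), ("D", "board")],
)
# The LLM is expected to complete the trailing "Answer:" line with one of "(A)"-"(D)".
prompt = build_prompt(mc_instruction, mc_examples, test_input)
```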

We perform human studies to analyze the causes of wrong predictions in Table 14. For each category, we randomly sample 10% of the testing samples for which Prophet fails to give the correct answer, resulting in 172 samples. We ask three annotators to categorize each sample into one of the following four failure causes: (a) insufficient visual understanding; (b) incorrect knowledge reasoning; (c) correct but differently expressed answer; (d) others (e.g., the failure is caused by the ambiguity of the question). From the results, we can see that the cause of "(b) incorrect knowledge reasoning" accounts for the highest proportion, which suggests that the bottleneck of Prophet still lies in knowledge acquisition and reasoning. The cause of "(a) insufficient visual understanding" has the second highest proportion, showing the potential of devising more powerful VQA models. The cause of "(c) correct but differently expressed answer" also accounts for a considerable proportion, which reflects the limitations of the annotations and the evaluation metric of OK-VQA.

Figure 6 demonstrates some testing samples from different knowledge categories. In the first to third columns, we show the correctly answered samples with different prediction behaviors (i.e., keep top-1, in top 2-K, and beyond top-K). The visualized results indicate that Prophet can adaptively choose suitable answers from the candidates. In the last column, we show some failure samples, implying that there is still room for future improvement.
Please choose the correct answer in the choices according to the context, the question and the answer candidates. Each answer
candidate is associated with a confidence score within a bracket. The true answer may not be included in the candidates.
===
Context: A picture of a black and white model of a molecule. The model below represents graphite. Graphite is used to make
pencil lead.
===
Question: Complete the statement. Graphite is ().
===
Candidates: an elementary substance(1.00), a compound(0.02), an adult substance(0.01), an an elementary substance(0.01)
===
Choices: (A) a compound, (B) an elementary substance
===
Answer: (B)
===
Context: A pair of eye glasses with the word h on them. The model below represents a molecule of hydrogen. Hydrogen gas was
once used to make large airships, such as blimps, float. It is no longer used in airships because it catches fire easily.
===
Question: Complete the statement. Hydrogen is ().
===
Candidates: a compound(0.68), an elementary substance(0.32), the same substance(0.00), the same amount(0.00)
===
Choices: (A) an elementary substance, (B) a compound
===
Answer:

TABLE 11: An exemplar prompt for the Prophet variant on ScienceQA (IMG). The sentences marked in red are the
optional text hints provided by the dataset.

Please answer the question according to the context and the answer candidates. Each answer candidate is associated with a
confidence score within a bracket. The true answer may not be included in the candidates.
===
Context: A close up of a cell phone with a keyboard.
===
OCR: Market, 3, Facebook, Browser, 5, 4, 6, 1, 8, 30.
===
Question: How many apps are on this page excluding market?
===
Candidates: 6(0.20), 5(0.19), 8(0.18), 9(0.12), 7(0.08), answering does(0.05), 10(0.05), 13(0.05), 12(0.04), 4(0.04)
===
Answer: 7
===
Context: A screenshot of a yahoo mail page.
===
OCR: Free, Page, Nake WT My Page, ADVERTISEMENT, YAHOO!, FREE Camera Phone, Notepad, MAIL, Yaboo! Mail.
===
Question: What is free on this page?
===
Candidates: amera(0.40), video camera(0.29), video(0.13), photos(0.04), video call(0.04), webcam(0.03), videos(0.03),
photography(0.01), photoshop(0.01), internet explorer(0.01)
===
Answer:

TABLE 12: An exemplar prompt for the Prophet variant on TextVQA. Compared to the standard prompt, we additionally
introduce the OCR tokens (marked in red) extracted from an off-the-shelf OCR system.

category  MCAN  Prophet
Plants and Animals  52.58  63.67
Science and Technology  48.10  48.81
Sports and Recreation  59.08  66.00
Geography, History, Language and Culture  52.48  62.98
Brands, Companies and Products  51.98  54.77
Vehicles and Transportation  50.82  58.01
Cooking and Food  55.53  62.09
Weather and Climate  65.12  68.37
People and Everyday life  49.44  54.67
Objects, Material and Clothing  50.05  57.20

TABLE 13: Per-category accuracies of MCAN (stage-1) and Prophet (stage-2). The performance improvements of using GPT-3 are observed on all categories.

failure cause  proportion
(a) insufficient visual understanding  27.3%
(b) incorrect knowledge reasoning  44.1%
(c) correct but differently expressed answer  22.8%
(d) others  5.8%

TABLE 14: The distribution of failure causes by human studies.

Fig. 6: Different categories and prediction behaviors. Each row contains four testing samples from a specific knowledge category. The first to the third columns correspond to the correctly answered samples of different prediction behaviors (i.e., keep top-1, in top 2-K, and beyond top-K). The last column contains failure samples.

REFERENCES

[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” in NeurIPS, 2020, pp. 1877–1901.
[2] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” in NeurIPS, 2022.
[3] L. Yuan, D. Chen, Y.-L. Chen, N. Codella, X. Dai, J. Gao, H. Hu, X. Huang, B. Li, C. Li et al., “Florence: A new foundation model for computer vision,” arXiv preprint arXiv:2111.11432, 2021.
[4] W. Wang, H. Bao, L. Dong, and F. Wei, “Vlmo: Unified vision-language pre-training with mixture-of-modality-experts,” in NeurIPS, 2021.
[5] P. Wang, Q. Wu, C. Shen, A. Dick, and A. Van Den Hengel, “Fvqa: Fact-based visual question answering,” IEEE TPAMI, vol. 40, no. 10, pp. 2413–2427, 2017.
[6] P. Wang, Q. Wu, C. Shen, A. R. Dick, and A. van den Hengel, “Explicit knowledge-based reasoning for visual question answering,” in IJCAI, 2017.
[7] K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi, “Ok-vqa: A visual question answering benchmark requiring external knowledge,” in CVPR, 2019, pp. 3195–3204.
[8] D. Schwenk, A. Khandelwal, C. Clark, K. Marino, and R. Mottaghi, “A-okvqa: A benchmark for visual question answering using world knowledge,” in ECCV. Springer, 2022, pp. 146–162.
[9] H. Liu and P. Singh, “Conceptnet: a practical commonsense reasoning tool-kit,” BT technology journal, vol. 22, no. 4, pp. 211–226, 2004.
[10] Z. Zhu, J. Yu, Y. Wang, Y. Sun, Y. Hu, and Q. Wu, “Mucko: Multi-layer cross-modal knowledge reasoning for fact-based visual question answering,” in IJCAI, 2020, pp. 1097–1103.
[11] K. Marino, X. Chen, D. Parikh, A. Gupta, and M. Rohrbach, “Krisp: Integrating implicit and symbolic knowledge for open-domain knowledge-based vqa,” in CVPR, 2021, pp. 14 111–14 121.
[12] J. Wu, J. Lu, A. Sabharwal, and R. Mottaghi, “Multi-modal answer validation for knowledge-based vqa,” in AAAI, 2022, pp. 2712–2721.
[13] F. Gao, Q. Ping, G. Thattai, A. Reganti, Y. N. Wu, and P. Natarajan, “Transform-retrieve-generate: Natural language-centric outside-knowledge visual question answering,” in CVPR, 2022, pp. 5067–5077.
[14] Y. Ding, J. Yu, B. Liu, Y. Hu, M. Cui, and Q. Wu, “Mukea: Multimodal knowledge extraction and accumulation for knowledge-based visual question answering,” in CVPR, 2022, pp. 5089–5098.
[15] Z. Yang, Z. Gan, J. Wang, X. Hu, Y. Lu, Z. Liu, and L. Wang, “An empirical study of gpt-3 for few-shot knowledge-based vqa,” in AAAI, 2022, pp. 3081–3089.
[16] L. Gui, B. Wang, Q. Huang, A. Hauptmann, Y. Bisk, and J. Gao, “Kat: A knowledge augmented transformer for vision-and-language,” NAACL, 2021.
[17] Y. Lin, Y. Xie, D. Chen, Y. Xu, C. Zhu, and L. Yuan, “REVIVE: Regional visual representation matters in knowledge-based visual question answering,” in NeurIPS, 2022.
[18] Z. Yu, J. Yu, Y. Cui, D. Tao, and Q. Tian, “Deep modular co-attention networks for visual question answering,” in CVPR, 2019, pp. 6281–6290.
[19] C. Li, H. Xu, J. Tian, W. Wang, M. Yan, B. Bi, J. Ye, H. Chen, G. Xu, Z. Cao et al., “mplug: Effective and efficient vision-language learning by cross-modal skip-connections,” in EMNLP, 2022, pp. 7241–7259.
[20] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[21] G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, and J. Launay, “The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only,” arXiv preprint arXiv:2306.01116, 2023.
[22] Z. Shao, Z. Yu, M. Wang, and J. Yu, “Prompting large language models with answer heuristics for knowledge-based visual question answering,” in CVPR, 2023, pp. 14 974–14 983.
[23] P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan, “Learn to explain: Multimodal reasoning via thought chains for science question answering,” in NeurIPS, 2022, pp. 2507–2521.
[24] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach, “Towards vqa models that can read,” in CVPR, 2019, pp. 8317–8326.
[25] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in CVPR, 2018, pp. 6077–6086.
[26] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, and J. Gao, “Vinvl: Revisiting visual representations in vision-language models,” in CVPR, 2021, pp. 5579–5588.
[27] S. Shen, L. H. Li, H. Tan, M. Bansal, A. Rohrbach, K.-W. Chang, Z. Yao, and K. Keutzer, “How much can clip benefit vision-and-language tasks?” ICLR, 2022.
[28] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, “Learning to reason: End-to-end module networks for visual question answering,” in ICCV, 2017, pp. 804–813.
[29] J.-H. Kim, J. Jun, and B.-T. Zhang, “Bilinear attention networks,” NeurIPS, vol. 31, 2018.
[30] L. Li, Z. Gan, Y. Cheng, and J. Liu, “Relation-aware graph attention network for visual question answering,” in ICCV, 2019, pp. 10 313–10 322.
[31] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” ICML, 2022.
[32] J. Lu, D. Batra, D. Parikh, and S. Lee, “Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,” in NeurIPS, 2019.
[33] H. Tan and M. Bansal, “Lxmert: Learning cross-modality encoder representations from transformers,” EMNLP, 2019.
[34] Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu, “Uniter: Universal image-text representation learning,” in ECCV, 2020, pp. 104–120.
[35] Y. Cui, Z. Yu, C. Wang, Z. Zhao, J. Zhang, M. Wang, and J. Yu, “Rosita: Enhancing vision-and-language semantic alignments via cross-and intra-modal knowledge integration,” in ACM MM, 2021, pp. 797–806.
[36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” NeurIPS, vol. 30, 2017.
[37] P. Wang, A. Yang, R. Men, J. Lin, S. Bai, Z. Li, J. Ma, C. Zhou, J. Zhou, and H. Yang, “OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework,” in ICML, 2022, pp. 21 218–23 340.
[38] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, “Coca: Contrastive captioners are image-text foundation models,” TMLR, 2022.
[39] J. Johnson, B. Hariharan, L. Van Der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick, “Clevr: A diagnostic dataset for compositional language and elementary visual reasoning,” in CVPR, 2017, pp. 2901–2910.
[40] D. A. Hudson and C. D. Manning, “Gqa: A new dataset for real-world visual reasoning and compositional question answering,” in CVPR, 2019, pp. 6700–6709.
[41] D. Vrandečić and M. Krötzsch, “Wikidata: A free collaborative knowledgebase,” Communications of the ACM, vol. 57, no. 10, pp. 78–85, 2014.
[42] M. Luo, Y. Zeng, P. Banerjee, and C. Baral, “Weakly-supervised visual-retriever-reader for knowledge-based question answering,” EMNLP, pp. 6417–6431, 2021.
[43] Y. Hu, H. Hua, Z. Yang, W. Shi, N. A. Smith, and J. Luo, “Promptcap: Prompt-guided image captioning for vqa with gpt-3,” in ICCV, 2023, pp. 2963–2975.
[44] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in NAACL, 2019, pp. 4171–4186.
[45] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in NeurIPS, 2014.
[46] J. Yu, J. Li, Z. Yu, and Q. Huang, “Multimodal transformer with multi-view visual representation for image captioning,” IEEE Trans. Circuits Syst. Video Technol., vol. 30, no. 12, pp. 4467–4480, 2020.
[47] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, “Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering,” in CVPR, 2017.
[48] Z. Yang, Y. Lu, J. Wang, X. Yin, D. Florencio, L. Wang, C. Zhang, L. Zhang, and J. Luo, “Tap: Text-aware pre-training for text-vqa and text-caption,” in CVPR, 2021, pp. 8751–8761.

[49] A. F. Biten, R. Litman, Y. Xie, S. Appalaraju, and R. Manmatha,


“Latr: Layout-aware transformer for scene-text vqa,” in CVPR,
2022, pp. 16 548–16 558.
[50] A. F. Biten, R. Tito, A. Mafla, L. Gomez, M. Rusinol, E. Valveny,
C. Jawahar, and D. Karatzas, “Scene text visual question answer-
ing,” in ICCV, 2019, pp. 4291–4301.
[51] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal,
G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transfer-
able visual models from natural language supervision,” in ICML,
2021, pp. 8748–8763.
[52] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz,
S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., “Visual
genome: Connecting language and vision using crowdsourced
dense image annotations,” IJCV, vol. 123, no. 1, pp. 32–73, 2017.
[53] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei,
N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama
2: Open foundation and fine-tuned chat models,” arXiv preprint
arXiv:2307.09288, 2023.
[54] M. AI, “Mistral 7b,” arXiv preprint arXiv:2310.06825, 2023.
[55] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin,
C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language
models to follow instructions with human feedback,” in NeurIPS,
2022, pp. 27 730–27 744.
[56] F. Gardères, M. Ziaeefard, B. Abeloos, and F. Lecue, “Conceptbert:
Concept-aware representation for visual question answering,” in
EMNLP, 2020, pp. 489–498.
[57] J. Lu, C. Clark, R. Zellers, R. Mottaghi, and A. Kembhavi, “Unified-
io: A unified model for vision, language, and multi-modal tasks,”
arXiv preprint arXiv:2206.08916, 2022.
[58] X. Chen, X. Wang, S. Changpinyo, A. Piergiovanni, P. Padlewski,
D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer et al., “Pali:
A jointly-scaled multilingual language-image model,” in ICLR,
2023.
[59] Y. Guo, L. Nie, Y. Wong, Y. Liu, Z. Cheng, and M. Kankanhalli,
“A unified end-to-end retriever-reader framework for knowledge-
based vqa,” in ACM MM, 2022, pp. 2061–2069.
[60] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,”
arXiv preprint arXiv:2304.08485, 2023.
[61] P. Lu, B. Peng, H. Cheng, M. Galley, K.-W. Chang, Y. N. Wu,
S.-C. Zhu, and J. Gao, “Chameleon: Plug-and-play composi-
tional reasoning with large language models,” arXiv preprint
arXiv:2304.09842, 2023.
[62] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li,
P. Fung, and S. Hoi, “Instructblip: Towards general-purpose
vision-language models with instruction tuning,” 2023.
[63] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, P. Gao, and
Y. Qiao, “Llama-adapter: Efficient fine-tuning of language models
with zero-init attention,” arXiv preprint arXiv:2303.16199, 2023.
[64] Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola,
“Multimodal chain-of-thought reasoning in language models,”
arXiv preprint arXiv:2302.00923, 2023.
[65] R. Hu, A. Singh, T. Darrell, and M. Rohrbach, “Iterative answer
prediction with pointer-augmented multimodal transformers for
textvqa,” in CVPR, 2020.
[66] J. Wang, M. Gao, Y. Hu, R. R. Selvaraju, C. Ramaiah, R. Xu, J. F.
JaJa, and L. S. Davis, “Tag: Boosting text-vqa via text-aware visual
question-answer generation,” 2022.
[67] J. Ye, A. Hu, H. Xu, Q. Ye, M. Yan, Y. Dan, C. Zhao, G. Xu,
C. Li, J. Tian et al., “mplug-docowl: Modularized multimodal
large language model for document understanding,” arXiv preprint
arXiv:2307.02499, 2023.
[68] W. Huang, C. Wang, R. Zhang, Y. Li, J. Wu, and L. Fei-Fei,
“Voxposer: Composable 3d value maps for robotic manipulation
with language models,” arXiv preprint arXiv:2307.05973, 2023.
[69] J. Yang, H. Zhang, F. Li, X. Zou, C. Li, and J. Gao, “Set-of-mark
prompting unleashes extraordinary visual grounding in gpt-4v,”
arXiv preprint arXiv:2310.11441, 2023.
[70] J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu, “Roformer: Enhanced
transformer with rotary position embedding,” arXiv preprint
arXiv:2104.09864, 2021.
