Prophet: Prompting LLMs with Complementary Answer Heuristics for Knowledge-Based VQA
Abstract—Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question.
Early studies retrieve required knowledge from explicit knowledge bases (KBs), which often introduces irrelevant information to the
question, hence restricting the performance of their models. Recent works have resorted to using a powerful large language model
(LLM) as an implicit knowledge engine to acquire the necessary knowledge for answering. Despite the encouraging results achieved by
these methods, we argue that they have not fully activated the capacity of the blind LLM as the provided textual input is insufficient to
depict the required visual information to answer the question. In this paper, we present Prophet—a conceptually simple, flexible, and
general framework designed to prompt LLM with answer heuristics for knowledge-based VQA. Specifically, we first train a vanilla VQA
model on a specific knowledge-based VQA dataset without external knowledge. After that, we extract two types of complementary
answer heuristics from the VQA model: answer candidates and answer-aware examples. Finally, the two types of answer heuristics are
jointly encoded into a formatted prompt to facilitate the LLM’s understanding of both the image and question, thus generating a more
accurate answer. By incorporating the state-of-the-art LLM GPT-3 [1], Prophet significantly outperforms existing state-of-the-art
methods on four challenging knowledge-based VQA datasets. To demonstrate the generality of our approach, we instantiate Prophet
with the combinations of different VQA models (i.e., both discriminative and generative ones) and different LLMs (i.e., both commercial
and open-source ones).
Index Terms—Visual Question Answering (VQA), large language models (LLMs), knowledge-based VQA, multimodal learning.
1 INTRODUCTION
Fig. 2: Our Prophet framework has two stages: answer heuristics generation and heuristics-enhanced prompting. In the
answer heuristics generation stage, a vanilla VQA model trained on a specific knowledge-based VQA dataset is employed
to generate two types of complementary answer heuristics, i.e., answer candidates and answer-aware examples. In the
heuristics-enhanced prompting stage, the answer heuristics, question, and caption are integrated into a formatted prompt
to instruct a frozen LLM (e.g., GPT-3) to predict an answer. As shown in the example, both answer heuristics contribute to
the answer of “helium”.
select top-N nearest neighbors in the latent space as the answer-aware examples:

$$\mathcal{I}_{AE} = \mathop{\mathrm{argTopN}}_{i \in \{1,2,...,M\}} \frac{z^\top z_i}{\|z\|_2 \, \|z_i\|_2} \qquad (6)$$

where $\mathcal{I}_{AE}$ is an index set of the top-N similar samples in $D$. The answer-aware examples $E$ are defined as follows:

$$E = \{(v_i, q_i, a_i) \mid i \in \mathcal{I}_{AE}\} \qquad (7)$$

Note that the fused features of the training inputs can be computed and stored beforehand, allowing efficient answer-aware example selection.
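To make the selection in Eqs. (6)-(7) concrete, here is a minimal sketch (not the released implementation) that picks the top-N answer-aware examples by cosine similarity between the fused feature of the testing input and the precomputed fused features of the training set; the tensor and argument names are our own placeholders.

```python
import torch

def select_answer_aware_examples(z, train_feats, train_samples, n=16):
    """Eqs. (6)-(7): return the N training samples whose fused features are
    most similar (cosine similarity) to the fused feature z of the testing input.

    z            : (d,)   fused feature of the testing input
    train_feats  : (M, d) precomputed fused features of all M training inputs
    train_samples: list of M (image, question, answer) triplets
    """
    z = z / z.norm(p=2)                                      # normalize ||z||_2
    feats = train_feats / train_feats.norm(p=2, dim=1, keepdim=True)
    sims = feats @ z                                         # (M,) cosine similarities
    top_idx = sims.topk(n).indices                           # index set I_AE
    return [train_samples[i] for i in top_idx.tolist()]      # answer-aware examples E
```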
3.2.2 Generative VQA models

Recent state-of-the-art VQA models tend to use generative model architectures due to their remarkable scalability and generalizability [19], [31], [37].

Given the same VQA training dataset $D = \{(v_i, q_i, a_i)\}_{i=1}^{M}$ as above, a generative VQA model $M_{gen}$ is learned from $D$ to generate answers word by word from a pre-defined word vocabulary $V = \{w_j\}_{j=1}^{S}$, where $S$ is the word vocabulary size. Each answer can be represented as a text sequence with a dynamic length of $L$ words:

$$w = (w_1, w_2, ..., w_L) \qquad (8)$$

where $w_1$ = [BOS] refers to a special start-of-sentence token and $w_L$ = [EOS] refers to an end-of-sentence token.

Similar to the discriminative model, $M_{gen}$ can also be separated into a backbone $M^B_{gen}$ and a prediction head $M^H_{gen}$. The backbone $M^B_{gen}$ corresponds to an encoder-decoder or a pure decoder architecture that fuses the multimodal inputs $v$ and $q$ and then generates the latent feature of each answer word in an autoregressive manner:

$$z^l = M^B_{gen}(v, q, w^{[l-1]}) \qquad (9)$$

where $z^l$ denotes the latent feature of the $l$-th answer word. On top of the latent feature $z^l$, the prediction head $M^H_{gen}$ applies a linear projection (or an MLP) followed by a softmax function to decode it into a score distribution $y^l \in \mathbb{R}^S$ over the whole word vocabulary:

$$y^l = M^H_{gen}(z^l) \qquad (10)$$

where the $l$-th answer word $w^l$ is obtained from $y^l$ by greedily choosing the word with the highest score. Until an [EOS] token is generated, $w^l$ is appended to $w^{[l-1]}$ to obtain $w^{[l]}$, which is iteratively fed into the model $M_{gen}$ to predict the next word.
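The greedy decoding loop of Eqs. (9)-(10) can be sketched as below. Here `backbone` and `head` stand for $M^B_{gen}$ and $M^H_{gen}$, and the token ids and maximum length are illustrative assumptions rather than the exact interface of the generative models used in the paper.

```python
import torch

@torch.no_grad()
def greedy_decode(backbone, head, v, q, bos_id, eos_id, max_len=20):
    """Eqs. (9)-(10): autoregressively append the highest-scoring word until [EOS]."""
    words = [bos_id]                          # w^[0] contains only the [BOS] token
    for _ in range(max_len):
        z_l = backbone(v, q, words)           # latent feature of the next word, Eq. (9)
        y_l = head(z_l).softmax(dim=-1)       # distribution over the vocabulary, Eq. (10)
        w_l = int(y_l.argmax())               # greedy choice of the l-th word
        words.append(w_l)
        if w_l == eos_id:                     # stop once [EOS] is generated
            break
    return words
```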
Answer candidates. Given a testing input $(v, q)$, we can obtain its most relevant answer using the greedy decoding strategy above. However, how to obtain the answer candidates consisting of the top-K answers and their confidence scores is not straightforward. We resort to the beam search algorithm, which is widely used in neural machine translation [45] and visual captioning [46], to address this issue. Similar to Eq. (5), we denote the top-K answer candidates as a set of tuples as follows:

$$C = \{(w_1, s_1), (w_2, s_2), ..., (w_K, s_K)\} \qquad (11)$$

where each $w_j$ represents an answer consisting of a sequence of answer words and $s_j \in \mathbb{R}^+$ denotes its corresponding confidence score calculated over all the answer words. The answer candidate set $C$ is obtained from the generative model $M_{gen}$ equipped with the beam search strategy. Specifically, we initialize each answer $w_j$ with the same [BOS] token. At each decoding step $l$, each $w_j$ of length $l$ is first passed through $M_{gen}$ to obtain its top-K candidate words with the highest scores. After that, an expand-then-reduce strategy is performed to update the K answers: (i) expand step: each $w_j$ is expanded K times to combine with the K candidate words, resulting in $K * K$ new candidate answers of length $l+1$; (ii) reduce step: among the $K * K$ candidate answers, only the top-K ones with the highest accumulated scores $s = \sum_{i=1}^{l} \log y^i$ are retained, which are then regarded as the inputs to the next decoding step.
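The expand-then-reduce procedure can be sketched as follows. It assumes a `log_probs(v, q, words)` callable (our placeholder) that returns the log-scores of the next word, and it returns the K retained answers with their accumulated scores $s = \sum_l \log y^l$; exponentiating (or normalizing) these scores yields the confidence values used in Eq. (11).

```python
import torch

@torch.no_grad()
def beam_search_candidates(log_probs, v, q, bos_id, eos_id, k=10, max_len=20):
    """Expand-then-reduce beam search: keep the K partial answers with the
    highest accumulated log-scores at every decoding step."""
    beams = [([bos_id], 0.0)]                        # (answer word ids, accumulated score)
    for _ in range(max_len):
        expanded = []
        for words, score in beams:
            if words[-1] == eos_id:                  # finished answers are carried over
                expanded.append((words, score))
                continue
            lp = log_probs(v, q, words)              # (S,) log-scores of the next word
            top = torch.topk(lp, k)                  # expand step: K continuations each
            for w, s in zip(top.indices.tolist(), top.values.tolist()):
                expanded.append((words + [w], score + s))
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:k]   # reduce step
        if all(words[-1] == eos_id for words, _ in beams):
            break
    return beams                                     # K candidate answers with scores
```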
Answer-aware examples. Similar to the example selection strategy for discriminative models, the answer-aware examples for generative models are also obtained by performing kNN search in a latent answer space. It is worth noting that the granularity of the latent features is different for the two types of VQA models: each latent feature obtained from a discriminative VQA model refers to an answer entry in the answer vocabulary, while each latent feature obtained from a generative VQA model refers to an answer word.

Given a testing input $(v, q)$ and the $i$-th training input $(v_i, q_i)$, the latent features for their multi-word answers can be respectively represented as feature groups $Z = [z^1, z^2, ..., z^L] \in \mathbb{R}^{L \times d}$ and $Z_i = [z_i^1, z_i^2, ..., z_i^{L_i}] \in \mathbb{R}^{L_i \times d}$, where $d$ is the common dimensionality of the latent answer space, and $L$ and $L_i$ refer to the answer lengths of $Z$ and $Z_i$, respectively. We define a simple score function as follows to average the dot-product similarity of each pair of features $z^j \in Z$ and $z_i^k \in Z_i$:

$$\pi_i = \frac{1}{L * L_i} \sum_{j=1}^{L} \sum_{k=1}^{L_i} \frac{z^j \cdot z_i^k}{\|z^j\|_2 \, \|z_i^k\|_2} \qquad (12)$$

Using the score function above, we obtain the top-N nearest neighbors of the query input in the training set and then format them as the answer-aware examples $E$ as follows:

$$\mathcal{I}_{AE} = \mathop{\mathrm{argTopN}}_{i \in \{1,2,...,M\}} \pi_i, \qquad E = \{(v_i, q_i, a_i) \mid i \in \mathcal{I}_{AE}\} \qquad (13)$$

where $\mathcal{I}_{AE}$ is an index set of the top-N nearest neighbors in the training set $D$.
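Eq. (12) reduces to one matrix product on L2-normalized word features. The short sketch below is illustrative and assumes the per-word latent features have already been extracted and cached.

```python
import torch
import torch.nn.functional as F

def answer_similarity(Z, Z_i):
    """Eq. (12): average pairwise cosine similarity between the feature groups
    Z (L x d) of the testing input and Z_i (L_i x d) of the i-th training input."""
    sims = F.normalize(Z, dim=1) @ F.normalize(Z_i, dim=1).T   # (L, L_i) cosine matrix
    return sims.mean().item()                                  # 1/(L*L_i) * sum over all pairs
```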
3.3 Stage-2: Heuristics-enhanced Prompting

After obtaining the answer heuristics (i.e., answer candidates $C$ and answer-aware examples $E$) from stage-1, we encode them into a heuristics-enhanced prompt to facilitate the few-shot learning capacity of the LLM for knowledge-based VQA.

A prompt consists of a prompt head, a set of in-context examples, and a testing input. The prompt head describes the VQA task in natural language. We refer to the prompt head designed in PICa and supplement it with a new description of the answer candidates. Although we encourage the
LLM to generate answers according to the answer candidates, we also allow it to take broad explorations and generate answers beyond the candidates. The complete format of our prompt head is shown in the yellow box of Fig. 2.

Our in-context examples are derived from the obtained N answer-aware examples $E = \{e_1, e_2, ..., e_N\}$. Based on PICa's template in §3.1, for each example $e_i$ we introduce its answer candidates $C_i$ by adding one line to the template as follows:

Context: c_i \n Question: q_i \n
Candidates: w_{j1}(s_{j1}), w_{j2}(s_{j2}), ..., w_{jK}(s_{jK}) \n
Answer: a_i

where $j_1, j_2, \cdots, j_K$ correspond to the actual indices of the elements in $C_i$. Each answer candidate $w_{j_k}$ is paired with its confidence score $s_{j_k}$ within a bracket. The confidence scores additionally indicate the reliability of the corresponding answer candidates, which helps the LLM focus more on the promising candidates and be more tolerant of the less relevant ones. For the testing input, its template is similar to that for the in-context examples, except that the answer slot is left blank for the LLM to fill in.
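The template above can be rendered with a few lines of string formatting. The sketch below is a simplified illustration: the field order follows this section, while details such as the exact prompt-head wording and the '===' line separator used in the appendix prompts (Table 9) are left out or replaced by placeholders of our own.

```python
def format_example(caption, question, candidates, answer=None):
    """Render one in-context example; pass answer=None for the testing input,
    which leaves the answer slot blank for the LLM to fill in.

    candidates: list of (answer, confidence) pairs produced by the stage-1 model.
    """
    cand_str = ", ".join(f"{w}({s:.2f})" for w, s in candidates)
    return (f"Context: {caption}\n"
            f"Question: {question}\n"
            f"Candidates: {cand_str}\n"
            f"Answer: {answer if answer is not None else ''}")

def build_prompt(head, examples, test_input):
    """Concatenate the prompt head, the N in-context examples, and the testing input."""
    blocks = [head] + [format_example(*e) for e in examples] + [format_example(*test_input)]
    return "\n\n".join(blocks)
```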
To better exploit the available examples, we use the multi-query ensemble strategy [15]. Specifically, we increase the number of answer-aware examples to N*T to obtain T paralleled prompts, where each prompt still contains N examples. By prompting the LLM T times, we obtain T answer predictions. Majority voting is performed over the T predictions to determine the final answer. The effects of different N and T will be verified in the experiments.
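A sketch of the multi-query ensemble is given below: the N*T retrieved examples are split into T prompts of N examples each, the LLM is queried once per prompt, and the most frequent prediction wins. Both `query_llm` and `build_prompt` are passed in as placeholders for whichever LLM API and prompt renderer are used (e.g., the sketch above).

```python
from collections import Counter

def multi_query_ensemble(query_llm, build_prompt, head, examples, test_input, n=16, t=5):
    """Split N*T answer-aware examples into T parallel prompts and majority-vote."""
    assert len(examples) >= n * t, "need at least N*T answer-aware examples"
    predictions = []
    for i in range(t):
        prompt = build_prompt(head, examples[i * n:(i + 1) * n], test_input)
        predictions.append(query_llm(prompt).strip().lower())   # one answer per query
    return Counter(predictions).most_common(1)[0][0]            # majority-voted final answer
```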
4 EXPERIMENTS

We mainly evaluate the performance of Prophet on two prevalent knowledge-based VQA datasets: OK-VQA [7] and A-OKVQA [8]. We conduct comprehensive ablation experiments to explore the effectiveness of Prophet. Taking the ablation results into account, we perform thorough comparisons of Prophet and state-of-the-art methods. Moreover, we showcase the generalization ability of Prophet on two diverse knowledge-based VQA datasets, ScienceQA [23] and TextVQA [24], which require external science and OCR knowledge, respectively.

4.1 Datasets

OK-VQA is a commonly used knowledge-based VQA dataset [7]. The dataset contains 9K and 5K image-question pairs for training and testing, respectively. All questions are manually filtered to ensure that outside knowledge is required to answer them. Each data sample is annotated with ten open-ended answers. The accuracy computed by the soft scores is used as the evaluation metric [47]. We use the 1.1 version of OK-VQA in the experiments.
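For reference, the soft-score accuracy of OK-VQA (inherited from VQAv2 [47]) credits a prediction according to how many of the ten human answers agree with it; the simplified form below omits the official answer normalization and the averaging over annotator subsets.

```python
def soft_vqa_accuracy(prediction, human_answers):
    """Simplified soft accuracy: min(#matching human answers / 3, 1)."""
    matches = sum(ans == prediction for ans in human_answers)   # ten annotated answers
    return min(matches / 3.0, 1.0)

# e.g. a prediction given by 2 of 10 annotators scores 2/3; by 3 or more it scores 1.0
```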
A-OKVQA is currently the largest knowledge-based VQA dataset [8]. The dataset is split into three subsets: 17K training, 1K validation, and 7K testing. Each question is annotated with ten open-ended answers for direct answer (DA) evaluation. Besides, it provides a multiple-choice (MC) evaluation that requires choosing the correct answer from four choices.

ScienceQA is a dataset that consists of about 21K questions over a diverse set of science topics [23]. Out of the 21K questions, only the 'IMG' subset of 10.3K (48.7%) samples have image content, which is used in our experiments. Consequently, the retained dataset consists of 6.2K training, 2.1K validation, and 2.0K testing samples. The questions require high-school-level science knowledge to arrive at the correct answer chosen from multiple choices.

TextVQA contains 28K images and 45K questions, where each question requires models to read and reason about the text in the image to give a correct answer [24]. The dataset is split into three subsets of 34.6K training, 5K validation, and 5.7K testing questions. Similar to OK-VQA, each question is annotated with ten open-ended answers by humans, and soft-voting accuracy is used as the evaluation metric. Following the strategy in [48], [49], we supplement the training set with the augmented VQA samples from ST-VQA [50].

4.2 Implementation Details

Default settings on OK-VQA. We use MCAN-large [18] as our default VQA model to generate answer heuristics. To improve the model capability, we modify the original MCAN model by: (i) replacing the original bottom-up-attention region-based features with the grid-based features extracted from CLIP's visual encoder with an RN50×64 backbone [51]; and (ii) replacing the original LSTM network with a pretrained BERT-large model [44].

Similar to [11], we apply the transfer learning paradigm to further enhance the model capability. The model is first pretrained on the VQAv2 dataset [47] and the Visual Genome dataset [52]. To prevent data contamination, we remove from the pretraining data those samples whose images are used in the testing split of OK-VQA. After that, the pretrained model is further finetuned on the training split of OK-VQA to obtain our final VQA model. Note that the answer vocabulary of the pretrained model (with 3,129 answers) is quite different from the vocabulary of OK-VQA. To bridge this gap, we merge the answer vocabulary of OK-VQA^2 with the existing vocabulary, resulting in an expanded answer vocabulary with 4,477 answers for model finetuning. This model is trained on a single Nvidia RTX 3090 GPU, which is affordable for most people.

2. Similar to [25], we collect answers that appear more than eight times in the training set of OK-VQA, resulting in 2,794 answers.

During the prompting stage using LLMs, we follow PICa and use OSCAR+ as the captioning model [26]. Unless otherwise noted, we set the number of answer candidates K=10, the number of in-context examples N=16, and the number of queries T=5 as our default settings. The default version of GPT-3 used in our experiments is text-davinci-002 and the sampling temperature is set to 0.

Settings on other datasets. The settings and strategies for OK-VQA can be directly transferred to A-OKVQA to address its DA task. For the MC task, we follow the strategy in [8] to project the predicted answer to the nearest answer choice. Moreover, we design a Prophet variant for the MC task. It uses a slightly different prompt that adds the multiple choices to the in-context examples and the testing input, and instructs the LLM to choose the correct one from the four choices.
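As a rough illustration of the MC-task projection described above, a predicted open-ended answer can be mapped to the most similar of the four choices with any string- or embedding-similarity measure; the sketch below uses Python's difflib purely as a stand-in, which may differ from the similarity function used in [8] and in our implementation.

```python
import difflib

def project_to_choice(prediction, choices):
    """Map a free-form predicted answer to the most similar multiple-choice option."""
    ratios = [difflib.SequenceMatcher(None, prediction.lower(), c.lower()).ratio()
              for c in choices]
    return choices[max(range(len(choices)), key=ratios.__getitem__)]

# project_to_choice("grilled cheese sandwich", ["hot dog", "grilled cheese", "taco", "salad"])
# -> "grilled cheese"
```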
TABLE 1: Ablation experiments for Prophet. All the reported results are evaluated on the testing set of OK-VQA v1.1. The best result in each table is bolded and the result with the default settings is marked in gray.

(a) Prompting vs. retrieval. Our prompting-based paradigm is more effective than the retrieval-based one in MAVEx [12]. †: our re-implementation.
VQA model, paradigm | stage-1 acc. | accuracy
ViLBERT, retrieval [12] | 35.20 | 40.28 (+5.08)
ViLBERT, prompt† | 35.28 | 44.97 (+9.69)

(b) Capability of VQA models. More powerful VQA models lead to higher accuracies, but obtain slightly less relative improvement from stage-2.
visual features | stage-1 acc. | accuracy
Bottom-Up [25] | 46.83 | 55.34 (+8.51)
VinVL [26] | 47.88 | 56.23 (+8.35)
CLIP-ViT-L/14 [51] | 52.03 | 60.12 (+8.09)
CLIP-RN50×64 [51] | 53.04 | 60.84 (+7.80)

(c) Answer candidates. They are crucial to Prophet, and increasing K leads to better performance.
#candidates (K) | hit rate | accuracy
0 | - | 49.63
1 | 53.04 | 56.04
5 | 75.20 | 60.17
10 | 79.83 | 60.84

(d) Example selection strategy. Our answer-aware example selection based on fused features is more effective than the others.
example selection | hit rate | accuracy
(a) rand | 5.31 | 58.66
(b) ques + img [15] | 59.58 | 59.82
(c) fused | 83.63 | 60.84
(d) fused + ques + img | 82.45 | 60.38
(e) answer logits | 79.25 | 60.40

(e) Numbers of examples and queries. Increasing N and T improves model performance at the expense of linearly increasing overheads.
#examples (N) | accuracy (T=1) | accuracy (T=5)
0 | 49.97 | 49.97
1 | 54.89 | 56.75
8 | 57.49 | 59.91
16 | 57.52 | 60.84
20 | 57.91 | 61.10

(f) Prompt contents. The default settings contain exactly the necessary information for prompting.
variants | accuracy
(a) default | 60.84
(b) w/o prompt head | 60.54
(c) w/o confidence scores | 55.46
(d) w/o image captions | 58.27
(e) default+tags [15] | 60.51
For ScienceQA, we reuse all the default settings for OK-VQA. If a training sample provides an extra textual hint, we simply append the text to the generated caption as the new context of the corresponding image. For TextVQA, we use the commercial system from Amazon to extract OCR from the images^3, whose effectiveness has been verified in previous work [49]. The extracted OCR texts are provided in both the in-context examples and the testing input to instruct the LLM.

3. https://aws.amazon.com/textract/

Settings of other VQA models. In addition to MCAN, we also experiment with one generative VQA model, mPLUG [19], which is first pretrained on a task-agnostic image-text corpus and then finetuned on the specific VQA dataset. Following the aforementioned two-stage transfer learning paradigm for MCAN, the pretrained mPLUG model is first finetuned on the VQAv2 dataset and then further finetuned on the specific knowledge-based VQA dataset.

4.3 Ablation Studies

We conduct ablation experiments for Prophet on OK-VQA using the default settings above. The results shown in Table 1 and Fig. 4 are discussed in detail below.

Prompting vs. retrieval. Prophet uses a prompting-based paradigm to predict the answer based on a set of promising answer candidates. In contrast, a previous work, MAVEx [12], exploits answer candidates but adopts a retrieval-based paradigm that searches knowledge from external KBs to determine the answer. As both Prophet and MAVEx train a VQA model to generate answer candidates (stage-1), we can compare the two paradigms (stage-2). In Table 1a, we show the performance of the two paradigms in terms of stage-1 accuracy and final accuracy, respectively. For a fair comparison, we re-implement the VQA model used in MAVEx, i.e., ViLBERT [32], to generate answer heuristics for our Prophet. From the results, we can see that, based on the same VQA model, our Prophet outperforms MAVEx by a large margin (44.97% vs. 40.28%), showing the superiority of our prompting-based paradigm over MAVEx's retrieval-based paradigm in external knowledge acquisition and integration.

Capability of VQA models. In Table 1b, we study how VQA models of different capabilities impact the performance of Prophet. To better control the model capability, we use the same MCAN model trained with four visual features: region-based Bottom-Up [25] and VinVL [26] features, and grid-based CLIP features from two backbones (ViT-L/14 and RN50×64) [51]. The results show that more powerful VQA models (reflected in the stage-1 accuracies) lead to better performance of Prophet, as they provide answer heuristics of higher quality. Combining the results in Table 1a, we also observe that more powerful VQA models achieve smaller relative improvements from GPT-3, which can be explained by the intrinsic diminishing-return property.

As a by-product, we verify that the visual features are important to the performance of knowledge-based VQA, which is consistent with the observations in [17]. The models with CLIP-based visual features significantly outperform those with region-based features, indicating that CLIP's visual features contain richer visual knowledge due to large-scale pretraining.

In addition to using different visual features for MCAN, we can also replace the whole MCAN model with any generative model pretrained on large-scale multimodal datasets, as mentioned in §3.1. These results will be reported in the main results.

Answer candidates. Table 1c varies the number of answer candidates K from 0 to 10 to explore its effect on Prophet. For each testing sample, if the ground-truth answer is hit by one of the K answer candidates, we accumulate the soft score of that ground-truth answer^4. The hit rate is calculated over the testing set by dividing the accumulated score by the number of samples.

4. In practice, multiple ground-truth answers are provided. If multiple answers are hit simultaneously, we choose the answer with the largest soft score for accumulation.
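To make the hit-rate computation above concrete, the sketch below accumulates, for every testing sample, the soft score of the best-hit ground-truth answer among the K candidates (footnote 4) and divides by the number of samples; the dictionary fields are illustrative, not the names used in our code.

```python
def candidate_hit_rate(samples, k=10):
    """samples: iterable of dicts with
         'candidates': list of answer strings from the stage-1 VQA model (top-K first)
         'gt_scores' : dict mapping each ground-truth answer to its soft score
    """
    total = 0.0
    for s in samples:
        hit_scores = [s["gt_scores"][a] for a in s["candidates"][:k] if a in s["gt_scores"]]
        if hit_scores:                     # footnote 4: keep the largest soft score
            total += max(hit_scores)
    return total / len(samples)            # hit rate over the testing set
```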
From the results, we can see that: (i) without any answer candidates, Prophet's accuracy drops by 6.4 points (K=0
The results show that the accuracy is positively correlated with the hit rate of the answers, which verifies our hypothesis that answer-aware examples contribute significantly to the performance of Prophet. Compared with the other strategies, our default strategy (c) achieves the best performance with the highest hit rate. Strategy (d), which integrates other information (ques + img) into (c), leads to worse performance due to the introduction of irrelevant and noisy information. Finally, strategy (e) reports slightly worse performance than (c). We conjecture that this is because the answer logits have lost too much information about the input question and image, which is also useful for GPT-3 to perform knowledge reasoning.

Numbers of examples and queries. Table 1e contains the ablation studies for the numbers of examples and queries. We choose different numbers of examples N ∈ {0, 1, 8, 16, 20} for each query and different numbers of queries T ∈ {1, 5}, respectively. The results show that the performance of Prophet improves with the increase of N and T, which is consistent with the results in PICa. By increasing T from 1 to 5, the entries with larger N enjoy greater performance improvements at the expense of linearly increasing overheads.

Interestingly, the Prophet variant with N=0 delivers worse performance than the VQA model in stage-1 (49.97% vs. 53.04%), even though answer candidates are provided. Meanwhile, when given one example (N=1), the Prophet variant distinctly surpasses the VQA model (56.75% vs. 53.04%). This suggests the necessity of few-shot in-context examples for GPT-3 to activate its capability to adapt to the knowledge-based VQA task.

Prompt contents. In Table 1f, we ablate the prompt contents in the default settings by: (b) removing the prompt head; (c) removing the confidence scores for answer candidates; (d) removing image captions; and (e) adding predicted tags from external models [15]. The results lead to the following observations: First, the confidence scores are of critical importance to the performance of our Prophet. This is because they carry the necessary information for GPT-3 to understand the answer candidates. Second, without image captions, Prophet still works steadily. This reflects the fact that our answer heuristics in the prompts already provide sufficient information for Prophet to solve the task. Third, the prompt head is of less importance, indicating that GPT-3 is capable of understanding the task directly from the in-context examples. Finally, introducing extra information like object tags leads to a slight performance drop, which is contrary to the results in PICa. We conjecture that this information has already been encoded in the answer heuristics implicitly.

TABLE 2: Prophet's combinatorial prediction behaviors in two stages. Prophet maintains the majority of correct predictions from stage-1, and the accuracy improvement by stage-2 is mainly because the number of wrong-to-correct samples is larger than that of correct-to-wrong samples.

Prediction behaviors in different stages. In Table 1b, we can observe a significant performance improvement of Prophet (stage-2) over its corresponding MCAN model (stage-1). To better understand this improvement, we conduct a statistical analysis of Prophet's prediction behaviors. As Prophet takes K answer candidates from MCAN as inputs, we define three prediction behaviors for Prophet: "keep top-1", "in top 2-K", and "beyond top-K". All the testing samples can be categorized into one of the three classes. The statistical results in Figure 4 show that: (i) for 68.1% of the testing samples (green slice), Prophet keeps the top-1 predictions of MCAN. These samples achieve a 69% accuracy and are mostly easy samples; (ii) for 21.8% of the testing samples (blue slice), Prophet selects answers from the top 2-K answer candidates. These samples are relatively hard, so MCAN delivers a 24% accuracy while Prophet has a much higher 40% accuracy; (iii) for the remaining 10.1% of the testing samples (yellow slice), Prophet predicts answers
TABLE 3: Ablation study of different LLMs. All variants use the default settings and are evaluated on the testing set of OK-VQA. The per-sample average costs of the open-source models are measured by the GPU running time on a server with A100 GPUs, while the costs of the commercial models are measured in dollars. † indicates that the LLM's max token length is insufficient for N=16 examples; for these LLMs, we reduce N to fit their maximum capacity.

LLM (version or size) | per-sample average cost | accuracy
commercial models
GPT-3 (text-davinci-002) | $0.2 | 60.8
GPT-3 (3.5-turbo-instruct) | $0.015 | 58.9
open-source models
† LLaMA-1 (7B) [20] | 2.6s | 51.8
† LLaMA-1 (13B) [20] | 4.6s | 56.1
† LLaMA-1 (30B) | 8.7s | 57.1
† LLaMA-1 (65B) [20] | 16.6s | 58.8
† Falcon (7B) [21] | 2.7s | 50.5
† Falcon (40B) [21] | 14.1s | 57.1
LLaMA-2 (7B) [53] | 2.7s | 56.6
LLaMA-2-Chat (7B) [53] | 2.7s | 54.0
LLaMA-2 (13B) [53] | 4.8s | 57.9
LLaMA-2-Chat (13B) [53] | 4.8s | 56.5
LLaMA-2 (70B) [53] | 18.3s | 59.6
Mistral (7B) [54] | 3.0s | 59.7

TABLE 4: Comparisons to the state-of-the-art methods on the OK-VQA testing set. The compared methods are split into three groups based on their knowledge resources and usages. *: accuracy is evaluated on OK-VQA v1.0. †: method needs to query GPT-3 during training.

method | accuracy
methods with external knowledge bases
Mucko [10] | 29.2*
ConceptBERT [56] | 33.7*
KRISP [11] | 38.9
Visual Retriever-Reader [42] | 39.2
MAVEx [12] | 40.3
TRiG [13] | 49.4
UnifER [59] | 42.1
methods with multimodal pretraining
Unified-IO (2.8B) [57] | 54.0
Flamingo (80B) [2] | 57.8
PALI (17B) [58] | 64.5
methods with GPT-3 API
PICa [15] | 48.0
KAT† [16] | 53.1
REVIVE† [17] | 56.6
PromptCap (OFA)† [43] | 60.4
Prophet (MCAN) | 61.1
Prophet (mPLUG) | 62.5
of reproducibility^6. Finally, by replacing MCAN with the pretrained generative model mPLUG, our method exhibits a further 1.4-point improvement, showing the substantial contribution of a powerful VQA model to Prophet.

6. Flamingo-80B is trained on 1,536 TPUv4 for 15 days and PALI is trained on 1,024 TPUv4 for 7 days, which are unaffordable for most researchers. In contrast, Prophet (MCAN) uses one RTX 3090 to train a VQA model for 4 days and a certain number of GPT-3 invocations.

5 BROADER IMPACT

From a multimodal LLM (MLLM) point of view, Prophet is a loosely-coupled MLLM consisting of a vision-language
(VL) model and a frozen LLM, aiming to endow the VL model with knowledge reasoning ability. Compared with tightly-coupled MLLMs (e.g., Flamingo [2] and LLaVA [60]), which jointly optimize the VL model and the LLM in an end-to-end manner, Prophet is more flexible in that it can support any open-source or commercial LLM.

Moreover, Prophet can also be regarded as a learning-to-prompt paradigm that learns an external model to generate prompts to better comprehend the target task, thus facilitating the capability of the pretrained LLM (or MLLM). From this point of view, recent studies like VoxPoser [68] and SoM-Prompting [69] share a similar idea with our work. We believe this paradigm can be widely used in a variety of LLM-related tasks.

6 CONCLUSION

In this paper, we present Prophet—a conceptually simple framework which uses LLMs as the knowledge engine for knowledge-based VQA. To better activate the few-shot learning capacity of LLMs, we introduce a novel paradigm to prompt LLMs with two types of complementary answer heuristics. Extensive ablations, comparative experiments, and comprehensive analyses on four diverse knowledge-based VQA datasets show the superiority of Prophet over all existing state-of-the-art methods. Notably, Prophet can be instantiated with varied combinations of a wide range of VQA models and LLMs, showing its flexibility, scalability, and generalizability. We hope that our work can inspire future research on knowledge-based VQA and universal multimodal learning in the era of LLMs.

APPENDIX A
MORE IMPLEMENTATION DETAILS

A.1 The Default VQA Model

Our default VQA model is carefully designed in terms of model architecture and training strategy. In the following table, we show the improvements of our default MCAN model over the counterparts trained from scratch. More details are provided next.

from scratch, original model [18] | from scratch, improved model | transfer learning, improved model
31.5 | 35.6 | 53.0

Improved model architecture. We introduce an improved variant of MCAN [18] based on its open-sourced MCAN-large implementation. Our modifications to the model architecture include: (i) we replace the original bottom-up-attention features with the grid-based features extracted from CLIP's visual encoder with an RN50×64 backbone [51]; (ii) we introduce the RoPE mechanism [70] into each image self-attention layer of MCAN to supplement the grid-based features with positional information; and (iii) we replace the original LSTM network with a pretrained BERT-large model [44] as the text encoder before MCAN. Table 7 shows the accuracies of different model variants on the testing set of OK-VQA. By progressively adding the modifications to the original MCAN model, our improved MCAN model reports a 53.0% accuracy, which is on par with current state-of-the-art methods like KAT [16].

TABLE 7: Ablations for model architectures. '+' denotes that each modification is applied to the previous variant.
case | OK-VQA accuracy
original MCAN | 43.6
+ CLIP visual feats | 49.6
+ RoPE mechanism | 50.3
+ BERT as the text encoder | 53.0

TABLE 8: Ablations for training strategies. All variants use the improved model architecture in the last row of Table 7.
training strategy | OK-VQA accuracy
(a) train from scratch | 35.6
(b) pretrain, w/o finetune | 41.1
(c) w/ finetune, replace last layer | 47.7
(d) w/ finetune, append new answers | 53.0

Training recipe. We first pretrain the model on the augmented train+val+vg dataset from VQAv2 [47] and Visual Genome [52], excluding the samples whose images are used in the testing split of OK-VQA to avoid data contamination. The settings for the pretraining stage are identical to the original implementation of MCAN. After that, the model is finetuned on the downstream OK-VQA and A-OKVQA datasets, respectively. For finetuning, the commonly used strategy is to replace the last linear layer (i.e., the classification layer) with a new layer to adapt to the answer vocabulary of the downstream dataset. However, the answer vocabularies of the pretraining and finetuning datasets are partially overlapped. To maximally utilize the pretrained model parameters in the last layer, we inherit the parameters of existing answers and append new parameters for the new answers. After that, we freeze all the pretrained parameters and only update the new parameters for one epoch as a warm-up, and then train all model parameters for the rest of the training epochs.
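A minimal sketch of this vocabulary-expansion step, under assumed shapes and names: rows of the pretrained classification layer whose answers also appear in the merged vocabulary are copied over, while newly appended answers receive freshly initialized rows that are warmed up before full finetuning.

```python
import torch
import torch.nn as nn

def expand_classifier(old_head: nn.Linear, old_vocab, new_vocab):
    """Build a classification layer over the merged answer vocabulary, inheriting
    the pretrained parameters of the answers that already exist in old_vocab."""
    new_head = nn.Linear(old_head.in_features, len(new_vocab))
    old_index = {ans: i for i, ans in enumerate(old_vocab)}
    with torch.no_grad():
        for j, ans in enumerate(new_vocab):
            if ans in old_index:                        # inherit pretrained parameters
                i = old_index[ans]
                new_head.weight[j] = old_head.weight[i]
                new_head.bias[j] = old_head.bias[i]
    # Warm-up as described above: freeze every pretrained parameter and train only
    # the newly added rows for one epoch, then unfreeze and train all parameters.
    return new_head
```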
Table 8 shows the effects of different training strategies. Even without finetuning, the pretrained model (b) is superior to the model trained from scratch (a), implying the importance of pretraining. Moreover, our new finetuning strategy (d) leads to significantly better performance than the commonly used strategy (c), showing the effectiveness of inheriting the model parameters for existing answers.

A.2 Prompt Formats

We show an exemplar prompt for the standard Prophet in Table 9 and an exemplar prompt for the variant designed for the MC task of A-OKVQA in Table 10. The exemplar prompts for ScienceQA and TextVQA are illustrated in Tables 11 and 12, respectively.

APPENDIX B
MORE QUALITATIVE AND QUANTITATIVE ANALYSES

We provide more in-depth analyses of Prophet's performance on the testing set of OK-VQA. All results are obtained with the default settings.

We show the per-type accuracies of MCAN (stage-1) and Prophet (stage-2) in Table 13. Prophet outperforms MCAN on all categories, indicating the generality of the knowledge in GPT-3.
Please answer the question according to the context and the answer candidates. Each answer candidate is associated with a
confidence score within a bracket. The true answer may not be included in the candidates.
===
Context: The motorcycle racers are getting ready for a race.
===
Question: What sport are these guys doing?
===
Candidates: motorcross(0.94), motocross(0.79), bike(0.35), dirt bike(0.28), motorcycle(0.03),
bmx(0.03), cycling(0.02), motorbike(0.02), race(0.02), bicycle(0.02)
===
Answer: motorcross
===
Context: a black motorcycle parked in a parking lot.
===
Question: What sport can you use this for?
===
Candidates: race(0.53), motorcycle(0.41), motocross(0.19), bike(0.17), motorcross(0.15),
cycling(0.11), dirt bike(0.10), ride(0.08), bicycling(0.01), bicycle(0.01)
===
Answer:
TABLE 9: An exemplar prompt for the standard Prophet. We show one in-context example here due to space limitations.
Following the implementations in PICa [15] and KAT [16], we use a special symbol ‘===’ to separate each two lines.
Please choose the correct answer in the choices according to the context, the question and the answer candidates. Each answer
candidate is associated with a confidence score within a bracket. The true answer may not be included in the candidates.
===
Context: A young man riding a skateboard on a sidewalk.
===
Question: What part of his body will be most harmed by the item in his mouth?
===
Candidates: skateboard(0.02), nothing(0.02), table(0.01), leg(0.01), helmet(0.00), knees(0.00),
skateboarding(0.00), head(0.00), teeth(0.00), falling(0.00)
===
Choices: (A) back, (B) lungs, (C) feet, (D) eyes
===
Answer: (B)
===
Context: a young boy kneeling on a skateboard on the street.
===
Question: What did this lad likely injure here?
===
Candidates: skateboard(0.18), shoes(0.02), shoe(0.02), skateboarding(0.01), street(0.01),
flowers(0.01), skating(0.01), boy(0.01), head(0.00), skateboarder(0.00)
===
Choices: (A) knee, (B) elbow, (C) rear, (D) board
===
Answer:
TABLE 10: An exemplar prompt for the Prophet variant on the MC task of A-OKVQA. Compared to the standard prompt
in Table 9, we add one extra line of choices for the example and testing input, and change the output format to adapt to
the multiple-choice task. All the differences are marked in red.
The improvement on the "Science and Technology" category is not as large as on the rest of the categories, which can be explained by the fact that the required knowledge for this category is more specialized and professional. These questions are also challenging for humans.

We perform human studies to analyze the causes of wrong predictions in Table 14. For each category, we randomly sample 10% of the testing samples for which Prophet fails to give the correct answer, resulting in 172 samples. We ask three annotators to categorize each sample into one of the following four failure causes: (a) insufficient visual understanding; (b) incorrect knowledge reasoning; (c) correct but differently expressed answer; (d) others (e.g., the failure is caused by the ambiguity of the question). From the results, we can see that the cause "(b) incorrect knowledge reasoning" accounts for the highest proportion, which suggests that the bottleneck of Prophet still lies in knowledge acquisition and reasoning. The cause "(a) insufficient visual understanding" has the second highest proportion, showing the potential of devising more powerful VQA models. The cause "(c) correct but differently expressed answer" also accounts for a considerable proportion, which reflects the limitation of the annotations and the evaluation metric of OK-VQA.

Figure 6 demonstrates some testing samples from different knowledge categories. In the first to third columns, we show correctly answered samples with different prediction behaviors (i.e., keep top-1, in top 2-K, and beyond top-K). The visualized results indicate that Prophet can adaptively choose suitable answers from the candidates. In the last column, we show some failure samples, implying that there is still room for future improvement.
Please choose the correct answer in the choices according to the context, the question and the answer candidates. Each answer
candidate is associated with a confidence score within a bracket. The true answer may not be included in the candidates.
===
Context: A picture of a black and white model of a molecule. The model below represents graphite. Graphite is used to make
pencil lead.
===
Question: Complete the statement. Graphite is ().
===
Candidates: an elementary substance(1.00), a compound(0.02), an adult substance(0.01), an an elementary substance(0.01)
===
Choices: (A) a compound, (B) an elementary substance
===
Answer: (B)
===
Context: A pair of eye glasses with the word h on them. The model below represents a molecule of hydrogen. Hydrogen gas was
once used to make large airships, such as blimps, float. It is no longer used in airships because it catches fire easily.
===
Question: Complete the statement. Hydrogen is ().
===
Candidates: a compound(0.68), an elementary substance(0.32), the same substance(0.00), the same amount(0.00)
===
Choices: (A) an elementary substance, (B) a compound
===
Answer:
TABLE 11: An exemplar prompt for the Prophet variant on ScienceQA (IMG). The sentences marked in red are the
optional text hints provided by the dataset.
Please answer the question according to the context and the answer candidates. Each answer candidate is associated with a
confidence score within a bracket. The true answer may not be included in the candidates.
===
Context: A close up of a cell phone with a keyboard.
===
OCR: Market, 3, Facebook, Browser, 5, 4, 6, 1, 8, 30.
===
Question: How many apps are on this page excluding market?
===
Candidates: 6(0.20), 5(0.19), 8(0.18), 9(0.12), 7(0.08), answering does(0.05),10(0.05),13(0.05),12(0.04),4(0.04)
===
Answer: 7
===
Context: A screenshot of a yahoo mail page.
===
OCR: Free, Page, Nake WT My Page, ADVERTISEMENT, YAHOO!, FREE Camera Phone, Notepad, MAIL, Yaboo! Mail.
===
Question: What is free on this page?
===
Candidates: amera(0.40), video camera(0.29), video(0.13), photos(0.04), video call(0.04), webcam(0.03), videos(0.03),
photography(0.01), photoshop(0.01), internet explorer(0.01)
===
Answer:
TABLE 12: An exemplar prompt for the Prophet variant on TextVQA. Compared to the standard prompt, we additionally
introduce the OCR tokens (marked in red) extracted from an off-the-shelf OCR system.
Fig. 6: Different categories and prediction behaviors. Each row contains four testing samples from a specific knowledge
category. The first to the third columns correspond to the correctly answered samples of different prediction behaviors (i.e.,
keep top-1, in top 2-K , and beyond top-K ). The last column contains failure samples.