
BIG DATA MINING AND ANALYTICS

ISSN 2096-0654 19/25 pp843−857


DOI: 10.26599/BDMA.2024.9020026
Volume 7, Number 3, September 2024

Prompting Large Language Models with Knowledge-Injection for


Knowledge-Based Visual Question Answering

Zhongjian Hu, Peng Yang*, Fengyuan Liu, Yuan Meng, and Xingyu Liu

Abstract: Previous works employ Large Language Models (LLMs) such as GPT-3 for knowledge-based Visual Question Answering (VQA). We argue that the inferential capacity of an LLM can be enhanced through knowledge injection. Although methods that utilize knowledge graphs to enhance LLMs have been explored in various tasks, they may have some limitations, such as the possibility of not being able to retrieve the required knowledge. In this paper, we introduce a novel framework for knowledge-based VQA titled "Prompting Large Language Models with Knowledge-Injection" (PLLMKI). We use a vanilla VQA model to inspire the LLM and further enhance the LLM with knowledge injection. Unlike earlier approaches, we adopt the LLM for knowledge enhancement instead of relying on knowledge graphs. Furthermore, we leverage open LLMs, incurring no additional costs. In comparison to existing baselines, our approach exhibits accuracy improvements of over 1.3 and 1.7 points on two knowledge-based VQA datasets, namely OK-VQA and A-OKVQA, respectively.

Key words: visual question answering; knowledge-based visual question answering; large language model;
knowledge injection

1 Introduction

Knowledge-based Visual Question Answering (VQA)[1] extends the VQA task[2], introducing the requirement for external knowledge to answer questions. Early knowledge-based VQA benchmarks also provide knowledge bases[3]. More recently, VQA benchmarks have been established that emphasize open-domain knowledge[4, 5]. In open-domain knowledge-based VQA, any external knowledge can be applied to answer questions. This paper focuses on VQA with open-domain knowledge.

Early researchers try to retrieve knowledge from external knowledge resources. More recently, some works attempt to explore the utilization of implicit knowledge in pre-trained language models, such as KRISP[6]. With the emergence of Large Language Models (LLMs), researchers have shifted towards employing them as knowledge acquisition engines. PICa[7] adopts GPT-3[8] for in-context learning in knowledge-based VQA. Given that GPT-3, as a language model, lacks direct comprehension of images, the approach involves conversion of the image into its corresponding textual caption through a captioning model. The VQA triplet image-question-answer will be converted to context-question-answer, thus unifying the input into text. The context denotes the caption for the image. Despite the encouraging results of PICa, we argue that the capacity of the LLM can be further improved through knowledge enhancement.

Zhongjian Hu, Peng Yang, and Yuan Meng are with the School of Computer Science and Engineering, Southeast University, and also with the Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education of the People's Republic of China, Nanjing 211189, China. E-mail: [email protected]; [email protected]; [email protected].
Fengyuan Liu and Xingyu Liu are with the Southeast University - Monash University Joint Graduate School (Suzhou), Southeast University, Suzhou 215125, China. E-mail: [email protected]; [email protected].
* To whom correspondence should be addressed.
Manuscript received: 2024-02-24; revised: 2024-04-01; accepted: 2024-04-07
© The author(s) 2024. The articles published in this open access journal are distributed under the terms of the
Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

Some studies have proposed to use knowledge graphs to enhance LLMs, such as RAG[9]. RAG retrieves documents from the knowledge graphs and incorporates them as supplementary contextual information. While knowledge graphs have the potential to enhance LLMs, they still have limitations, such as the possibility of not being able to retrieve the required knowledge.

Inspired by previous works, we propose a novel framework for knowledge-based VQA that enhances LLMs through knowledge injection. Figure 1 shows the comparison of PICa, RAG, and our method. In addition, we adopt open LLMs, which are free compared to GPT-3. Existing approaches, when applying the LLM to knowledge-based VQA, only utilise the knowledge of the LLM itself, ignoring the role of external knowledge to inspire the LLM. We propose a novel framework that injects knowledge into the prompt to further inspire the LLM. In addition, to address the issue that knowledge resources such as knowledge graphs may not be able to retrieve the required knowledge, we adopt a new idea of employing another LLM to generate the knowledge instead of retrieving the knowledge from the knowledge graphs.

Fig. 1 Comparison of PICa, RAG, and our method. PICa applies the in-context learning of GPT-3 to knowledge-based VQA. RAG uses knowledge retrieved from the knowledge graphs to enhance the LLM. Our approach first adopts a vanilla VQA model to generate in-context examples, then takes the LLM1 to generate background knowledge instead of retrieving knowledge from the knowledge graphs, and finally integrates the knowledge into the prompt to inspire LLM2.

Main contributions are as follows:
● We propose a novel framework for knowledge-based VQA that incorporates appropriate in-context examples and background knowledge to predict the answer. The framework is entirely built upon open LLMs and is free of cost.
● To our knowledge, this is the first attempt to utilize an LLM to enhance knowledge for another LLM in the knowledge-based VQA task.
● We conduct experiments on two knowledge-based VQA datasets, namely OK-VQA and A-OKVQA. Experiments show that our approach outperforms the existing baselines.

2 Related Work

The research of Artificial Intelligence (AI) is of great significance, which not only promotes the progress of science and technology, but also has a far-reaching impact on social and economic development, the improvement of human life, and the construction of the future world. More and more AI research is emerging, which has an important impact on promoting the development of AI. SpectralGPT[10] is a remote sensing foundation model designed for spectral data. It has significant potential for advancing spectral remote sensing big data applications in geoscience across four downstream tasks: single/multi-label scene classification, semantic segmentation, and change detection. Hong et al.[11] created the C2Seg dataset for multimodal remote sensing. The C2Seg dataset is intended for use in the cross-city semantic segmentation task. To improve the generalization ability of AI models in multi-city environments, they also proposed a high-resolution domain adaptation network, referred to as HighDAN. Hong et al.[12] proposed the augmented Linear Mixing Model (LMM), a novel spectral mixture model, to address spectral variability in inverse problems of hyperspectral unmixing. During the imaging process, data are often affected by various variabilities. Our proposed method may also experience some limitations when faced with various variabilities. For instance, if the image is damaged, information loss may occur during the conversion to a caption. This loss of captioning information can affect knowledge generation, as it is partly based on the captioning information.

VQA. VQA[13, 14] is a popular multi-modal AI task[15, 16]. Recent VQA studies can be generally divided into these categories: better visual features[17, 18],

better model architectures[19–21], and better learning paradigms[22–24]. According to the research methods, VQA can be divided into: joint embedding methods, attention methods, modular methods, external knowledge-based methods, and so on. Most of the joint embedding methods use a Convolutional Neural Network (CNN) to extract visual features and a Recurrent Neural Network (RNN) to extract text features, and simply combine the two features. The Neural-Image-QA model proposed by Malinowski et al.[25] is the first to leverage the joint embedding method. The model is based on CNN and Long Short-Term Memory (LSTM), treating the VQA task as a sequence-to-sequence task assisted by image information. Nevertheless, the majority of joint embedding methods commonly utilize all the features extracted from both images and questions as the input for the VQA model. This approach may introduce a considerable amount of noise, potentially affecting the performance. The objective of the attention method is to concentrate the limited attention on crucial elements, significantly enhancing the comprehension capability of the neural network. Yu et al.[26] introduced a multi-modal factorized bilinear pooling approach, where text attention is inferred based on the question, and visual attention is inferred by the involvement of text attention. However, the VQA task is compositional. For example, in a question like "What's on the table?", it is necessary to first determine the position of the table, then identify the location above the table, and finally ascertain the target object above the table. Hence, some studies have proposed modular networks for the VQA task. The modular approach involves designing distinct modules for various functions and connecting these modules based on different questions. Andreas et al.[27] first applied neural modular networks to VQA. Additionally, there exists a category of VQA that necessitates external knowledge, often referred to as knowledge-based VQA.

Knowledge-based VQA. Some benchmarks for knowledge-based VQA have been proposed, necessitating external knowledge to answer questions. Early works retrieve knowledge from external knowledge resources. More recently, Marino et al.[6] proposed KRISP to retrieve implicit knowledge stored in pre-trained language models. MAVEx[28] proposes a validation method aimed at improving the utilization of noisy knowledge. Yang et al.[7] proposed PICa, which applies in-context learning of GPT-3 to knowledge-based VQA, achieving encouraging results. PICa utilizes a captioning model to convert the image into the corresponding caption, which can be processed by the LLM. In-context learning, a powerful few-shot learning technique, enables reasoning with a few task examples assembled as the prompt, eliminating the need for parameter updates. Prophet[29] adopts a vanilla VQA model to inspire the LLM, further activating the capability of the LLM.

Knowledge-enhanced LLMs. LLMs have demonstrated promising results across various tasks. Researchers explore the use of knowledge graphs to enhance LLMs[30]. Knowledge graphs[31] offer a means to enhance LLMs by incorporating knowledge during pre-training, a process that extends to the inference stage as well. When integrating knowledge graphs into training objectives, ERNIE[32] adopts a method where both sentences and corresponding entities are input into LLMs. The training process involves instructing the LLMs to predict alignment links. On the other hand, ERNIE 3.0[33] represents a knowledge graph triple as tokens, concatenating them with sentences. RAG[9] employs a distinctive approach by initially searching and retrieving relevant documents from knowledge graphs. These documents are then provided to the language model as additional context information. Despite the benefits of knowledge graphs in enhancing LLMs, they may face challenges in retrieving the required knowledge. In this paper, a novel idea is proposed, suggesting the utilization of one LLM to enhance knowledge for another LLM, as an alternative to traditional knowledge graphs.

3 Methodology

The Prompting Large Language Models with Knowledge-Injection (PLLMKI) framework is illustrated in Fig. 2. The framework comprises three main components: (1) Utilizing a vanilla VQA model to obtain in-context examples, which are then processed by a captioning model to transform the image-question-answer into context-question-answer. (2) Employing the LLM1 to generate background knowledge and integrating the knowledge into the prompt, resulting in context-question-knowledge-answer. (3) Inputting the modified prompt into the LLM2 to predict the answer.

Fig. 2 Overview of the proposed framework.

3.1 Preliminary

Before introducing the PLLMKI framework, let us first demonstrate PICa. PICa leverages the in-context learning of GPT-3 for knowledge-based VQA. The in-context learning paradigm of GPT-3 demonstrates a capable learning ability. Specifically, the target y is predicted conditioned on the prompt (σ, ϵ, x), where σ is the prompt head, ϵ denotes the in-context examples, and x refers to the test input. At each decoding step l:

$y_l = \arg\max_{y_l} p_{\mathrm{LLM}}(y_l \mid \sigma, \epsilon, x, y_{<l})$  (1)

where the prompt head σ is the task description, and $\epsilon = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ denotes the n in-context examples. The $(x_i, y_i)$ denotes the input-target pair for the task.

PICa applies the in-context learning paradigm of GPT-3 to VQA. Since GPT-3 lacks the ability to comprehend image input, the image is first converted to a caption using the image captioning model. The original triplet from the VQA dataset, consisting of image-question-answer, is transformed into context-question-answer format. PICa defines the in-context example in the format:

Context: the caption.
Question: the question.
Answer: the answer.

The context-question-answer triplet corresponds to the image-question-answer from the training set. The test input is structured in the format:

Context: the caption.
Question: the question.
Answer:

The format of the test input mirrors that of the in-context example, with the exception that the answer is left blank, allowing the LLM to make predictions. PICa defines the prompt head as follows: Please answer the question according to the above context. The prompt head, in-context examples, and test input are assembled together to form the whole prompt. The prompt is then fed into the LLM, such as GPT-3, to predict the answer.
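To make the assembly concrete, the sketch below builds a PICa-style prompt string from caption-question-answer triplets, which is then decoded as in Eq. (1). The function name, the exact whitespace, and the example data (taken from the giraffe example shown in Fig. 2) are our own illustrative choices, not PICa's released code.

```python
from typing import List, Tuple

def build_pica_prompt(examples: List[Tuple[str, str, str]],
                      test_context: str,
                      test_question: str) -> str:
    """Assemble prompt head, in-context examples, and test input into one string.

    `examples` holds (context, question, answer) triplets; the test answer is
    left blank so that the LLM completes it, as in Eq. (1).
    """
    head = "Please answer the question according to the above context.\n\n"
    shots = "".join(
        f"Context: {c}\nQuestion: {q}\nAnswer: {a}\n\n" for c, q, a in examples
    )
    test = f"Context: {test_context}\nQuestion: {test_question}\nAnswer:"
    return head + shots + test

# Example usage: the returned string is fed to the LLM, which decodes the answer.
prompt = build_pica_prompt(
    [("Giraffes eating from feeders on trees, at a zoo.",
      "What continent are these animals native to?", "Africa")],
    "Two giraffes are eating from a tree branch.",
    "What part of Africa do these animals live?",
)
```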

The PLLMKI framework draws inspiration from previous works. We enhance the LLM through knowledge injection. Specifically, we generate background knowledge by prompting the LLM1, and then integrate the knowledge into the prompt to assist the LLM2 in making predictions. The in-context example is formatted as follows:

Context: the caption.
Question: the question.
Knowledge: the background knowledge.
Answer: the answer.

We format the test input as follows:

Context: the caption.
Question: the question.
Knowledge: the background knowledge.
Answer:

The test input follows a format similar to the in-context example, with the exception that the answer is left blank for the LLM to predict. Additionally, we modify the prompt head to enable the model to answer questions based on the context and knowledge. Notably, our approach differs from previous works as the knowledge is generated by prompting another LLM, rather than being derived from knowledge graphs.

3.2 In-context example selection

Existing works[7, 34] have shown the importance of in-context example selection. We denote the image-question-answer triplet as (v, q, a), where v denotes the image, q denotes the question, and a denotes the answer. The VQA dataset is denoted as D, and the model learned from D is denoted as M. We can obtain the fused feature F of the image and the question through the encoder of the model M:

$F = M(v, q)$  (2)

For each test input, the cosine similarity of the fused feature between the test input and each training sample is calculated. We then select the samples from the training set whose fused features are closest to that of the test input to form the in-context examples ϵ:

$\epsilon = \{(v_i, q_i, a_i)\}_{i=1}^{n}$  (3)

The image is converted to the context by the captioning model, and the image-question-answer corresponds to context-question-answer. Denote context-question-answer as (c, q, a), where c, q, and a refer to context, question, and answer, respectively.

$\epsilon = \{(c_i, q_i, a_i)\}_{i=1}^{n}$  (4)
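As a minimal sketch of this selection step, assuming the fused features F = M(v, q) have already been extracted by the vanilla VQA model's encoder and stored as NumPy arrays, the snippet below picks the n most similar training samples by cosine similarity; the function and variable names are illustrative and not taken from the authors' code.

```python
import numpy as np

def select_in_context_examples(test_feature: np.ndarray,
                               train_features: np.ndarray,
                               n: int = 8) -> np.ndarray:
    """Return indices of the n training samples whose fused features are most
    similar (by cosine similarity) to the test input's fused feature."""
    # Normalize so that dot products equal cosine similarities.
    test = test_feature / np.linalg.norm(test_feature)
    train = train_features / np.linalg.norm(train_features, axis=1, keepdims=True)
    sims = train @ test                  # cosine similarity to every training sample
    return np.argsort(-sims)[:n]         # indices of the n closest samples

# Usage sketch with placeholder features; in practice they come from F = M(v, q).
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(1000, 512))
test_feat = rng.normal(size=512)
example_ids = select_in_context_examples(test_feat, train_feats, n=8)
```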

3.3 Knowledge enhancement

Previous works[30] have demonstrated the benefits of enhancing LLMs with knowledge graphs. Unlike previous works, we use another LLM to inject knowledge instead of knowledge graphs. This idea is motivated by the belief that implicit knowledge embedded in LLMs may be more suitable for open-domain knowledge-based VQA compared to knowledge graphs, which may fail to retrieve the required knowledge.

We generate background knowledge by prompting LLM1 and subsequently integrate the knowledge into the prompt to aid the LLM2 in making predictions. The format of the prompt to generate background knowledge is as follows: {[# Prompt head] Please generate the background knowledge xxxxxx. [# In-context examples] Context: the caption. Question: the question. Knowledge: the background knowledge. [# Input] Context: the caption. Question: the question. Knowledge: }.

The prompt includes the prompt head H, the in-context examples E, and the test input T. The prompt head is a task description that allows the LLM to generate background knowledge for the corresponding context and question. The in-context example is composed of context-question-knowledge and can be denoted as (c, q, k).

$E = \{(c_i, q_i, k_i)\}_{i=1}^{n}$  (5)

The format of the test input resembles that of the in-context example, with the knowledge left blank for the LLM to generate.

$P = \mathrm{LLM}_1(H, E, T)$  (6)

where P denotes the knowledge generated by the LLM.

To explain how LLM1 can be used to generate background knowledge, we show a prompt example in Fig. 3. We pick out some in-context examples for LLM1 learning. For better visualization, we only show one in-context example.

Please generate background knowledge in English based on the key words in the context and question.
===
Context: A tennis player swinging a racket at a ball.
Question: What move is this tennis player currently using?
Knowledge: A tennis player is an individual who plays tennis, a popular global sport. They use a racket to hit a ball across a net in an attempt to outmaneuver their opponent. In tennis, swinging refers to the action of moving the racket to hit the ball. The way a player swings the racket can greatly affect the trajectory, speed, and spin of the ball. In tennis, the racket is the tool that players use to hit the ball. It consists of a handle and a circular frame with tightly interwoven strings. Rackets come in various sizes and materials to fit the individual player's style and level of play. In tennis, the ball is a hollow, spherical object that players hit back and forth across a net. It is designed to have specific bounce characteristics and is covered in a fibrous felt to alter its aerodynamic properties. In the context of tennis, a move typically refers to the type of shot or stroke a player uses. There are several different moves or strokes a player can use, such as a forehand, backhand, serve, volley, overhead smash, drop shot, or lob. Each of these moves has different tactical uses in a match and requires different body positions, racket angles, and swing paths.
===
Context: A woman in a coat and boots stops to check her smartphone.
Question: What brand of purse might she be carrying?
Knowledge:

Fig. 3 Prompt example. We show the prompt with one in-context example to explain how to prompt LLM1 to generate background knowledge. We collate several in-context examples for model learning to inspire LLM1 to generate background knowledge.
background knowledge. We collate several in-context
multimodal heterogeneous graph.
examples for model learning to inspire LLM1 to generate
background knowledge.
• ConceptBert[37] learns and captures image-
question-knowledge interactions from visual, language,
predictions based on context and background and knowledge graph embeddings.
knowledge, not just context. The in-context example E • KRISP[6] employs a multimodal Bidirectional
is also different from PICa. PICa is a triplet of context- Encoder Representations from Transformers (BERT) to
question-answer, while ours is a quadruple of context- process both the image and question, leveraging the
question-knowledge-answer. implicit knowledge in BERT.
• MAVEx[28] uses external knowledge for
E = {(ci , qi , ki , ai )|ni=1 } (7) multimodal answer validation. It validates the
promising answers according to answer-specific
The format of test input T is similar to that of the in-
knowledge retrieval.
context example, but the answer is left blank for the
• Visual-retriever-reader[38] is designed for
LLM to predict. The prompt ( H, E, T ) will be input knowledge-based VQA. The visual retriever initially
into the LLM. fetches relevant knowledge, and subsequently, the
visual reader predicts the answer based on the provided
P = LLM2 (H, E, T ) (8)
knowledge.
where P is the prediction made by the LLM. • TRiG[39] is a transform-retrieve-generate
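Continuing the sketch, the inference step of Eq. (8) can be written as follows; `llm2` again stands for an open-LLM completion callable, and the prompt-head wording is copied from the example prompts shown in the paper's figures, so treat the exact strings as indicative rather than definitive.

```python
from typing import Callable, List, Tuple

QuadExample = Tuple[str, str, str, str]  # (context, question, knowledge, answer)

def predict_answer(llm2: Callable[[str], str],
                   examples: List[QuadExample],
                   context: str,
                   question: str,
                   knowledge: str) -> str:
    """Assemble the knowledge-injected prompt (H, E, T) and let LLM2 predict the answer."""
    head = ("Please answer the question according to the context and knowledge. "
            "The knowledge is the background knowledge for the context and question.\n\n")
    shots = "".join(
        f"Context: {c}\nQuestion: {q}\nKnowledge: {k}\nAnswer: {a}\n\n"
        for c, q, k, a in examples
    )
    test = f"Context: {context}\nQuestion: {question}\nKnowledge: {knowledge}\nAnswer:"
    completion = llm2(head + shots + test).strip()
    # Take the first line of the completion as the predicted answer.
    return completion.splitlines()[0].strip() if completion else ""
```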

4 Experiment

4.1 Dataset

We adopt two knowledge-based VQA datasets for evaluation, namely OK-VQA[4] and A-OKVQA[5]. OK-VQA contains approximately 14 000 images and 14 000 questions, while A-OKVQA contains 24 000 images and 25 000 questions. OK-VQA encourages answering questions based on external knowledge, covering various knowledge categories such as sports, history, science, and more. A-OKVQA is also a knowledge-based VQA dataset that requires more knowledge categories than OK-VQA. For A-OKVQA, we adopt the direct answer on the validation set for evaluation.

For evaluation metrics, we employ common VQA metrics. Each question is associated with ten annotated answers, and a generated answer is considered 100% accurate if at least three human annotators provided that correct answer. The accuracy metric is defined as

$\mathrm{Accuracy} = \min\left(\frac{\text{Number of humans that provided that answer}}{3},\ 1\right)$
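As a worked example of this metric, the snippet below scores one generated answer against the ten annotations, assuming the answers have already been normalized to a comparable form.

```python
from typing import List

def vqa_accuracy(predicted: str, annotated_answers: List[str]) -> float:
    """Soft VQA accuracy: min(number of annotators giving this answer / 3, 1)."""
    matches = sum(1 for ans in annotated_answers if ans == predicted)
    return min(matches / 3.0, 1.0)

# Two of the ten annotators answered "savannah", so the score is 2/3.
print(vqa_accuracy("savannah", ["savannah", "savannah", "africa"] + ["plains"] * 7))
```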
4.2 Baseline and implementation

4.2.1 Baseline

We compare our approach to the following baselines:
• MUTAN[35] is a multimodal fusion realized by bilinear interaction. It is for modeling interactions between image and text.
• Mucko[36] focuses on multi-layer cross-modal knowledge inference. It represents the image as a multimodal heterogeneous graph.
• ConceptBert[37] learns and captures image-question-knowledge interactions from visual, language, and knowledge graph embeddings.
• KRISP[6] employs a multimodal Bidirectional Encoder Representations from Transformers (BERT) to process both the image and question, leveraging the implicit knowledge in BERT.
• MAVEx[28] uses external knowledge for multimodal answer validation. It validates the promising answers according to answer-specific knowledge retrieval.
• Visual-retriever-reader[38] is designed for knowledge-based VQA. The visual retriever initially fetches relevant knowledge, and subsequently, the visual reader predicts the answer based on the provided knowledge.
• TRiG[39] is a transform-retrieve-generate framework that can be used with image-to-text models and knowledge bases.
• UnifER[40] is a knowledge-based VQA framework based on the unified end-to-end retriever-reader.
• PICa[7] applies the in-context learning paradigm of GPT-3 to knowledge-based VQA. It utilizes a captioning model to convert the image into text.
• Pythia[41] is a bottom-up top-down framework. It is improved through modifications to the model structure and the data augmentation.
• ViLBERT[42] is a model for learning the joint representation of image and text. It extends the BERT architecture to support multi-modality.
• ClipCap[43] is a captioning method that employs pre-trained models for processing visual and text.
• LXMERT[44] constructs a large transformer model comprising three encoders: object relationship encoder, language encoder, and cross-modality encoder.
• GPV-2[45] is based on General Purpose Vision (GPV) and is designed to address a wide range of visual tasks without necessitating changes to the architecture.
• VLC-BERT[46] is a model designed to integrate common sense knowledge into the visual language BERT.
• Prophet[29] is proposed to inspire GPT-3 using a vanilla VQA model. We replace GPT-3 with LLaMA and keep the settings consistent for a fair comparison.

4.2.2 Implementation

For the captioning model, we follow PICa[7], which uses OSCAR+[18]. For the in-context example selection, we follow the previous works[29] and use the MCAN-large[21] model pre-trained on VQAv2[47] and Visual Genome[48]. We use LLaMA1[49] as the LLM for knowledge enhancement, because LLaMA1 is an excellent and open LLM with powerful capability. We use LLaMA2[50] as the LLM for inference. Compared to LLaMA1, LLaMA2 can support a longer context, which is conducive to injecting knowledge into the context. We use the LLaMA 7B version. Considering the length limit of the context, we set the number of in-context examples to 8 and set the length of the knowledge to no more than 256. We use the default settings unless otherwise specified.
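For quick reference, the settings reported above can be collected in a small configuration sketch; the key names are hypothetical shorthand of ours, while the values follow the text.

```python
# Hypothetical configuration summary of the reported settings (key names are ours).
PLLMKI_SETTINGS = {
    "captioning_model": "OSCAR+",         # follows PICa
    "example_selector": "MCAN-large",     # pre-trained on VQAv2 and Visual Genome
    "knowledge_llm": "LLaMA1-7B",         # LLM1: generates background knowledge
    "inference_llm": "LLaMA2-7B",         # LLM2: predicts the answer (longer context)
    "num_in_context_examples": 8,
    "max_knowledge_length": 256,          # cap on the injected knowledge length
}
```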
4.3 Experimental result

We report the results on OK-VQA and A-OKVQA. Tables 1 and 2 show the results.

Table 1 Results on OK-VQA.
Method                                         Accuracy (%)
MUTAN+AN (Ben-Younes et al.[35])               27.8
Mucko (Zhu et al.[36])                         29.2
ConceptBert (Gardères et al.[37])              33.7
KRISP (Marino et al.[6])                       38.9
MAVEx (Wu et al.[28])                          39.4
Visual-retriever-reader (Luo et al.[38])       39.2
VLC-BERT (Ravi et al.[46])                     43.1
TRiG (Gao et al.[39])                          49.4
UnifER (Guo et al.[40])                        42.1
PICa-Base (Yang et al.[7]) (Caption)           42.0
PICa-Base (Yang et al.[7]) (Caption+Tags)      43.3
PICa-Full (Yang et al.[7]) (Caption)           46.9
PICa-Full (Yang et al.[7]) (Caption+Tags)      48.0
Prophet-LLaMA (Shao et al.[29])                52.8
Ours                                           54.1

Table 2 Results on A-OKVQA.
Method                                         Accuracy (%)
Pythia (Jiang et al.[41])                      25.2
ClipCap (Mokady et al.[43])                    30.9
ViLBERT (Lu et al.[42])                        30.6
LXMERT (Tan and Bansal[44])                    30.7
KRISP (Marino et al.[6])                       33.7
GPV-2 (Kamath et al.[45])                      48.6
Prophet-LLaMA (Shao et al.[29])                51.2
Ours                                           52.9

On OK-VQA, our method outperforms other baselines by more than 1.3. It is evident that baselines utilizing LLMs consistently achieve better results compared to those without LLMs. LLMs are trained on extensive corpora, acquiring rich knowledge. Therefore, baselines using LLMs tend to outperform methods that do not leverage LLMs. Our approach, also grounded in LLMs, leverages the more powerful inference ability of such models, thereby outperforming the baselines that lack LLMs. Furthermore, our approach achieves better performance compared to existing LLM-based baselines. By selecting more appropriate in-context examples and further enhancing the LLM through knowledge injection, our framework demonstrates its capability to achieve superior results. On A-OKVQA, our approach consistently outperforms existing baselines, exhibiting an accuracy improvement of more than 1.7 compared to the existing baselines. These results once again underscore the effectiveness of our method.

Notably, our framework relies entirely on open LLMs, making it a cost-effective solution that is accessible to most researchers. In contrast, utilizing GPT-3 can be expensive and may result in significant costs. Our inference model employs the 7B version of LLaMA, featuring just 7 billion parameters. Notably, we achieve superior performance using only a single V100 32 GB GPU for inference, a setup that is affordable for most researchers.

4.4 Ablation study

We report the results of ablation experiments conducted on both OK-VQA and A-OKVQA. The ablation experiments focus on two key aspects: (1) Knowledge enhancement under different shots: We investigate the impact of knowledge enhancement under different shots, exploring how our approach performs with different numbers of in-context examples. (2) Using different LLMs for knowledge injection: We examine the influence of employing different LLMs for knowledge injection, evaluating the performance variations.

4.4.1 Knowledge enhancement under different shots

Table 3 shows the results of ablation experiments. ✘ Knowledge denotes that no knowledge is incorporated to enhance the LLM2. Ours (+ LLM1) means using LLM1 for knowledge enhancement. Ours (+ KGs) means using Knowledge Graphs (KGs)[51] for knowledge enhancement. We perform ablation experiments with and without in-context examples. 0-shot indicates no in-context examples.

Table 3 Ablation study.
                                Accuracy (%)
Shot      Method            OK-VQA    A-OKVQA    Average
8-shot    Ours (+ LLM1)     54.1      52.9       53.5
8-shot    Ours (+ KGs)      53.6      52.1       52.9
8-shot    ✘ Knowledge       52.9      51.5       52.2
0-shot    Ours (+ LLM1)     28.4      24.8       26.6
0-shot    Ours (+ KGs)      27.9      24.5       26.2
0-shot    ✘ Knowledge       27.2      24.2       25.7

Based on the results obtained from both datasets, it is evident that performance will decline in the absence of knowledge enhancement. We find that knowledge enhancement using LLM1 outperforms KGs both with and without in-context examples, confirming our view that the implicit knowledge of the LLM is more suitable for open-domain questions. In addition, in the absence of in-context examples, model performance will be significantly degraded, which also shows that in-context examples are crucial to the in-context learning paradigm of the LLM.

4.4.2 Different LLMs for knowledge enhancement

Figure 4 shows the results obtained by employing various LLMs for knowledge enhancement. Here, we define LLM1 as the LLM responsible for generating background knowledge, and LLM2 as the LLM used for inference. The approach involves using LLM1 to generate background knowledge, which is then injected into the prompt to guide LLM2 during inference. Our aim is to investigate the impact of utilizing different LLMs as LLM1. Specifically, we employ LLaMA1-7B, LLaMA1-13B, ChatGLM1, and ChatGLM2 as LLM1, respectively, and present the corresponding experimental results.

Fig. 4 Different LLMs for knowledge enhancement: (a) OK-VQA and (b) A-OKVQA.

Upon analyzing the results, it becomes apparent that more capable models tend to yield superior results. Specifically, LLaMA1-13B stands out as the top performer, attributed to its status as the largest model. The extensive parameterization of LLaMA1-13B

results in a richer hidden knowledge compared to other models, contributing to its superior outcomes. ChatGLM1, being trained in both Chinese and English, exhibits relatively lower performance due to its English knowledge not matching up to that of LLaMA. On the other hand, ChatGLM2, as the second-generation version of ChatGLM1, outperforms ChatGLM1, showcasing the positive impact of model advancements.

4.5 Parameter sensitivity study

We also investigate the impact of the number of in-context examples on the model performance. We analyze how performance varies with the number of in-context examples on both OK-VQA and A-OKVQA datasets. Kg-enhanced denotes the utilization of knowledge enhancement, while No Kg-enhanced signifies the absence of knowledge enhancement.

Fig. 5 Performance when varying the number of in-context examples: (a) OK-VQA and (b) A-OKVQA.

Figure 5 illustrates the variation in model performance with the number of in-context examples. Generally, the model performance exhibits an upward trend as the number of in-context examples increases. Specifically, in a 0-shot scenario (when the number of in-context examples is 0), the model performance is poor without in-context examples. However, a noticeable improvement is observed when the number of in-context examples is increased to 1, indicating the significance of in-context examples in model learning. As the number of in-context examples continues to increase, the model performance gradually reaches a saturation point, suggesting that constantly adding in-context examples does not always lead to a proportional improvement in model performance.

4.6 Case study

We select specific cases to examine the influence of background knowledge and in-context examples on model inference. The findings indicate that both background knowledge and in-context examples contribute positively to model inference.

Figure 6 illustrates the impact of background knowledge on inference. The correct answer contained in the background knowledge is denoted in blue font, while the correct answer itself is highlighted in green font. It is evident that the correct answer is present in the background knowledge, indicating that the model is more likely to make accurate predictions when leveraging background knowledge. We also present a failure case. In the last case, we can see that even if the correct answer is hit in the background knowledge, the model can still make a failed prediction. This also shows that background knowledge does not always help models make 100% correct predictions.

Figure 7 depicts the impact of in-context examples on inference. The correct answer hit in the in-context example is highlighted in red font, while the correct answer itself is denoted in green font. It is evident that the correct answer is present in the in-context examples, underscoring the increased likelihood of predicting the correct answer when leveraging in-context examples.

To explain the construction details of our prompt, we show a concrete example in Fig. 8. For better visualization, we show the prompt with four in-context examples.

4.7 Inference time

We record the inference time for different context lengths on OK-VQA. We adopt the LLaMA2-7B version as the inference model and use a Tesla V100 GPU. The statistics for inference time are presented in Fig. 9. The X-axis represents time in minutes, while the Y-axis corresponds to different shots, indicating the

Fig. 6 Impact of background knowledge on inference. Figures 6a and 6b are successful cases and Fig. 6c is a failed case. In Figs. 6a and 6b, the correct answer is hit in the background knowledge, which helps the model predict the correct answer. In Fig. 6c, the correct answer is hit in the background knowledge, but the model makes the failed prediction.

Fig. 7 Impact of in-context examples on inference. Because in-context examples play a key role in the in-context learning paradigm of LLMs, we show cases where in-context examples help the model predict. The red font indicates that the correct answer is hit in the in-context examples.

Please answer the question according to the context and knowledge.


======
Context: A yellow diamond shaped road sign on the right side of a road.
Question: What does the sign indicate?
Knowledge: A road sign is a sign that is placed on a road to provide information to drivers.
They are often used to indicate speed limits, road conditions, and other information. xxxxxx
Answer: roundabout
===
Context: A black and white shot of two people on their motorcycle.
Question: On which type of road are the people riding?
Knowledge: In this context, people are the two people riding the motorcycle. A motorcycle
is a two-wheeled vehicle that is powered by an engine. It is a popular form of transportation
in many countries. In this context, the term “road” refers to a paved surface that is used for
driving. xxxxxx
Answer: highway
===
Context: A car is parked at the end of a wooded street.
Question: What traffic sign is backwards?
Knowledge: A car is a type of vehicle. It is a four-wheeled motor vehicle that is powered by an
internal combustion engine. Cars are typically used for transportation, but can also be used for
recreation or racing. A traffic sign is a sign that is used to communicate information to drivers.
xxxxxx
Answer: yield
===
Context: Many different cars driving down a city road.
Question: What kind of road is this?
Knowledge: Cars are vehicles that are driven on roads. They are a common form of
transportation in urban areas. A road is a path or route that is used for traveling. It can be paved
or unpaved, and can be used for a variety of purposes, including transportation, recreation, or
agriculture. xxxxxx
Answer: highway
======
Context: A street sign is shown with a tree in the background.
Question: Where does the word shown here come from?
Knowledge: A street sign is a sign that provides information about a street or road. It can be
used to provide directions, identify the name of the street, and more. A tree is a woody plant
that grows in the ground. Trees can be found in many different environments, including forests,
parks, and backyards. xxxxxx
Answer:

Fig. 8 Prompt consisting of three parts: (1) The prompt head is used to describe the task; (2) Some in-context examples are
used for LLM learning; (3) Test input is in the same format as the in-context example, but the answer is left blank for LLM
prediction. We show the prompt with four in-context examples for better visualization.

number of in-context examples. It is evident that as the number of in-context examples increases, the inference time also extends. This correlation is attributed to the augmented length of the context, leading to longer processing time for the model.

4.8 Evaluation on additional datasets

To further validate the robustness of our method, we conduct additional evaluations on the DAQUAR dataset. To facilitate evaluation, we select one-tenth of the test set as the evaluation set to report the results. We compare with the best baseline model, Prophet, and show comparisons under different in-context example settings. Figure 10 illustrates the results. We find that our method outperforms the best baseline model under different parameter settings. The results again demonstrate the robustness of our method.

Fig. 9 Inference time required for different numbers of in-context examples.

Fig. 10 Comparison on DAQUAR dataset. On the DAQUAR dataset, we compare against the best baseline model under different numbers of in-context example settings.

5 Conclusion

We introduce PLLMKI, a knowledge-based VQA framework utilizing the knowledge-enhanced LLM. Unlike previous works, PLLMKI leverages the LLM to generate background knowledge, integrating it into another LLM for inference, as opposed to using knowledge graphs to enhance the LLM. Our framework consists of three components: (1) A vanilla VQA model is used to get in-context examples, and a captioning model is adopted to convert image-question-answer to context-question-answer; (2) LLM1 is used to generate background knowledge, and then the knowledge is integrated into the prompt, i.e., context-question-knowledge-answer; (3) The prompt is input into LLM2 to predict the answer. It is noteworthy that our framework relies solely on open LLMs, incurring no additional costs. Experiments demonstrate the superior performance of our approach on two knowledge-based VQA datasets, OK-VQA and A-OKVQA. As a next step, we consider applying the agent concept to knowledge-based VQA by designing a variety of tools for knowledge enhancement, including retrieval from knowledge graphs and knowledge bases, generation using LLMs, and so on. The agent technology is then utilised to automate the planning in order to inject knowledge in a flexible manner.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (No. 62272100), Consulting Project of Chinese Academy of Engineering (No. 2023-XY-09), and Fundamental Research Funds for the Central Universities.

References

[1] Q. Wu, P. Wang, X. Wang, X. He, and W. Zhu, Knowledge-based VQA, in Visual Question Answering, Q. Wu, P. Wang, X. Wang, X. He, and W. Zhu, eds. Singapore: Springer, 2022, pp. 73–90.
[2] S. Manmadhan and B. C. Kovoor, Visual question answering: a state-of-the-art review, Artif. Intell. Rev., vol. 53, no. 8, pp. 5705–5745, 2020.
[3] P. Wang, Q. Wu, C. Shen, A. Dick, and A. van den Hengel, FVQA: Fact-based visual question answering, IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 10, pp. 2413–2427, 2018.
[4] K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi, OK-VQA: A visual question answering benchmark requiring external knowledge, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 3190–3199.
[5] D. Schwenk, A. Khandelwal, C. Clark, K. Marino, and R. Mottaghi, A-OKVQA: A benchmark for visual question answering using world knowledge, in European Conference on Computer Vision, S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, eds. Cham, Switzerland: Springer, 2022, pp. 146–162.
[6] K. Marino, X. Chen, D. Parikh, A. Gupta, and M. Rohrbach, KRISP: Integrating implicit and symbolic knowledge for open-domain knowledge-based VQA, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 14106–14116.
[7] Z. Yang, Z. Gan, J. Wang, X. Hu, Y. Lu, Z. Liu, and L. Wang, An empirical study of GPT-3 for few-shot knowledge-based VQA, Proc. AAAI Conf. Artif. Intell., vol. 36, no. 3, pp. 3081–3089, 2022.
[8] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, arXiv preprint arXiv: 2005.14165, 2020.
[9] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, eds. New York, NY, USA: Curran Associates, Inc., 2020, pp. 9459–9474.

[10] D. Hong, B. Zhang, X. Li, Y. Li, C. Li, J. Yao, N. Yokoya, H. Li, P. Ghamisi, X. Jia, et al., SpectralGPT: Spectral remote sensing foundation model, IEEE Trans. Pattern Anal. Mach. Intell., doi: 10.1109/TPAMI.2024.3362475.
[11] D. Hong, B. Zhang, H. Li, Y. Li, J. Yao, C. Li, M. Werner, J. Chanussot, A. Zipf, and X. X. Zhu, Cross-city matters: A multimodal remote sensing benchmark dataset for cross-city semantic segmentation using high-resolution domain adaptation networks, Remote. Sens. Environ., vol. 299, p. 113856, 2023.
[12] D. Hong, N. Yokoya, J. Chanussot, and X. X. Zhu, An augmented linear mixing model to address spectral variability for hyperspectral unmixing, IEEE Trans. Image Process., vol. 28, no. 4, pp. 1923–1938, 2019.
[13] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, VQA: Visual question answering, in Proc. IEEE Int. Conf. Computer Vision (ICCV), Santiago, Chile, 2015, pp. 2425–2433.
[14] Y. Srivastava, V. Murali, S. R. Dubey, and S. Mukherjee, Visual question answering using deep learning: A survey and performance analysis, in Computer Vision and Image Processing, S. K. Singh, P. Roy, B. Raman, and P. Nagabhushan, eds. Singapore: Springer, 2021, pp. 75–86.
[15] P. Sun, W. Zhang, S. Li, Y. Guo, C. Song, and X. Li, Learnable depth-sensitive attention for deep RGB-D saliency detection with multi-modal fusion architecture search, Int. J. Comput. Vis., vol. 130, no. 11, pp. 2822–2841, 2022.
[16] Y. Wang, Q. Mao, H. Zhu, J. Deng, Y. Zhang, J. Ji, H. Li, and Y. Zhang, Multi-modal 3D object detection in autonomous driving: A survey, Int. J. Comput. Vis., vol. 131, no. 8, pp. 2122–2152, 2023.
[17] H. Jiang, I. Misra, M. Rohrbach, E. Learned-Miller, and X. Chen, In defense of grid features for visual question answering, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 10264–10273.
[18] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, and J. Gao, VinVL: Revisiting visual representations in vision-language models, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 5575–5584.
[19] L. Li, Z. Gan, Y. Cheng, and J. Liu, Relation-aware graph attention network for visual question answering, in Proc. IEEE/CVF Int. Conf. Computer Vision (ICCV), Seoul, Republic of Korea, 2019, pp. 10312–10321.
[20] Z. Yu, Y. Cui, J. Yu, M. Wang, D. Tao, and Q. Tian, Deep multimodal neural architecture search, in Proc. 28th ACM Int. Conf. Multimedia, Seattle, WA, USA, 2020, pp. 3743–3752.
[21] Z. Yu, J. Yu, Y. Cui, D. Tao, and Q. Tian, Deep modular co-attention networks for visual question answering, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 6274–6283.
[22] Y. Cui, Z. Yu, C. Wang, Z. Zhao, J. Zhang, M. Wang, and J. Yu, ROSITA: Enhancing vision-and-language semantic alignments via cross- and intra-modal knowledge integration, in Proc. 29th ACM Int. Conf. Multimedia, Virtual Event, 2021, pp. 797–806.
[23] J. Li, D. Li, C. Xiong, and S. Hoi, BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, arXiv preprint arXiv: 2201.12086, 2022.
[24] M. Zhou, L. Yu, A. Singh, M. Wang, Z. Yu, and N. Zhang, Unsupervised vision-and-language pre-training via retrieval-based multi-granular alignment, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 16464–16473.
[25] M. Malinowski, M. Rohrbach, and M. Fritz, Ask your neurons: A neural-based approach to answering questions about images, in Proc. IEEE Int. Conf. Computer Vision (ICCV), Santiago, Chile, 2015, pp. 1–9.
[26] Z. Yu, J. Yu, J. Fan, and D. Tao, Multi-modal factorized bilinear pooling with co-attention learning for visual question answering, in Proc. IEEE Int. Conf. Computer Vision (ICCV), Venice, Italy, 2017, pp. 1839–1848.
[27] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, Neural module networks, in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 39–48.
[28] J. Wu, J. Lu, A. Sabharwal, and R. Mottaghi, Multi-modal answer validation for knowledge-based VQA, Proc. AAAI Conf. Artif. Intell., vol. 36, no. 3, pp. 2712–2721, 2022.
[29] Z. Shao, Z. Yu, M. Wang, and J. Yu, Prompting large language models with answer heuristics for knowledge-based visual question answering, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 2023, pp. 14974–14983.
[30] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu, Unifying large language models and knowledge graphs: A roadmap, IEEE Trans. Knowl. Data Eng., doi: 10.1109/TKDE.2024.3352100.
[31] X. Zou, A survey on application of knowledge graph, J. Phys.: Conf. Ser., vol. 1487, no. 1, p. 012016, 2020.
[32] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu, ERNIE: Enhanced language representation with informative entities, in Proc. 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 1441–1451.
[33] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, J. Liu, X. Chen, Y. Zhao, Y. Lu, et al., ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation, arXiv preprint arXiv: 2107.02137, 2021.
[34] J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, and W. Chen, What makes good in-context examples for GPT-3? in Proc. Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, Dublin, Ireland, 2022, pp. 100–114.

[35] H. Ben-Younes, R. Cadene, M. Cord, and N. Thome, MUTAN: Multimodal tucker fusion for visual question answering, in Proc. IEEE Int. Conf. Computer Vision (ICCV), Venice, Italy, 2017, pp. 2631–2639.
[36] Z. Zhu, J. Yu, Y. Wang, Y. Sun, Y. Hu, and Q. Wu, Mucko: Multi-layer cross-modal knowledge reasoning for fact-based visual question answering, in Proc. 29th Int. Joint Conf. Artificial Intelligence, Yokohama, Japan, 2020, pp. 1097–1103.
[37] F. Gardères, M. Ziaeefard, B. Abeloos, and F. Lecue, ConceptBert: Concept-aware representation for visual question answering, in Proc. Findings of the Association for Computational Linguistics: EMNLP 2020, Virtual Event, 2020, pp. 489–498.
[38] M. Luo, Y. Zeng, P. Banerjee, and C. Baral, Weakly-supervised visual-retriever-reader for knowledge-based question answering, in Proc. 2021 Conf. Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 2021, pp. 6417–6431.
[39] F. Gao, Q. Ping, G. Thattai, A. Reganti, Y. N. Wu, and P. Natarajan, Transform-retrieve-generate: Natural language-centric outside-knowledge visual question answering, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 5057–5067.
[40] Y. Guo, L. Nie, Y. Wong, Y. Liu, Z. Cheng, and M. Kankanhalli, A unified end-to-end retriever-reader framework for knowledge-based VQA, in Proc. 30th ACM Int. Conf. Multimedia, Lisboa, Portugal, 2022, pp. 2061–2069.
[41] Y. Jiang, V. Natarajan, X. Chen, M. Rohrbach, D. Batra, and D. Parikh, Pythia v0.1: The winning entry to the VQA challenge 2018, arXiv preprint arXiv: 1807.09956, 2018.
[42] J. Lu, D. Batra, D. Parikh, and S. Lee, ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, eds. New York, NY, USA: Curran Associates, Inc., 2019, pp. 13–23.
[43] R. Mokady, A. Hertz, and A. H. Bermano, ClipCap: CLIP prefix for image captioning, arXiv preprint arXiv: 2111.09734, 2021.
[44] H. Tan and M. Bansal, LXMERT: Learning cross-modality encoder representations from transformers, in Proc. 2019 Conf. Empirical Methods in Natural Language Processing and the 9th Int. Joint Conf. Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 2019, pp. 5100–5111.
[45] A. Kamath, C. Clark, T. Gupta, E. Kolve, D. Hoiem, and A. Kembhavi, Webly supervised concept expansion for general purpose vision models, in Computer Vision—ECCV 2022, S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, eds. Cham, Switzerland: Springer, 2022, pp. 662–681.
[46] S. Ravi, A. Chinchure, L. Sigal, R. Liao, and V. Shwartz, VLC-BERT: Visual question answering with contextualized commonsense knowledge, in Proc. IEEE/CVF Winter Conf. Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2023, pp. 1155–1165.
[47] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, Making the V in VQA matter: Elevating the role of image understanding in visual question answering, in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 6325–6334.
[48] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. J. Li, D. A. Shamma, et al., Visual genome: Connecting language and vision using crowdsourced dense image annotations, Int. J. Comput. Vis., vol. 123, pp. 32–73, 2017.
[49] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, arXiv preprint arXiv: 2302.13971, 2023.
[50] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv: 2307.09288, 2023.
[51] F. Ilievski, P. Szekely, and B. Zhang, CSKG: The commonsense knowledge graph, in The Semantic Web, R. Verborgh, K. Hose, H. Paulheim, P. Champin, M. Maleshkova, O. Corcho, P. Ristoski, and M. Alam, eds. Cham, Switzerland: Springer, 2021, pp. 680–696.

Zhongjian Hu is currently pursuing the PhD degree at the School of Computer Science and Engineering, Southeast University, Nanjing, China. His research interests include artificial intelligence, natural language processing, etc.

Peng Yang is a professor at the School of Computer Science and Engineering, Southeast University, Nanjing, China. His research interests include artificial intelligence, natural language processing, big data, etc.

Fengyuan Liu is currently pursuing the master degree at the Southeast University - Monash University Joint Graduate School (Suzhou), Southeast University, China. His research interests include artificial intelligence, large language models, etc.

Yuan Meng is currently pursuing the PhD degree at the School of Computer Science and Engineering, Southeast University, Nanjing, China. Her research interests include knowledge graphs, natural language processing, etc.

Xingyu Liu is currently pursuing the master degree at the Southeast University - Monash University Joint Graduate School (Suzhou), Southeast University, China. Her research interests include AI agents, large language models, etc.
