Prompting Large Language Models with Knowledge-Injection for Knowledge-Based Visual Question Answering
Zhongjian Hu, Peng Yang*, Fengyuan Liu, Yuan Meng, and Xingyu Liu
Abstract: Previous works employ Large Language Models (LLMs) such as GPT-3 for knowledge-based Visual Question Answering (VQA). We argue that the inferential capacity of an LLM can be enhanced through knowledge injection. Although methods that utilize knowledge graphs to enhance LLMs have been explored in various tasks, they may have limitations, such as failing to retrieve the required knowledge. In this paper, we introduce a novel framework for knowledge-based VQA titled “Prompting Large Language Models with Knowledge-Injection” (PLLMKI). We use a vanilla VQA model to inspire the LLM and further enhance the LLM with knowledge injection. Unlike earlier approaches, we adopt an LLM for knowledge enhancement instead of relying on knowledge graphs. Furthermore, we leverage open LLMs, incurring no additional costs. In comparison to existing baselines, our approach exhibits accuracy improvements of over 1.3 and 1.7 points on two knowledge-based VQA datasets, namely OK-VQA and A-OKVQA, respectively.
Key words: visual question answering; knowledge-based visual question answering; large language model;
knowledge injection
© The author(s) 2024. The articles published in this open access journal are distributed under the terms of the
Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
better model architectures[19–21], and better learning paradigms[22–24]. These methods can be divided into joint embedding methods, attention methods, modular methods, external knowledge-based methods, and so on. Most joint embedding methods use a Convolutional Neural Network (CNN) to extract visual features and a Recurrent Neural Network (RNN) to extract text features, and simply combine the two kinds of features. The Neural-Image-QA model proposed by Malinowski et al.[25] is the first to leverage the joint embedding method. The model is based on a CNN and Long Short-Term Memory (LSTM), treating the VQA task as a sequence-to-sequence task assisted by image information. Nevertheless, the majority of joint embedding methods commonly utilize all the features extracted from both images and questions as the input for the VQA model. This approach may introduce a considerable amount of noise, potentially affecting the performance. The objective of the attention method is to concentrate the limited attention on crucial elements, significantly enhancing the comprehension capability of the neural network. Yu et al.[26] introduced a multi-modal factorized bilinear pooling approach, where text attention is inferred based on the question, and visual attention is inferred with the involvement of text attention. However, the VQA task is compositional. For example, in a question like “What’s on the table?”, it is necessary to first determine the position of the table, then identify the location above the table, and finally ascertain the target object above the table. Hence, some studies have proposed modular networks for the VQA task. The modular approach involves designing distinct modules for various functions and connecting these modules based on different questions. Andreas et al.[27] first applied neural modular networks to VQA. Additionally, there exists a category of VQA that necessitates external knowledge, often referred to as knowledge-based VQA.

Knowledge-based VQA. Some benchmarks for knowledge-based VQA have been proposed, necessitating external knowledge to answer questions. Early works retrieve knowledge from external knowledge resources. More recently, Marino et al.[6] proposed KRISP to retrieve implicit knowledge stored in pre-trained language models. MAVEx[28] proposes a validation method aimed at improving the utilization of noisy knowledge. Yang et al.[7] proposed PICa, which applies in-context learning of GPT-3 to knowledge-based VQA, achieving encouraging results. PICa utilizes a captioning model to convert the image into a corresponding caption, which can be processed by the LLM. In-context learning, a powerful few-shot learning technique, enables reasoning with a few task examples assembled as the prompt, eliminating the need for parameter updates. Prophet[29] adopts a vanilla VQA model to inspire the LLM, further activating the capability of the LLM.

Knowledge-enhanced LLMs. LLMs have demonstrated promising results across various tasks. Researchers explore the use of knowledge graphs to enhance LLMs[30]. Knowledge graphs[31] offer a means to enhance LLMs by incorporating knowledge during pre-training, a process that extends to the inference stage as well.

When integrating knowledge graphs into training objectives, ERNIE[32] adopts a method where both sentences and corresponding entities are input into LLMs. The training process involves instructing the LLMs to predict alignment links. On the other hand, ERNIE 3.0[33] represents a knowledge graph triple as tokens, concatenating them with sentences. RAG[9] employs a distinctive approach by initially searching and retrieving relevant documents from knowledge graphs. These documents are then provided to the language model as additional context information. Despite the benefits of knowledge graphs in enhancing LLMs, they may face challenges in retrieving the required knowledge. In this paper, a novel idea is proposed, suggesting the utilization of one LLM to enhance knowledge for another LLM, as an alternative to traditional knowledge graphs.

3 Methodology

The Prompting Large Language Models with Knowledge-Injection (PLLMKI) framework is illustrated in the figure below. The framework comprises three main components: (1) Utilizing a vanilla VQA model to obtain in-context examples, which are then processed by a captioning model to transform the image-question-answer into context-question-answer. (2) Employing the LLM1 to generate background knowledge and integrating the knowledge into the prompt, resulting in context-question-knowledge-answer. (3) Inputting the modified prompt into the LLM2 to predict the answer.
[Framework overview figure. Panels: image to text (captioning); prompt for knowledge generation (“Please generate background knowledge in English based on the context and question.”); generated background knowledge; the input prompt, which assembles the in-context examples and the test input (“Please answer the question according to the context and knowledge. The knowledge is the background knowledge for the context and question.”); and LLM2 (inference), which produces the prediction.]
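To make the three stages concrete, the following is a minimal Python sketch of the pipeline, intended as an illustration rather than the actual implementation. All callables passed in (caption_image, select_examples, llm1_generate, llm2_generate, and format_prompt) are hypothetical stand-ins for the captioning model, the vanilla-VQA-based example selector, the two LLMs, and the prompt builder; the knowledge-generation instruction follows the wording shown in the framework overview figure.

```python
# Illustrative sketch of the PLLMKI pipeline; all callables are hypothetical stand-ins.

def pllmki_answer(image, question, caption_image, select_examples,
                  llm1_generate, llm2_generate, format_prompt):
    """Answer a visual question via the three PLLMKI stages.

    caption_image(image)             -> caption string ("image to text")
    select_examples(image, question) -> list of dicts with keys
                                        'context', 'question', 'answer'
    llm1_generate(prompt)            -> background knowledge string (LLM1)
    llm2_generate(prompt)            -> answer string (LLM2)
    format_prompt(examples, context, question, knowledge) -> full prompt
    """
    knowledge_head = ("Please generate background knowledge in English "
                      "based on the context and question.")

    def ask_knowledge(context, q):
        # Stage 2: LLM1 turns a (context, question) pair into background knowledge.
        prompt = (f"{knowledge_head}\n"
                  f"Context: {context}\nQuestion: {q}\nKnowledge:")
        return llm1_generate(prompt)

    # Stage 1: convert the test image into text and fetch in-context examples.
    context = caption_image(image)
    examples = select_examples(image, question)

    # Stage 2: inject knowledge, turning context-question-answer examples
    # into context-question-knowledge-answer examples.
    for ex in examples:
        ex["knowledge"] = ask_knowledge(ex["context"], ex["question"])
    knowledge = ask_knowledge(context, question)

    # Stage 3: assemble the prompt and let LLM2 predict the answer.
    return llm2_generate(format_prompt(examples, context, question, knowledge))
```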
framework that can be used with image-to-text models and knowledge bases.
• UnifER[40] is a knowledge-based VQA framework based on the unified end-to-end retriever-reader.
• PICa[7] applies the in-context learning paradigm of GPT-3 to knowledge-based VQA. It utilizes a captioning model to convert the image into text.
• Pythia[41] is a bottom-up top-down framework. It is improved through modifications to the model structure and the data augmentation.
• ViLBERT[42] is a model for learning the joint representation of image and text. It extends the BERT architecture to support multi-modality.
• ClipCap[43] is a captioning method that employs pre-trained models for processing visual and text.
• LXMERT[44] constructs a large transformer model comprising three encoders: object relationship encoder, language encoder, and cross-modality encoder.
• GPV-2[45] is based on General Purpose Vision (GPV) and is designed to address a wide range of visual tasks without necessitating changes to the architecture.
• VLC-BERT[46] is a model designed to integrate common sense knowledge into the visual language BERT.
• Prophet[29] is proposed to inspire GPT-3 using a vanilla VQA model. We replace GPT-3 with LLaMA and keep the settings consistent for a fair comparison.

Table 1 Results on OK-VQA.
Method                                      Accuracy (%)
MUTAN+AN (Ben-Younes et al.[35])            27.8
Mucko (Zhu et al.[36])                      29.2
ConceptBert (Gardères et al.[37])           33.7
KRISP (Marino et al.[6])                    38.9
MAVEx (Wu et al.[28])                       39.4
Visual-retriever-reader (Luo et al.[38])    39.2
VLC-BERT (Ravi et al.[46])                  43.1
TRiG (Gao et al.[39])                       49.4
UnifER (Guo et al.[40])                     42.1
PICa-Base (Yang et al.[7]) (Caption)        42.0
PICa-Base (Yang et al.[7]) (Caption+Tags)   43.3
PICa-Full (Yang et al.[7]) (Caption)        46.9
PICa-Full (Yang et al.[7]) (Caption+Tags)   48.0
Prophet-LLaMA (Shao et al.[29])             52.8
Ours                                        54.1

Table 2 Results on A-OKVQA.
Method                            Accuracy (%)
Pythia (Jiang et al.[41])         25.2
ClipCap (Mokady et al.[43])       30.9
ViLBERT (Lu et al.[42])           30.6
LXMERT (Tan and Bansal[44])       30.7
KRISP (Marino et al.[6])          33.7
GPV-2 (Kamath et al.[45])         48.6
Prophet-LLaMA (Shao et al.[29])   51.2
Ours                              52.9
4.2.2 Implementation

For the captioning model, we follow PICa[7], which uses OSCAR+[18]. For the in-context example selection, we follow the previous works[29] and use the MCAN-large[21] model pre-trained on VQAv2[47] and Visual Genome[48]. We use LLaMA1[49] as the LLM for knowledge enhancement, because LLaMA1 is an excellent and open LLM with powerful capability. We use LLaMA2[50] as the LLM for inference. Compared to LLaMA1, LLaMA2 can support a longer context, which is conducive to injecting knowledge into the context. We use the LLaMA 7B version. Considering the length limit of the context, we set the number of in-context examples to 8 and set the length of the knowledge to no more than 256. We use the default settings unless otherwise specified.
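For orientation, the settings above can be gathered into a single configuration object. The sketch below is only an illustrative summary of Section 4.2.2; the field names are introduced here for readability and do not come from any released code.

```python
from dataclasses import dataclass

@dataclass
class PLLMKIConfig:
    # Illustrative summary of the settings in Section 4.2.2 (field names are ours).
    captioner: str = "OSCAR+"             # captioning model, following PICa
    example_selector: str = "MCAN-large"  # vanilla VQA model pre-trained on VQAv2 and Visual Genome
    knowledge_llm: str = "LLaMA1-7B"      # LLM1, used for knowledge enhancement
    inference_llm: str = "LLaMA2-7B"      # LLM2, chosen for its longer context
    num_in_context_examples: int = 8      # limited by the context length
    max_knowledge_length: int = 256       # cap on the length of generated knowledge

DEFAULT_CONFIG = PLLMKIConfig()
```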
4.3 Experimental result

We report the results on OK-VQA and A-OKVQA. Tables 1 and 2 show the results.

On OK-VQA, our method outperforms other baselines by more than 1.3 points. It is evident that baselines utilizing LLMs consistently achieve better results compared to those without LLMs. LLMs are trained on extensive corpora, acquiring rich knowledge. Therefore, baselines using LLMs tend to outperform methods that do not leverage LLMs. Our approach, also grounded in LLMs, leverages the more powerful inference ability of such models, thereby outperforming the baselines that lack LLMs. Furthermore, our approach achieves better performance compared to existing LLM-based baselines. By selecting more appropriate in-context examples and further enhancing the LLM through knowledge injection, our framework demonstrates its capability to achieve superior results. On A-OKVQA, our approach consistently outperforms existing baselines, exhibiting an accuracy improvement of more than 1.7 points compared to the existing baselines. These results once again underscore the effectiveness of our method.
Notably, our framework relies entirely on open LLMs, making it a cost-effective solution that is accessible to most researchers. In contrast, utilizing GPT-3 can be expensive and may result in significant costs. Our inference model employs the 7B version of LLaMA, featuring just 7 billion parameters. Notably, we achieve superior performance using only a single V100 32 GB GPU for inference, a setup that is affordable for most researchers.

4.4 Ablation study

We report the results of ablation experiments conducted on both OK-VQA and A-OKVQA. The ablation experiments focus on two key aspects: (1) Knowledge enhancement under different shots: We investigate the impact of knowledge enhancement under different shots, exploring how our approach performs with different numbers of in-context examples. (2) Using different LLMs for knowledge injection: We examine the influence of employing different LLMs for knowledge injection, evaluating the performance variations.

4.4.1 Knowledge enhancement under different shots

Table 3 shows the results of ablation experiments. ✘ Knowledge denotes that no knowledge is incorporated to enhance the LLM2. Ours (+ LLM1) means using LLM1 for knowledge enhancement. Ours (+ KGs) means using Knowledge Graphs (KGs)[51] for knowledge enhancement. We perform ablation experiments with and without in-context examples. 0-shot indicates no in-context examples.

In the absence of in-context examples, model performance will be significantly degraded, which also shows that in-context examples are crucial to the in-context learning paradigm of LLM.
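The shot-count ablation can be driven by a simple evaluation loop such as the sketch below; predict and vqa_accuracy are hypothetical stand-ins for the full pipeline (with a switch for knowledge injection) and the scoring function, and the shot settings are given only as an example.

```python
def run_shot_ablation(eval_set, predict, vqa_accuracy, shots=(0, 1, 4, 8)):
    """Evaluate accuracy with and without knowledge injection at each shot count.

    predict(sample, n_shots, use_knowledge) and vqa_accuracy(prediction, sample)
    are hypothetical stand-ins for the pipeline and the evaluation metric.
    """
    results = {}
    for n_shots in shots:
        for use_knowledge in (True, False):
            scores = [vqa_accuracy(predict(s, n_shots, use_knowledge), s)
                      for s in eval_set]
            results[(n_shots, use_knowledge)] = 100.0 * sum(scores) / len(scores)
    return results
```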
4.4.2 Different LLMs for knowledge enhancement

Figure 4 shows the results obtained by employing various LLMs for knowledge enhancement. Here, we define LLM1 as the LLM responsible for generating background knowledge, and LLM2 as the LLM used for inference. The approach involves using LLM1 to generate background knowledge, which is then injected into the prompt to guide LLM2 during inference. Our aim is to investigate the impact of utilizing different LLMs as LLM1. Specifically, we employ LLaMA1-7B, LLaMA1-13B, ChatGLM1, and ChatGLM2 as LLM1, respectively, and present the corresponding experimental results.

Upon analyzing the results, it becomes apparent that more capable models tend to yield superior results. Specifically, LLaMA1-13B stands out as the top performer, attributed to its status as the largest model. The extensive parameterization of LLaMA1-13B [...] exhibits relatively lower performance due to its English knowledge not matching up to that of LLaMA. On the [...]

[Fig. 4 Accuracy (%) when different LLMs (LLaMA1-7B, LLaMA1-13B, ChatGLM1, and ChatGLM2) are used for knowledge enhancement.]

[...] context examples on both OK-VQA and A-OKVQA datasets. Kg-enhanced denotes the utilization of knowledge enhancement, while No Kg-enhanced signifies the absence of knowledge enhancement. Figure 5 illustrates the variation in model performance with the number of in-context examples. Generally, the model performance exhibits an upward trend as the number of in-context examples increases. Specifically, in a 0-shot scenario (when the number of in-context examples is 0), the model performance is poor without in-context examples. However, a noticeable improvement is observed when the number of in-context examples is increased to 1, indicating the significance of in-context examples in model learning. As the number of in-context examples continues to increase, the model performance gradually reaches a saturation point, suggesting that constantly adding in-context examples does not always lead to a proportional improvement in model performance.

Fig. 5 Performance when varying the number of in-context examples.

4.6 Case study

We select specific cases to examine the influence of background knowledge and in-context examples on model inference. The findings indicate that both background knowledge and in-context examples contribute positively to model inference.

Figure 6 illustrates the impact of background knowledge on inference. The correct answer contained in the background knowledge is denoted in blue font, while the correct answer itself is highlighted in green font. It is evident that the correct answer is present in the background knowledge, indicating that the model is more likely to make accurate predictions when leveraging background knowledge. We also present a failure case. In the last case, we can see that even if the correct answer is hit in the background knowledge, the model can still predict failure. This also shows that background knowledge does not always help models make 100% correct predictions.

Figure 7 depicts the impact of in-context examples on inference. The correct answer hit in the in-context example is highlighted in red font, while the correct answer itself is denoted in green font. It is evident that the correct answer is present in the in-context examples, underscoring the increased likelihood of predicting the correct answer when leveraging in-context examples.

To explain the construction details of our prompt, we show a concrete example in Fig. 8. For better visualization, we show the prompt with four in-context examples.

4.7 Inference time

We record the inference time for different context lengths on OK-VQA. We adopt the LLaMA2-7B version as the inference model and use a Tesla V100 GPU. The statistics for inference time are presented in Fig. 9. The X-axis represents time in minutes, while the Y-axis corresponds to different shots, indicating the number of in-context examples. It is evident that as the number of in-context examples increases, the inference time also extends. This correlation is attributed to the augmented length of the context, leading to longer processing time for the model.
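The timing protocol described above can be sketched as a plain wall-clock loop; predict is again a hypothetical wrapper around the LLaMA2-7B inference step, and only the bookkeeping is shown.

```python
import time

def measure_inference_time(eval_set, predict, shots=(0, 1, 4, 8)):
    """Record total wall-clock inference time (in minutes) for each shot setting."""
    minutes = {}
    for n_shots in shots:
        start = time.perf_counter()
        for sample in eval_set:
            predict(sample, n_shots)  # longer prompts lead to longer processing time
        minutes[n_shots] = (time.perf_counter() - start) / 60.0
    return minutes
```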
Fig. 6 Impact of background knowledge on inference. Figures 6a and 6b are successful cases and Fig. 6c is a failed case. In Figs. 6a and 6b, the correct answer is hit in the background knowledge, which helps the model predict the correct answer. In Fig. 6c, the correct answer is hit in background knowledge, but the model makes the failed prediction.
Fig. 7 Impact of in-context examples on inference. Because in-context examples play a key role in the in-context learning paradigm of LLMs, we show cases where in-context examples help the model predict. The red font indicates that the correct answer is hit in the in-context examples.
Fig. 8 Prompt consisting of three parts: (1) The prompt head is used to describe the task; (2) Some in-context examples are
used for LLM learning; (3) Test input is in the same format as the in-context example, but the answer is left blank for LLM
prediction. We show the prompt with four in-context examples for better visualization.
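Following the three-part layout described in the caption of Fig. 8, the prompt can be assembled as in the sketch below. This illustrative function plays the role of the format_prompt callable in the pipeline sketch of Section 3; the prompt head follows the wording shown in the framework overview figure, and the field layout mirrors the in-context examples.

```python
ANSWER_HEAD = ("Please answer the question according to the context and knowledge. "
               "The knowledge is the background knowledge for the context and question.")

def format_prompt(examples, context, question, knowledge, head=ANSWER_HEAD):
    """Assemble the three parts named in Fig. 8: prompt head, in-context
    examples, and the test input with the answer left blank."""
    blocks = [head]
    for ex in examples:
        # Each in-context example is a complete context-question-knowledge-answer block.
        blocks.append("Context: {context}\nQuestion: {question}\n"
                      "Knowledge: {knowledge}\nAnswer: {answer}".format(**ex))
    # The test input uses the same format, but the answer is left for the LLM to fill in.
    blocks.append(f"Context: {context}\nQuestion: {question}\n"
                  f"Knowledge: {knowledge}\nAnswer:")
    return "\n\n".join(blocks)
```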
4.8 Evaluation on additional datasets

To further validate the robustness of our method, we conduct additional evaluations on the DAQUAR dataset. To facilitate evaluation, we select one-tenth of the test set as the evaluation set to report the results. We compare with the best baseline model, Prophet, and show comparisons under different in-context example settings. Figure 10 illustrates the results. We find that our method outperforms the best baseline model under different parameter settings. The results again demonstrate the robustness of our method.
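As a small illustration of this evaluation protocol, one way to draw a fixed one-tenth evaluation subset is to subsample the test set with a fixed random seed; the snippet below sketches that idea and is not the exact selection procedure used.

```python
import random

def sample_evaluation_subset(test_set, fraction=0.1, seed=0):
    """Select a reproducible subset (one-tenth by default) of the test set."""
    rng = random.Random(seed)  # fixed seed keeps the subset stable across runs
    k = max(1, int(len(test_set) * fraction))
    return rng.sample(list(test_set), k)
```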
Fig. 9 Inference time required for different numbers of in-context examples.

[Fig. 10 Comparison between our method (Ours) and the best baseline (Baseline) on DAQUAR under different in-context example settings; y-axis: Accuracy (%).]

[...] variety of tools for knowledge enhancement, including retrieval from knowledge graphs and knowledge bases, generation using LLMs, and so on. The agent technology is then utilized to automate the planning in order to inject knowledge in a flexible manner.

Acknowledgment

This work was supported by the National Natural Science Foundation of China (No. 62272100), Consulting Project of Chinese Academy of Engineering (No. 2023-XY-09), and Fundamental Research Funds for the Central Universities.

References
[9] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, eds. New York, NY, USA: Curran Associates, Inc., 2020, pp. 9459–9474.
[10] D. Hong, B. Zhang, X. Li, Y. Li, C. Li, J. Yao, N. Yokoya, H. Li, P. Ghamisi, X. Jia, et al., SpectralGPT: Spectral remote sensing foundation model, IEEE Trans. Pattern Anal. Mach. Intell., doi: 10.1109/TPAMI.2024.3362475.
[11] D. Hong, B. Zhang, H. Li, Y. Li, J. Yao, C. Li, M. Werner, J. Chanussot, A. Zipf, and X. X. Zhu, Cross-city matters: A multimodal remote sensing benchmark dataset for cross-city semantic segmentation using high-resolution domain adaptation networks, Remote. Sens. Environ., vol. 299, p. 113856, 2023.
[12] D. Hong, N. Yokoya, J. Chanussot, and X. X. Zhu, An augmented linear mixing model to address spectral variability for hyperspectral unmixing, IEEE Trans. Image Process., vol. 28, no. 4, pp. 1923–1938, 2019.
[13] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, VQA: Visual question answering, in Proc. IEEE Int. Conf. Computer Vision (ICCV), Santiago, Chile, 2015, pp. 2425–2433.
[14] Y. Srivastava, V. Murali, S. R. Dubey, and S. Mukherjee, Visual question answering using deep learning: A survey and performance analysis, in Computer Vision and Image Processing, S. K. Singh, P. Roy, B. Raman, and P. Nagabhushan, eds. Singapore: Springer, 2021, pp. 75–86.
[15] P. Sun, W. Zhang, S. Li, Y. Guo, C. Song, and X. Li, Learnable depth-sensitive attention for deep RGB-D saliency detection with multi-modal fusion architecture search, Int. J. Comput. Vis., vol. 130, no. 11, pp. 2822–2841, 2022.
[16] Y. Wang, Q. Mao, H. Zhu, J. Deng, Y. Zhang, J. Ji, H. Li, and Y. Zhang, Multi-modal 3D object detection in autonomous driving: A survey, Int. J. Comput. Vis., vol. 131, no. 8, pp. 2122–2152, 2023.
[17] H. Jiang, I. Misra, M. Rohrbach, E. Learned-Miller, and X. Chen, In defense of grid features for visual question answering, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 10264–10273.
[18] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, and J. Gao, VinVL: Revisiting visual representations in vision-language models, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2021, pp. 5575–5584.
[19] L. Li, Z. Gan, Y. Cheng, and J. Liu, Relation-aware graph attention network for visual question answering, in Proc. IEEE/CVF Int. Conf. Computer Vision (ICCV), Seoul, Republic of Korea, 2019, pp. 10312–10321.
[20] Z. Yu, Y. Cui, J. Yu, M. Wang, D. Tao, and Q. Tian, Deep multimodal neural architecture search, in Proc. 28th ACM Int. Conf. Multimedia, Seattle, WA, USA, 2020, pp. 3743–3752.
[21] Z. Yu, J. Yu, Y. Cui, D. Tao, and Q. Tian, Deep modular co-attention networks for visual question answering, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 6274–6283.
[22] Y. Cui, Z. Yu, C. Wang, Z. Zhao, J. Zhang, M. Wang, and J. Yu, ROSITA: Enhancing vision-and-language semantic alignments via cross- and intra-modal knowledge integration, in Proc. 29th ACM Int. Conf. Multimedia, Virtual Event, 2021, pp. 797–806.
[23] J. Li, D. Li, C. Xiong, and S. Hoi, BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, arXiv preprint arXiv: 2201.12086, 2022.
[24] M. Zhou, L. Yu, A. Singh, M. Wang, Z. Yu, and N. Zhang, Unsupervised vision-and-language pre-training via retrieval-based multi-granular alignment, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 16464–16473.
[25] M. Malinowski, M. Rohrbach, and M. Fritz, Ask your neurons: A neural-based approach to answering questions about images, in Proc. IEEE Int. Conf. Computer Vision (ICCV), Santiago, Chile, 2015, pp. 1–9.
[26] Z. Yu, J. Yu, J. Fan, and D. Tao, Multi-modal factorized bilinear pooling with co-attention learning for visual question answering, in Proc. IEEE Int. Conf. Computer Vision (ICCV), Venice, Italy, 2017, pp. 1839–1848.
[27] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, Neural module networks, in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 39–48.
[28] J. Wu, J. Lu, A. Sabharwal, and R. Mottaghi, Multi-modal answer validation for knowledge-based VQA, Proc. AAAI Conf. Artif. Intell., vol. 36, no. 3, pp. 2712–2721, 2022.
[29] Z. Shao, Z. Yu, M. Wang, and J. Yu, Prompting large language models with answer heuristics for knowledge-based visual question answering, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 2023, pp. 14974–14983.
[30] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu, Unifying large language models and knowledge graphs: A roadmap, IEEE Trans. Knowl. Data Eng., doi: 10.1109/TKDE.2024.3352100.
[31] X. Zou, A survey on application of knowledge graph, J. Phys.: Conf. Ser., vol. 1487, no. 1, p. 012016, 2020.
[32] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu, ERNIE: Enhanced language representation with informative entities, in Proc. 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 1441–1451.
[33] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, J. Liu, X. Chen, Y. Zhao, Y. Lu, et al., ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation, arXiv preprint arXiv: 2107.02137, 2021.
[34] J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, and W. Chen, What makes good in-context examples for GPT-3? in Proc. Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, Dublin, Ireland, 2022, pp. 100–114.
[35] H. Ben-Younes, R. Cadene, M. Cord, and N. Thome, MUTAN: Multimodal tucker fusion for visual question answering, in Proc. IEEE Int. Conf. Computer Vision (ICCV), Venice, Italy, 2017, pp. 2631–2639.
[36] Z. Zhu, J. Yu, Y. Wang, Y. Sun, Y. Hu, and Q. Wu, Mucko: Multi-layer cross-modal knowledge reasoning for fact-based visual question answering, in Proc. 29th Int. Joint Conf. Artificial Intelligence, Yokohama, Japan, 2020, pp. 1097–1103.
[37] F. Gardères, M. Ziaeefard, B. Abeloos, and F. Lecue, ConceptBert: Concept-aware representation for visual question answering, in Proc. Findings of the Association for Computational Linguistics: EMNLP 2020, Virtual Event, 2020, pp. 489–498.
[38] M. Luo, Y. Zeng, P. Banerjee, and C. Baral, Weakly-supervised visual-retriever-reader for knowledge-based question answering, in Proc. 2021 Conf. Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 2021, pp. 6417–6431.
[39] F. Gao, Q. Ping, G. Thattai, A. Reganti, Y. N. Wu, and P. Natarajan, Transform-retrieve-generate: Natural language-centric outside-knowledge visual question answering, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 5057–5067.
[40] Y. Guo, L. Nie, Y. Wong, Y. Liu, Z. Cheng, and M. Kankanhalli, A unified end-to-end retriever-reader framework for knowledge-based VQA, in Proc. 30th ACM Int. Conf. Multimedia, Lisboa, Portugal, 2022, pp. 2061–2069.
[41] Y. Jiang, V. Natarajan, X. Chen, M. Rohrbach, D. Batra, and D. Parikh, Pythia v0.1: The winning entry to the VQA challenge 2018, arXiv preprint arXiv: 1807.09956, 2018.
[42] J. Lu, D. Batra, D. Parikh, and S. Lee, Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks, in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, eds. New York, NY, USA: Curran Associates, Inc., 2019, pp. 13–23.
[43] R. Mokady, A. Hertz, and A. H. Bermano, Clipcap: Clip prefix for image captioning, arXiv preprint arXiv: 2111.09734, 2021.
[44] H. Tan and M. Bansal, LXMERT: Learning cross-modality encoder representations from transformers, in Proc. 2019 Conf. Empirical Methods in Natural Language Processing and the 9th Int. Joint Conf. Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 2019, pp. 5100–5111.
[45] A. Kamath, C. Clark, T. Gupta, E. Kolve, D. Hoiem, and A. Kembhavi, Webly supervised concept expansion for general purpose vision models, in Computer Vision—ECCV 2022, S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, eds. Cham, Switzerland: Springer, 2022, pp. 662–681.
[46] S. Ravi, A. Chinchure, L. Sigal, R. Liao, and V. Shwartz, VLC-BERT: Visual question answering with contextualized commonsense knowledge, in Proc. IEEE/CVF Winter Conf. Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2023, pp. 1155–1165.
[47] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, Making the V in VQA matter: Elevating the role of image understanding in visual question answering, in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 6325–6334.
[48] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. J. Li, D. A. Shamma, et al., Visual genome: Connecting language and vision using crowdsourced dense image annotations, Int. J. Comput. Vis., vol. 123, pp. 32–73, 2017.
[49] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, arXiv preprint arXiv: 2302.13971, 2023.
[50] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv: 2307.09288, 2023.
[51] F. Ilievski, P. Szekely, and B. Zhang, CSKG: The commonsense knowledge graph, in The Semantic Web, R. Verborgh, K. Hose, H. Paulheim, P. Champin, M. Maleshkova, O. Corcho, P. Ristoski, and M. Alam, eds. Cham, Switzerland: Springer, 2021, pp. 680–696.
Fengyuan Liu is currently pursuing the master degree at the Southeast University - Monash University Joint Graduate School (Suzhou), Southeast University, China. His research interests include artificial intelligence, large language models, etc.

Yuan Meng is currently pursuing the PhD degree at the School of Computer Science and Engineering, Southeast University, Nanjing, China. Her research interests include knowledge graphs, natural language processing, etc.