
From Image to Language: A Critical Analysis of Visual Question Answering (VQA)

Approaches, Challenges, and Opportunities

Md Farhan Ishmam (a,b), Md Sakib Hossain Shovon (b,c), M.F. Mridha (b,c), and Nilanjan Dey (d)
(a) Department of Computer Science and Engineering, Islamic University of Technology, Dhaka, Bangladesh
(b) Advanced Machine Intelligence Research Lab, Dhaka, Bangladesh
(c) Department of Computer Science and Engineering, American International University, Dhaka, Bangladesh
(d) Department of Computer Science and Engineering, Techno International New Town, Kolkata, India

Abstract

The multimodal task of Visual Question Answering (VQA), encompassing elements of Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers to questions on any visual input. Over time, the scope of VQA has expanded from datasets focusing on an extensive collection of natural images to datasets featuring synthetic images, video, 3D environments, and various other visual inputs. The emergence of large pre-trained networks has shifted the early VQA approaches relying on feature extraction and fusion schemes to vision-language pre-training (VLP) techniques. However, there is a lack of comprehensive surveys that encompass both traditional VQA architectures and contemporary VLP-based methods. Furthermore, the VLP challenges viewed through the lens of VQA have not been thoroughly explored, leaving room for potential open problems to emerge. Our work presents a survey in the domain of VQA that delves into the intricacies of VQA datasets and methods over the field's history, introduces a detailed taxonomy to categorize the facets of VQA, and highlights the recent trends, challenges, and scopes for improvement. We further generalize VQA to multimodal question answering, explore tasks related to VQA, and present a set of open problems for future investigation. The work aims to guide both beginners and experts by shedding light on the potential avenues of research and expanding the boundaries of the field.
Keywords: Visual Question Answering, Vision Language Pre-Training, Multimodal Learning, Multimodal Large Language
Models

1. Introduction

Over the past decade, advancements in deep learning-based systems led to breakthroughs in visual and textual comprehension, enabling AI models to rival human performance in these domains. While understanding an image has been a domain of expertise for humans, it hasn't been so for AI. An AI-generated answer has been viewed as lackluster and simplistic. However, in recent times, the ability to distinguish between human-generated and AI-generated texts has become increasingly difficult, exemplified by the fact that the specific line in question has been crafted by an AI system. Visual Question Answering (VQA) is an AI problem that inherited all its principles and methodologies from the realms of Computer Vision (CV) and Natural Language Processing (NLP). But as VQA evolved, it carved its own identity with unique traits and nuances.

Visual Question Answering (VQA) has been traditionally defined as the problem of answering a question with an image as the context [1]. The current scope of VQA is not limited to a single image as the visual input but can be generalized to any form of visual input, e.g. a set of images [2] or videos [3, 4]. However, the term VQA is interchangeably used to define VQA on a single image and generalized visual input. Being a multimodal domain, visual question answering naturally works with visual and textual modalities but also works with auditory input, as seen in multimodal video question answering [5]. VQA is also closely related to other vision-language problems, starting from early works like image-sentence retrieval [6] to more recent tasks – Visual Reasoning [7], Image Captioning [8], Visual Dialogue [9], etc.

In the early days, VQA struggled with defining a dataset that could be used as a benchmark for the problem. Initially, the problem definitions were restrictive but were later expanded to free-form open-ended question answering with the advent of the VQA dataset [1], regarded as the first standard dataset in this domain. VQA 2.0 [10], the successor to the VQA dataset, also gained immense popularity while addressing some of the major limitations of the VQA dataset. The subsequent datasets increased the task complexity by requiring a high level of reasoning [11, 12, 13], incorporated external knowledge [14, 15, 16], and extended the task to different variations of visual input [3, 17, 18, 19]. Recently, VQA datasets saw substantial work in Video Question Answering [4], Medical VQA [20], and datasets on plots, figures, and graphs [21, 22, 23].

The VQA methodologies have also undergone several phases but have permanently shifted to deep learning-based methods.
Figure 1: Taxonomy and definitions of several VQA tasks based on domain (where and how VQA is applied), modality (type and source of information), and answer generation (type of output and how it is produced).

The earlier methods [1, 24, 25] primarily relied on a visual and textual encoder to extract features from the multimodal inputs, followed by some form of fusion strategy to combine the encodings. The fused output is then passed to a classifier or generator depending on answer generation being treated as a classification [24] or generative problem [25]. Modern VQA methods have shifted from this joint encoding scheme to Vision Language Pre-training (VLP) [26, 27], employing transformer [28] architectures trained on generalized tasks with large image-text pair datasets and then fine-tuned to downstream tasks like VQA [29, 30, 31].
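To make the joint encoding scheme concrete, the following is a minimal PyTorch sketch of the classification-style pipeline described above. It does not reproduce any specific published model: the image features, the LSTM question encoder, and the element-wise fusion are illustrative assumptions, and real systems vary in each of these choices.

```python
import torch
import torch.nn as nn

class JointEncodingVQA(nn.Module):
    """Illustrative encoder-fusion-classifier VQA model (not a specific published architecture)."""

    def __init__(self, vocab_size, num_answers, img_feat_dim=2048, hidden_dim=512):
        super().__init__()
        # Textual encoder: embed question tokens and summarize them with an LSTM.
        self.embed = nn.Embedding(vocab_size, 300)
        self.lstm = nn.LSTM(300, hidden_dim, batch_first=True)
        # Visual encoder: project precomputed CNN features (e.g., pooled ResNet features).
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        # Classifier head: answer generation treated as classification over a fixed answer set.
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, img_feats, question_tokens):
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                   # question encoding
        v = torch.tanh(self.img_proj(img_feats))    # image encoding
        fused = q * v                               # element-wise fusion of the two modalities
        return self.classifier(fused)               # logits over candidate answers

# Toy usage: batch of 2 samples, 14-token questions, 3000-answer vocabulary.
model = JointEncodingVQA(vocab_size=10000, num_answers=3000)
logits = model(torch.randn(2, 2048), torch.randint(0, 10000, (2, 14)))
print(logits.shape)  # torch.Size([2, 3000])
```

Swapping the classifier head for a text decoder yields the generative variant, while VLP-based models replace the separate encoders and hand-crafted fusion with a single pre-trained transformer that is fine-tuned on VQA.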
We believe that the rapidly evolving field of VQA is far from saturation and has plenty of open problems, research challenges, and scope for improvement. While existing VQA surveys thoroughly explored pre-VLP architectures [32, 33], contemporary VLP surveys [26, 27] do not explicitly delve into the domain through the lens of VQA. Our survey bridges the gap by extending the traditional VQA surveys to encompass VLP techniques, thereby providing a comprehensive overview of the whole domain. In this review, we introduce a comprehensive taxonomy of VQA problems, datasets, and methods in order to organize the vast amount of research work throughout the years. In section 2, we explore various applications of VQA in visually impaired assistance, the medical domain, the educational domain, visual chatbots, etc. Section 3 defines the problem of VQA and the scope of the domain. Then, section 4 delves into the existing generalized surveys in VQA along with specialized VQA surveys on fusion techniques, language bias, video QA, etc., and related surveys in VLP, Image Captioning, Multimodal Machine Learning, Zero-Shot Learning, etc.

Our work aspires to serve as a beginner's roadmap to VQA by highlighting the major works in VQA datasets, methods, and metrics over the last decade, covered in sections 5, 6, and 7, respectively. The review also aims to navigate future researchers by directing them toward the research challenges faced in modern VQA, as seen in section 8. Furthermore, section 9 positions VQA in the broader domain of multimodal learning and explores related domains and sub-domains. Finally, section 10 highlights the trends, presents unsolved open problems, and discusses the future of the domain. We hope to see this survey leading to potentially ground-breaking work in VQA and its related domains.

2. VQA Applications

While the current applications of VQA methodologies are scarce, there are a lot of potential areas that can improve significantly using VQA. Real-world applications of VQA with its emerging use cases have been extensively studied by Barra et al. [35]. In the subsequent subsections, we shall explore some of the domains that use VQA.

2.1. Assistance of the Visually Impaired and VizWiz Challenges

Vision can be considered the most important and frequently utilized sensor of the human body. Assistive technologies for the visually impaired [34] have been a field of interest even before the popularity of automatic VQA systems. Inspired by
computer vision datasets and techniques, the continuous research enabled the creation of the VizWiz dataset by Gurari et al. [36]. The primary purpose of the dataset is to train VQA models that can automatically provide assistance to visually impaired people by answering questions on real-world images based on data collected by a mobile application.

Figure 2: Overview of a visually impaired assistance system similar to the VizWiz Mobile App [34] utilizing Multimodal Large Language Models (MLLMs). The app allows visually impaired users to take pictures of their environment along with a voice-recorded question. Automatic Speech Recognition (ASR) processes the audio input, generating a textual question that is sent to the MLLM along with the image. The MLLM then produces a comprehensive answer, which is converted back to audio through Speech Synthesis. Alternatively, the audio can be sent directly to the MLLM for Audio VQA, and dedicated modules for ASR, Speech Synthesis, and VQA can also be used separately.

Works on visual challenges faced by visually impaired people date back to 2010 when Bigham et al. [34] proposed the VizWiz mobile application that relied on Amazon Mechanical Turk workers to manually respond to visual questions. The users would send an image and a voice-recorded question to the server, which would respond with an answer produced by the workers. Although the task of question answering was done manually, it showcased the necessity of an automated VQA system. The VizWiz organization established a common goal of assisting visually impaired people through the development of datasets and algorithms for assistive technologies and carried out substantial research work [36, 37, 38] in the following decade.

The launch of the VizWiz app was followed by several similar works for the visually impaired, e.g. perceiving fashion [39], which deals with answering subjective fashion-based questions; exploring the visual challenges faced in daily life [40] through the creation of a dataset with more than 40k visual questions from 5k visually impaired users; and assisting users on a video stream in longer conversations [41] by highlighting the problem of Video Question Answering. While these works are prominent in the pre-deep learning VQA era, they strongly advocated for the development of an automatic system in this domain, which was subsequently introduced by Gurari and Grauman [42] through Crowdverse, an automatic system for predicting the answerability of a visual question. As visually impaired people have no knowledge of the image taken by the app, the answerability of that particular image is the first problem in creating a completely automatic system.

In the following year, Gurari et al. [36] proposed the VizWiz grand challenge through the creation of the VizWiz dataset with two tasks - answerability of a visual question from the VizWiz app and automatically generating an answer if the question is answerable. With more than 30k visual questions, the VizWiz dataset paved the path for subsequent VQA models that aimed to assist the visually impaired. The challenge had been substantially difficult compared to other VQA problems as the images taken in the app do not have ideal conditions. A standard image from traditional datasets like VQA v2 [10] will be clear with good lighting conditions, while the images from the VizWiz dataset can be blurry, noisy, have unfavorable lighting conditions, and in many cases are unanswerable by humans. Gurari et al. [37] also proposed the VizWiz-private dataset that deals with images containing private content in the same setup as VizWiz.

Recently, the mobile application Be My Eyes¹ was launched to provide automated assistance to the visually impaired. With a user base exceeding 500,000 individuals from 150 countries, the app utilizes the multimodal LLM GPT-4 [43] to generate VQA responses. The VizWiz challenge has also been expanded to other domains, e.g. image captioning (VizWiz-Captions by Gurari et al. [44]), a few-shot localization setting by Tseng et al. [38], and Visual Grounding by Chen et al. [45].

¹ https://www.bemyeyes.com/
Figure 3: A standard visual dialog system should be capable of a multitude of unimodal and bimodal tasks. Some of the tasks include but are not limited to tracking conversational history during question answering, performing any form of textual or visual reasoning, being able to answer questions that require external knowledge, generating image captions, and generating synthetic images.

venomous. The person can further ask about the steps required to treat a venomous snakebite.

An ideal medical VQA system should be able to generate a comprehensive answer for free-form and open-ended questions on medical images encompassing the whole or majority of the domain. The model must also possess knowledge and reasoning capabilities beyond the scope of its training data as unseen cases are frequently encountered in the medical domain. The systems should also exhibit low false positive rates in the identification of medical conditions, similar to the low false positive rates of the corresponding classifiers. Additionally, the systems should be reliable both in terms of answer quality and accessibility. The accuracy of the system's responses should closely align with that of professionals in the domain. Current Med-VQA models mostly satisfy these criteria, suggesting that a future where Med-VQA systems are fully integrated into global healthcare services may be closer than we can imagine.

2.3. Education

The education sector is responsible for nurturing the researchers of the next generation. The integration of AI in education can potentially improve the accessibility and scope of education. It should be noted that the scope of this field is quite broad, starting from teaching preschoolers to offering courses for professionals. Question-answering is a fundamental technique used to reinforce and evaluate learned concepts. VQA systems can be great learning assistants to students, hobbyists, and professionals.

He et al. [47] proposed an automatic robot system that uses VQA to teach preschoolers and act as their companion. The robot captures images of objects from the environment and asks questions regarding them using various multimodal modules for QA generation. The work has been expanded to a separate subdomain called Educational Robotics [48], often relying on VQA-based techniques with comprehensive answer generation. During the pandemic, Sophia and Jacob [49] proposed visual chatbots to assist students in education. Current state-of-the-art generative AI models [50, 43] can provide better assistance as they excel at Visual Dialog [9]. Another promising technique for educative VQA is the gamification of VQA systems as proposed by Suresh et al. [51], where game-based systems can incentivize students to learn new visual concepts while the core of the assessment relies on automated VQA. Visual question generation [52] can complement VQA systems to create new content for evaluating learned concepts.

VQA on infographics [23], diagrams [53], plots [17], and documents [21] is also opening scopes of improvement in the education domain, especially for business analysts and similar professions. Bongini et al. [54] introduced VQA to automate museum guides, which can be used as a source of knowledge by active learners to know more about cultural heritage and artworks. Several VQA works [55, 56, 57] were proposed for textbooks, PDFs, and slides, which are rich sources of knowledge for both students and professionals. In the future, we can expect to see education domain-specific generative models built on top of foundational models [58] to fully automate any form and modality of educative question-answering.

2.4. Visual Dialogue/ChatBot

Visual Chatbots are generative AI models that are capable of performing Visual Dialogue [9] and have shown substantial potential fueled by the popularity of Large Language Models (LLMs). Before exploring the realms of generative AI, a term that will be repeatedly used throughout the literature, let us first
define what generative AI or genAI is. GenAI models are capable of creating new content including but not limited to texts, images, and videos. Foundational models [58] can be defined as generalized AI models trained on a large amount of data and capable of excelling in a variety of AI tasks. The foundational models are usually fine-tuned for domain-specific downstream tasks like Visual Question Answering (VQA).

Detouring from the field of generative AI, we will explore language modeling in natural language processing. Language Models primarily deal with assigning probability values to words, being trained on a text corpus. Large Language Models (LLMs) are upscaled versions of language models with model parameters ranging from a few hundred million to billions. The GPT-based models [59] are notable among the foundational models being used as LLMs, following the success of the revolutionary transformer architecture [28] in drastically improving the performance of language models and enabling parallelization on sequential data.

A proprietary GPT-based model, ChatGPT, derived from GPT-3 [60], popularized conversational AI, which piqued users' interest in visual conversational AI. Wu et al. [50] introduced Visual ChatGPT by using ChatGPT with computer vision-based models and developing a modular architecture for vision-language tasks. The promising results from Visual ChatGPT, followed by groundbreaking results from the multimodal LLM GPT-4 [43], ensure that Visual Chatbots will soon be deployed in real-world applications. A key point to note is that these GPT-based models are generative while VQA models are traditionally discriminative – both of these properties will be explored in section-10.1.4.

2.5. Miscellaneous

Apart from the aforementioned applications of VQA, there are several miscellaneous use cases worth exploring. The Visual Chatbots described in section-2.4 may not be considered an end-product but can be integrated into various systems to provide customer service. The domains of application include but are not limited to product recommendation, troubleshooting, and website tutorials. Furthermore, VQA can be extensively used for qualitative data analysis from visual diagrams, visual accessibility, streamlining the user experience, etc. QA on diagrams enables us to develop better ways to interact with the digital visualizations that dominate today's businesses.

Toor et al. [61] proposed a VQA system called C2VQA-BOARS for the surveillance of individual biometric data on both images and videos. VQA can also be part of information retrieval systems applicable in several domains like medical, geospatial, military, remote sensing, etc. Sarkar and Rahnemoonfar [62] and Sarkar et al. [63] utilized VQA architectures to assess post-disaster damage, which can be useful in prioritizing aid to regions affected by disasters. Furthermore, current VLP techniques used in VQA improve the performance of image-text alignment in general. Paired with big data systems, such VLP systems will be the key to the utilization of large visual corpora and establishing meaningful relationships between visual and textual pairs. Consequently, any application domain dependent on conversing between images and texts will benefit from the state-of-the-art VLP systems.

3. Definition and Scope

The task definition of VQA has evolved throughout the years – starting from single image-based question answering to QA on any type of visual input. The visual input can take forms including but not limited to images [1], video [3, 4], GIFs [64], sets of images [2], diagrams [17], slides [57], and 360° images [65]. VQA systems are widely used to solve subtasks of other complex multimodal problems. EmbodiedQA [66], discussed in section-9.2.5, requires a Reinforcement Learning (RL) agent to use VQA to answer questions in a 3D synthetic environment. Visual Dialog [9], explored in section-2.4, can also have VQA subtasks as depicted in fig-3 but needs to incorporate information retrieval, commonsense reasoning, and conversational memory along with the standard setting of VQA.

The scope of VQA is bounded by any form of multimodal question-answering problem related to generating a textual answer, Y_t, to a textual question, X_t, on a visual input, X_v. We can formally define VQA as a mapping V such that

V : X_v, X_t → Y_t    (1)
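Read as code, Eq. (1) is simply a mapping from a (visual input, question) pair to an answer string. The minimal Python sketch below expresses that interface; it is an illustration rather than the API of any VQA library, and the `scorer` and `decoder` callables are hypothetical placeholders for the discriminative and generative answer-generation procedures discussed later in this section.

```python
from typing import Any, Protocol, Sequence, Union

# X_v can be a single image, a set of images, a video, a chart, etc.
VisualInput = Union[Any, Sequence[Any]]

class VQAModel(Protocol):
    """V : X_v, X_t -> Y_t — anything exposing this signature acts as a VQA model."""
    def answer(self, visual: VisualInput, question: str) -> str: ...

class ClassificationVQA:
    """Discriminative setting: pick Y_t from a fixed set of answer classes."""
    def __init__(self, answer_classes: list, scorer):
        self.answer_classes = answer_classes
        self.scorer = scorer  # hypothetical callable: (visual, question, answer) -> float

    def answer(self, visual: VisualInput, question: str) -> str:
        # Return the highest-scoring candidate answer.
        return max(self.answer_classes,
                   key=lambda a: self.scorer(visual, question, a))

class GenerativeVQA:
    """Generative setting: decode free-form text, as in LLM-based models."""
    def __init__(self, decoder):
        self.decoder = decoder  # hypothetical callable: (visual, question) -> str

    def answer(self, visual: VisualInput, question: str) -> str:
        return self.decoder(visual, question)
```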
The flexibility of having any form of visual input allows VQA to have various sub-tasks that can be based on the domain, modality, or answer generation procedure. Referring to fig-1, most of the works in VQA are done on realistic or natural images. The performance of the model on generalized visual input is primarily evaluated by the ability of the model to correctly answer questions on natural images. Most of the natural images are sourced from the COCO dataset by Lin et al. [67], while MSVD and MSR-VTT [68] have been popular sources for natural videos. The sources of VideoQA tend to be more diverse due to the large variation of video content in general.

A popular setting of VQA uses graphs, charts, documents, and texts in natural scenes as the visual input. Document and text interpretation by VQA models works with techniques similar to Optical Character Recognition (OCR) [69] systems, i.e. by extracting textual information from the images and using it as a context to answer the question. As seen in section-2.2, VQA is also gaining popularity in the medical domain, primarily due to the scope of developing potential educational, research, and service-related medical applications.

A separate branch of VQA is dedicated to criticizing existing datasets and methods by highlighting biases and the limited ability of VQA models to reason. Sec-5.3 will discuss the datasets applicable for reasoning and bias mitigation. Commonsense reasoning is often treated as a separate task known as Visual Reasoning [70]. Ideally, VQA models must have a reasonably high capability of understanding both visual and textual concepts. Johnson et al. [11] challenged existing models with the CLEVR dataset that incorporated complex questions requiring a good depth of image and textual semantic understanding. On the other hand, models might exhibit linguistic bias due to dataset distribution [71]. Linguistic bias causes VQA models to
learn subtle correlations between visual and/or textual entities in the dataset, causing the models to generate a correct answer without following the proper path of reasoning. Bias is also exhibited by the annotators due to various factors including but not limited to age, sex, gender, and location [72].

VQA also has several miscellaneous domains, e.g. counting-based questions by Acharya et al. [73], QA on change detection by Yuan et al. [74], QA on 360° images [65], etc. Further variations of VQA based on the modality will be discussed formally in 9.3. The definition of VQA also varies based on the procedure of answer generation and will be discussed extensively in sec-6.1.4. While traditionally the problem of VQA has been treated as a classification problem [24], contemporary LLM-based architectures are shifting towards generative answers [75].

4. Surveys in VQA

Surveys and reviews are gateways to new research openings and introduce newcomers to a particular domain. A comprehensive survey is beneficial to beginners by navigating them through the intricacies of the domain and providing high-level research direction. Similarly, domain experts and experienced researchers equally benefit from surveys that highlight the challenges, recent trends, and open problems in the field. Surveys can also save time in reviewing literature and help organize works in the domain.

Over the years, VQA has garnered significant attention in the research community, resulting in numerous high-quality surveys [33, 32]. The surveys can be roughly categorized into generalized surveys that provide a bird's eye view of the whole domain itself and specialized or critical surveys that extensively focus on a part of the domain. Hence, generalized surveys tend to explore the breadth of the domain by expanding on newer topics while specialized surveys explore the depth of certain topics through critical analysis. Although our work tries to provide a broad overview of the field, we also provide an in-depth analysis of some of the challenges in the domain.

Researchers also benefit from exploring surveys on tasks closely related to VQA [8] or more generalized surveys that encompass various multimodal problems along with VQA [76]. The VLP techniques used by contemporary models are closing the gap between various vision-language problems. Hence, researchers will equally benefit from exploring VLP surveys [26, 27] that can be considered specialized reviews in the domain. In recent years, there has been a steady decline in "VQA-specific" surveys due to the rise of VLP. To the best of our knowledge, there is no existing review that encompasses the domain of VQA from the early deep-learning era relying on CNNs and LSTMs to the modern VLP techniques. Our work solely aims to bridge the gap between these two classes of survey and provide a single comprehensive review encapsulating the evolution of the domain.

4.1. Generalized Surveys

The generalized surveys emphasize the traditional single image-based setting [1] and explore the associated datasets, models, metrics, challenges, and opportunities. Table-2 highlights some of the prominent generalized surveys along with the challenges, open problems, and key contributions. Generalized surveys primarily serve as a guide to entry-level researchers in the domain, are more comprehensive in nature, and introduce basic forms of categorization.

Recently, the domain of VQA has been revolutionized by both generative AI and new approaches to the zero-shot setting. Such trends haven't been explored by the existing surveys, which might halt the fast-paced research in the domain. Additionally, most of the taxonomy introduced in these surveys has become outdated as the domain is rapidly evolving. The surveys rarely placed VQA in the world of multi-modal problems and compared it with related domains.

4.2. Specialized Surveys

Specialized surveys in VQA can be dedicated to a particular sub-domain of VQA [4, 20], particular phases of VQA [77, 78], or any topic of interest [35]. In addition to surveys strictly on VQA, related surveys on generalized topics like Vision and Language problems [79] or Multimodal Learning [76] can provide researchers a broader view of the domain. Such surveys are key to establishing new challenges and open problems in the domain. Researchers will also benefit from surveys that are closely related to VQA, e.g. Image Captioning [8]. Being a multimodal problem that deals with image understanding and text generation, image captioning methodologies are often embraced in VQA due to the high levels of similarity. Furthermore, surveys on Vision Language Pre-training (VLP) [26, 27] are valuable guides to researchers wishing to explore contemporary architectures.

Barra et al. [35] - 2021 - Applications in VQA
Zhang et al. [77] - 2019 - Fusion Techniques
Lu et al. [78] - 2023 - Fusion Techniques
Yuan [80] - 2021 - Language Bias
Yusuf et al. [81] - 2022 - Graph Convolutional Networks
Zhong et al. [4] - 2022 - Video QA
Lin et al. [20] - 2023 - Medical VQA
Kafle et al. [79] - 2019 - Vision and Language Research
Mogadala et al. [82] - 2021 - Vision and Language Research
Gan et al. [26] - 2022 - Vision Language Pre-training
Chen et al. [27] - 2023 - Vision Language Pre-training
Baltrušaitis et al. [76] - 2018 - Multimodal Machine Learning
Fu et al. [83] - 2018 - Zero-shot Recognition
Chen et al. [84] - 2021 - Zero-shot Learning
Hossain et al. [8] - 2019 - Image Captioning

Table 1: Topics covered by prominent specialized surveys related to VQA
Wu et al. [33] (2017)
Challenges & Open Problems: Question Constraints; Visual and Textual Understanding; External Knowledge; Preference for Computer Vision-based methods.
Contributions: Comprehensive, the most cited VQA survey; categorization and generalization of early datasets and methods; discusses emerging works like Structure Scene Text Annotation [85].

Kafle and Kanan [32] (2017)
Challenges & Open Problems: Method superiority; Dataset Bias; Attention in VQA; Open-Ended (OE) and Multiple Choice (MC) Evaluation.
Contributions: Emphasis on problem formulation related to Vision and Language tasks; qualitative comparison of methods and evaluation metrics; elaborative discussion on research challenges with recommendations.

Gupta [86] (2017)
Challenges & Open Problems: Answer Type Prediction Models; Hybrid models; Answer Generation.
Contributions: Simple and straightforward survey providing an introductory view; focuses on a few important datasets and methods.

Teney et al. [87] (2017)
Challenges & Open Problems: Dataset Bias; Zero-Shot VQA; External Knowledge; Modular Architectures.
Contributions: Reviews fundamental techniques with phase-by-phase generalization; emphasizes attention-based and memory-augmented architectures; highlights advanced methodologies and domain trends.

Hassantabar [88] (2018)
Challenges & Open Problems: Complex reasoning; short-term memory; counting-based questions.
Contributions: Introductory survey similar to [86] focusing on a few datasets and models.

Manmadhan and Kovoor [89] (2020)
Challenges & Open Problems: R-CNN [90] based Image Featurization; Out of Vocabulary words; Transformer-based Architectures; Sentence-based Embeddings.
Contributions: Guide for newcomers expositing fundamental concepts; highlights computer vision subtasks to solve VQA; phase-wise comparison of methods in different VQA architectures.

Sharma and Jalal [91] (2021)
Challenges & Open Problems: OE and MC Evaluation; Dataset Bias; real VQA Image Featurization; Conversational and Scene Text Questions.
Contributions: Compares traditional VQA models to scene text VQA models; detailed result analysis on 13 prominent VQA datasets; introduces a few open challenges in the domain.

Srivastava et al. [92] (2021)
Challenges & Open Problems: Incorporating CV and NLP strategies; Real-life Datasets.
Contributions: Highlights major breakthroughs in VQA; discussions and analysis based on architectural paradigms.

Table 2: Overview of existing generalized VQA surveys


5. Datasets

The domain of VQA has a large number of rich datasets introduced throughout the years. The challenge of establishing a VQA dataset dates back to the pre-deep learning era of VQA. The early VQA methods relied on manual answer generation [34] or used probabilistic methods [93]. Prior to establishing a standard dataset, the domain faced challenges in formulating a problem statement as most of the early datasets had restricted settings [93]. As deep learning-based techniques evolved rapidly in both vision and language, the necessity of establishing a benchmark dataset for VQA became evident. Several benchmarks [24, 1] were proposed in the early deep learning era of VQA, followed by datasets that criticized various aspects of these benchmarks [11, 94, 95]. The latter class of datasets, often referred to as diagnostic datasets, challenged aspects like visual reasoning, image understanding, and language bias in existing VQA benchmarks.

Concurrently, the early deep-learning era saw prominent works in knowledge-based datasets [14, 15] that focused on extracting knowledge from an external source or answering fact-based questions. The problem was often extended to answering questions beyond the training data [96], resulting in defining VQA in zero-shot settings. VQA also experienced growth in the number of application-based datasets, primarily VizWiz [36] and VQA-Med [97] – both described in depth in section-2. Recent dataset trends are focused on Video Question Answering [3, 4] and figure-text-based question answering [17, 21]. Video QA continues to rapidly evolve as a sub-domain, with notable trends towards multimodal Video QA datasets encompassing other modalities along with vision. The following sub-sections will explore the broader categories of VQA datasets.

5.1. Traditional Datasets

The datasets and task specifications evolve in every domain, often leading to outcomes that differ significantly from the initial setting. VQA is no exception, as it started off as question-answering on natural images but is currently extended to any form of visual input. Although the constraints on the input and the nature of the output have changed, the use of a single image remains a popular VQA setting.

Traditional datasets or benchmark datasets aim to create a standard for training and evaluating VQA models by providing a large amount of data that closely resembles real-world scenarios, i.e. utilizing natural images with human-annotated QA pairs [24, 1]. However, the datasets are restricted to a single image only. Traditional datasets are popular performance benchmarks as they provide an overall assessment of the model. Coupled with the fact that the traditional datasets bear a strong resemblance to the deployed environment, they are extremely popular among the VQA community. However, these datasets have been subject to several limitations, notably language bias [10], limited reasoning on both visual and textual modalities [11], and not incorporating questions beyond the knowledge of the training data [96].

5.1.1. DAQUAR and COCO-QA

Malinowski and Fritz [93] introduced the DAQUAR dataset as an early VQA benchmark and afterward initiated the Visual Turing Challenge [98]. DAQUAR was a relatively small dataset restricted to indoor images but was extended with additional annotated answers to the DAQUAR-Consensus dataset [25]. Following approaches for textually annotating large image datasets done in image captioning [99], Ren et al. [24] proposed the COCO-QA dataset by algorithmically generating QA pairs for the COCO dataset [67]. COCO has been the primary source of natural images for all the subsequent traditional datasets and remains the most popular source of images for single-image QA.

5.1.2. VQA v1 and v2

Antol et al. [1] introduced the VQA dataset by using unrestricted or free-form questions on both real and synthetic data. The VQA dataset, later known as VQA v1 or 1.0, became a domain benchmark but was quickly criticized due to linguistic bias [100, 10]. VQA was termed "unbalanced", meaning that the models trained on the dataset emphasized the question while often disregarding the image. The problem was countered by VQA v2 [10] by introducing a counterexample that will produce a different answer to the same question and hence force the model to comprehend or look at the image. Both VQA and VQA v2 have been popular benchmarks for model evaluation throughout the years.

5.1.3. Visual Genome and Visual7W

The Visual Genome project has been the source of two rich VQA datasets, Visual7W [101] and Visual Genome [85]. With 1.7M QA pairs, the Visual Genome project is one of the richest datasets in both VQA and the introduced task of Visual Grounding [101, 102]. For every image in the Visual Genome dataset, there is a scene graph that is used to automatically generate candidate answers.

In order to introduce diversity in questions and create VQA challenges that correspond to computer vision challenges, the datasets introduced the "W"s in creating QA annotations. The 7 "W"s – "what", "who", "when", "where", "why", "how", and "which" – are commonly used in political journalism for complete storytelling [103]. The Visual7W dataset uses all 7 "W"s while the Visual Genome dataset uses 6 "W"s, omitting "which". Visual7W was described as a subset of the Visual Genome project with a few additional annotations.

5.1.4. Other Datasets

Yu et al. [104] took a different approach to VQA by introducing fill-in-the-blanks with multiple options. Gao et al. [105] annotated the QA pairs in Chinese and then translated them to English to create a multilingual dataset. The work was pivotal in introducing non-English VQA systems and subsequently inspired many non-English VQA datasets [106, 107, 108]. Kafle and Kanan [109] introduced 12 categories of questions with a new evaluation scheme to fight biases due to the abundance of a certain question category.
Figure 4: Timeline of popular VQA datasets (2015-2023)

OK-VQA [110] has been a popular benchmark for single-image VQA, primarily used to evaluate many modern zero-shot VQA models [75, 111] and Multimodal LLMs [43, 112]. The dataset will be categorized separately due to incorporating external knowledge, but readers are encouraged to use this dataset as a benchmark for standard, few-shot, and zero-shot settings of VQA models.

5.2. Knowledge-based (KB) Datasets

The VQA datasets were usually limited by the training data, i.e. models were not expected to generalize beyond the training data. However, realistic scenarios expect the models to possess general knowledge and common sense. KB-VQA [14] is defined as the setting of VQA that deals with answering knowledge-based questions by extracting the data from a secondary source, i.e. it adds a form of information retrieval to the standard setting of VQA. The source is often referred to as a secondary source of knowledge as we consider the training data to be the primary knowledge source for a VQA model. Additionally, there can be multiple ways to represent a secondary source, e.g. knowledge bases, knowledge graphs, etc.

A knowledge base (KB) can be defined as a collection of knowledge triplets, (e_1, r, e_2), such that e_1 and e_2 represent two entities and r represents the relationship between them. The set of triplets can also be formulated as a graph called a Knowledge Graph (KG). However, unless specified, we use the term knowledge base to represent any external knowledge source including knowledge graphs. Knowledge bases typically have large-scale structured data stored in relational database management systems (RDBMS) accessible using any form of query language. KB-VQA systems can have single, multiple, or no specified KBs.
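As a concrete illustration of this definition, the sketch below (a toy example, not tied to any particular KB such as DBpedia or ConceptNet) stores knowledge triplets (e_1, r, e_2) in memory and retrieves the facts attached to an entity mentioned in a question — the kind of lookup that a KB-VQA system couples with its visual pipeline.

```python
from collections import defaultdict

# A knowledge base as a set of (e1, relation, e2) triplets (toy facts for illustration).
triplets = [
    ("Eiffel Tower", "is_a", "tower"),
    ("Eiffel Tower", "located_in", "Paris"),
    ("Paris", "capital_of", "France"),
]

# Index the triplets as a graph: entity -> list of (relation, other_entity).
graph = defaultdict(list)
for e1, r, e2 in triplets:
    graph[e1].append((r, e2))

def retrieve_facts(entity: str):
    """Return all facts whose head entity matches (a stand-in for a KB/KG query)."""
    return graph.get(entity, [])

# A KB-VQA system would detect "Eiffel Tower" in the image or question,
# then feed the retrieved facts as extra context when generating the answer.
print(retrieve_facts("Eiffel Tower"))
# [('is_a', 'tower'), ('located_in', 'Paris')]
```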
5.2.1. KB-specified Datasets

Wang et al. [14] proposed a test dataset to evaluate the performance of VQA models on external data and thereby formulated the first knowledge-based setting of VQA systems. The proposed Ahab model relied on DBpedia [113] as the source of external knowledge. The KB-VQA dataset was relatively small and focused on pre-defined template-based questions. The FVQA dataset proposed by Wang et al. [15] extended the previous setting by ensuring that each question has a supporting fact that can be extracted from the associated KB. Furthermore, utilizing multiple knowledge bases [114, 115] along with DBpedia resulted in the reduction of dependency on a single KB.

Lu et al. [116] proposed the R-VQA dataset based on the extraction of relational facts using samples from Visual Genome [85] and the associated knowledge graph (KG). R-VQA aimed to address the semantic gap between languages and images during information extraction from images or knowledge sources. Lin et al. [117] further addressed the imbalance of FVQA by introducing adversarial samples, ensuring a higher degree of robustness. However, all the KB-specified challenges are dependent on their associated KBs, resulting in the inability to generalize beyond the specified KBs.

5.2.2. Open Knowledge VQA (OK-VQA) Datasets

Marino et al. [110] proposed a generalized dataset like VQA [1] but for KB systems. The dataset inherits all the attributes of traditional datasets along with questions related to external knowledge. The diverse set of questions, with the challenges of incorporating external knowledge, resulted in a large yet difficult dataset. As the dataset wasn't restricted to a particular KB, it shaped a new challenge of VQA with open knowledge or open domain.

Jain et al. [118] argued that models trained on OK-VQA were able to achieve high scores while predicting based on the answer distributions. They introduced their own version of OK-VQA as OK-VQAS3, along with another new dataset, S3VQA. Following the steps of OK-VQA, Schwenk et al. [16] introduced A-OKVQA as a difficult benchmark by incorporating reasoning-based questions and adding rationale to the QA annotations.
DAQUAR [93] (2014) - Task: OE - 1,449 images, 12,468 questions - Images: natural images from NYU-Depth V2 [119] - Questions: synthetic template-based and human annotated, mostly object related - Answers: limited to 37 or 894 answer classes.

COCO-QA [24] (2015) - Task: OE - 117,684 images, 117,684 questions - Images: natural images from MS-COCO [67] - Questions: generated using a description-to-QA algorithm - Answers: one-word, class-based answers.

Visual Madlibs [104] (2015) - Task: MC, FITB - 10,738 images, 360,001 questions - Images: natural images from MS-COCO - Questions: automatically generated from COCO image captions [99] - Answers: each question is answered by 3 workers on Amazon Mechanical Turk (AMT).

FM-IQA [105] (2015) - Task: OE - 158,392 images, 316,193 questions - Images: natural images from MS-COCO - QA: human annotations from Baidu's online crowdsourcing.

VQA real [1] (2015) - Task: OE, MC - 204,721 images, 614,163 questions - Images: natural images from MS-COCO - Questions: human annotations by AMT workers with at least 3 questions per image - Answers: each question is answered by 10 AMT workers.

VQA abstract [1] (2015) - Task: OE, MC - 50,000 images, 150,000 questions - Images: cartoon-like synthetic clipart images - QA: same annotation scheme as VQA real.

Visual7W [101] (2016) - Task: MC, VG - 47,300 images, 327,939 questions - Images: natural images from MS-COCO - QA: subset of Visual Genome QA pairs(1) with additional annotations on VG, MC, etc. by AMT workers.

Visual Genome [85] (2017) - Task: OE, VG - 108,077 images, 1,773,258 questions - Images: natural images from MS-COCO and YFCC100M [120] - QA: free-form and region-based QA pairs based on six "W"s(2) annotated by AMT workers.

VQA v2 [10] (2017) - Task: OE, MC - 204,721 images, 1,105,904 questions - Images: natural images from MS-COCO - QA: same as VQA with additional QA pairs related to complementary images.

TDIUC [109] (2017) - Task: OE - 167,437 images, 1,654,167 questions - Images: natural images from MS-COCO and Visual Genome - QA: 12 types of questions, both automatically and manually generated with 12 volunteers.

Table 3: Comparison of task category, images, and QA pairs of traditional VQA datasets. OE - Open Ended, MC - Multiple Choice, FITB - Fill In The Blanks, VG - Visual Grounding

1 The Visual Genome is a large-scale crowdsourced project. Although the corresponding paper was published after Visual7W, the Visual Genome dataset had been publicly available prior to its official publication.
2 The 6 "W"s are what, who, when, where, why, and how. The additional "W" introduced in Visual7W is which.
DAQUAR [93]
Contributions: First VQA benchmark to attempt the Visual Turing Test [98]; various question categories with extensible templates.
Limitations: Insufficient data to train large models; limited to indoor scenes with unfavorable lighting conditions; questions are restricted to templates and answers are limited to classes; complicated model evaluation due to multiple metrics.

COCO-QA [24]
Contributions: Larger dataset with a standardized image source [67]; QA algorithm is extensible to other image captioning datasets [121, 122]; easier evaluation due to formulation as a classification problem.
Limitations: Unnatural and grammatically inaccurate questions; limited question diversity; answers are limited to a single word only.

Visual Madlibs [104]
Contributions: Proposes the novel task of fill-in-the-blanks (FITB) with multiple choices; diversified question prompts.
Limitations: Insufficient answers for open-ended evaluation; FITBs based on declarative sentences are easily answerable.

FM-IQA [105]
Contributions: Multilingual dataset (English and Chinese); free-form questions with diversified answer choices; rigorous quality assurance for Chinese QA pairs.
Limitations: Visual Turing Test-based manual evaluation is unscalable; English QA pairs may not be accurate due to automated translation.

VQA [1]
Contributions: Benchmark for free-form VQA used for evaluating many models; diversified dataset with realistic and synthetic images; high answer-to-question ratio with automatic evaluation.
Limitations: Unbalanced dataset resulting in questions answerable without images; lack of reasoning-based and complex questions; subjective questions without a single correct answer.

Visual7W [101]
Contributions: Introduces the task of visual grounding; QA diversity corresponding to multiple standard vision tasks.
Limitations: Lacks binary (yes/no) questions; wide performance gap between humans and AI.

Visual Genome [85]
Contributions: Largest free-form dataset based on QA pairs; diversified QA pairs on multiple image regions; attribute- and relationship-based QA pairs using scene graphs.
Limitations: Difficult to evaluate long answers; inherently too large, resulting in preference for its subset [101].

VQA v2 [10]
Contributions: Introduces counter-examples to create a balanced dataset; reduction of language bias as seen in VQA [1]; counter-examples can be used as a modality for model explainability.
Limitations: Lacks questions on general knowledge; insufficient reasoning-based questions, especially on synthetic data; question category biases(1) might result in poor real-world performance [109].

TDIUC [109]
Contributions: Wide category of questions with a balanced distribution; absurd/meaningless questions for image-based reasoning; evaluation strategy counters question-category biases.
Limitations: Similar questions belonging to a certain category, especially colors; the majority (around 40%) are binary questions on object presence; manual annotations come from a small sample space.

Table 4: Contributions and limitations of traditional VQA datasets

1 An abundance of a certain category of questions, like "Is/Are" questions, will result in models being trained better in that particular question category.
Open knowledge-based datasets have been popular evaluation benchmarks of the recent VQA models [123, 124], few-shot models [125, 126], zero-shot models [75, 111], and multimodal LLMs [127, 43]. The zero-shot setting closely aligns with the task of open-knowledge VQA as both models rely on predicting beyond what they have seen during training. The performance difference between standard evaluation and open knowledge evaluation is significantly lower for zero-shot models [75] than standard VQA models. We believe that future VQA datasets will draw inspiration from OK-VQA to establish the groundwork for generalized VQA evaluation.

5.2.3. KB-VQA on Named Entities (KB-VQA-NE)

Challenges in VQA attempt to replicate realistic scenarios in order to develop models mimicking human capabilities or going beyond that. Recognizing a person, object, or landmark is a commonly encountered real-life task. In this context, recognition refers to recalling the name of the entity. For instance, a picture of the Eiffel Tower is shown and asked, "What is the name of this tower?" with additional questions such as "Where is this tower located?". The equivalent KB-VQA problem settings have been termed Knowledge-aware VQA (KVQA) [128] and Knowledge-based VQA on named Entities (KVQAE) [129]. Instead of using multiple pre-established terms, we will refer to this class of datasets as KB-VQA-NE by simply adding the extension Named Entity (NE) to KB-VQA.

Shah et al. [128] introduced a large-scale dataset with 183k questions related to people from Wikidata [130]. A knowledge graph containing the relationships between entities was used to retrieve information on those particular entities. Lerner et al. [129] extended the setting to landmarks and named objects along with highlighting few-shot and zero-shot methods in the sub-domain. KB-VQA-NE datasets can also be considered a reformulation of the face-recognition and object-recognition problems in computer vision for VQA systems. However, questions provide the flexibility of performing different tasks and can make KB-VQA-NE systems an excellent choice for realistic use cases.

5.3. Reasoning and Bias Reduction (RBR) Datasets

With the advent of large standardized VQA datasets [1], people attempted to criticize the benchmarks by highlighting the underwhelming performance of the associated models on complex reasoning-based questions [11]. As the training data didn't include complex questions, the trained models were unable to comprehend the new challenge. Several datasets [11, 12, 131] were proposed to enhance the reasoning capabilities of the models and incorporate challenges related to visual and textual reasoning. Some of the corresponding datasets are often termed diagnostic datasets [11] but will simply be called reasoning datasets in this literature.

VQA datasets aim to train models with a comprehensive understanding of both visual and textual modalities but without exploiting any other correlation in the training data. As previously discussed in section-1, datasets can exhibit different forms of association within their data entries and these associations can be easily captured by VQA models, potentially leading to a form of overfitting. Usually, the models learn to correlate questions to answers based on the answer distribution of certain question categories and hence exhibit linguistic bias - a form of multimodal shortcut.

The linguistic bias can also be viewed as the model's inability to understand or extract features from the image. The associated model can be analogous to a blind model, i.e. the visual input doesn't affect the generated answer. The model's inability could be attributed to either the model's architectural incompetence or the dataset's inability to train the model's visual reasoning. Several works [100] were presented to highlight the linguistic bias and imbalance of the benchmarks [1, 24], one of them being the popular iteration VQA v2 [10].

The seemingly unrelated concepts of bias and inability to reason are, in fact, two sides of the same coin. A biased model will take various multimodal shortcuts to conclude a particular answer that will subsequently cause the model's inaptitude for reasoning. Similarly, a model incapable of reasoning will rely on multimodal shortcuts to generate the desired answer. Dataset redistribution might mitigate the linguistic bias, resulting in better visual and textual comprehension [132, 95]. Recent surveys [133] also group datasets on bias and multimodal shortcuts in VQA with commonsense and reasoning datasets [12, 134].

5.3.1. Reasoning-based Datasets

Andreas et al. [135] introduced a synthetic dataset comprising various arrangements of colored shapes with compositional questions to evaluate visual reasoning. Johnson et al. [11] introduced the popular CLEVR dataset, termed a diagnostic dataset. The dataset was instrumental in highlighting several drawbacks of VQA models trained on the traditional datasets [1, 24]. The CLEVR dataset has been extended to various other tasks including referring expressions [136], visual dialog [137], explainable AI [138], natural language explanations [139], domain robustness [140], and more.

In the following years, several large-scale datasets [12, 7] aimed to replicate the generalization capabilities of established benchmarks [1, 85], while incorporating reasoning capabilities. Zellers et al. [7] introduced a novel task termed Visual Commonsense Reasoning that emphasizes the rationale along with the answer. The GQA dataset [12] addressed both reasoning and linguistic bias along with being a large-scale generalized benchmark. The dataset has been a popular choice of evaluation for state-of-the-art VQA models and zero-shot methods. Zhang et al. [13] used questions from general intelligence tests. Bitton-Guetta et al. [141] used weird, unconventional images to test a model's reasoning capabilities.

5.3.2. Bias Reduction Datasets

Zhang et al. [100] simplified the linguistic bias and imbalance in VQA datasets by introducing a binary classification problem on the VQA abstract dataset [1]. The popular VQA v2 [10] can also be considered a bias-reduction dataset as it reduces the imbalance in the VQA v1 dataset [1] using counterexamples.
KB-VQA [14] (2015) - 700 images, 3-5 questions/image - Images: natural images from the MS-COCO [67] val set covering 150 objects and 100 scenes - QA: generated by 5 human annotators based on 23 templates - KB/KG: DBpedia [113] - Contributions: introduces the KB test setting of VQA models; multi-reasoning on image and KB.

FVQA [15] (2018) - 2,190 images, 5,826 questions - Images: natural images from the MS-COCO val set and ImageNet [142] test set - QA: collected from 38 annotators and encompassing 32 types of questions - KB/KG: DBpedia, ConceptNet [115], WebChild [114] - Contributions: larger dataset with multiple knowledge bases; supporting facts enable answering complex questions that require deep reasoning.

R-VQA [116] (2018) - 335k instances of natural images, QA pairs, and relation facts from Visual Genome [85]; facts are filtered using a ranking algorithm and evaluated by humans - Contributions: ensures retrieval of relevant concepts; utilization of semantic knowledge in images.

KVQA [128] (2019) - 24k images, 183k questions - Images: natural images of 18k people from Wikidata [130] - QA: templated questions annotated by humans - KB/KG: support set from Wikidata - Contributions: introduces KB-VQA on named entities; reasoning on KGs to answer KB questions; largest KG-based VQA dataset.

OK-VQA [110] (2019) - 14,031 images, 14,055 questions - Images: natural images randomly sampled from MS-COCO - QA: generated by AMT workers with 5 answer labels and then filtered to KB questions only - KB/KG: none (open knowledge) - Contributions: knowledge-base independent setting; large-scale and difficult dataset; diverse knowledge categories.

OK-VQAS3 [118] (2021) - 2,640 questions - QA: reannotated subset of OK-VQA - KB/KG: none (open knowledge) - Contributions: rigorous filtering ensures high-quality QA; answerable only by consulting a KB/KG; prevents guessing from the answer distribution.

S3VQA [118] (2021) - 6,765 questions - Images: natural images from the OpenImages collection [143] - QA: automatically generated on Wiki pages using the T5 model [144], exactly 1 answer/question - KB/KG: none (open knowledge) - Contributions: enforces information retrieval; naturally explainable approach.

A-OKVQA (2022) - 23,692 images, 24,903 questions - Images: natural images from MS-COCO - QA: QA and rationale annotated by 437 AMT workers - KB/KG: none (open knowledge) - Contributions: rationale enhances reasoning, information retrieval, and model explainability.

ViQuAE (2022) - 3.7k images, 3.3k questions - Images: natural images from Wikimedia Commons [145] - QA: automatic annotations on TriviaQA [146] from KILT [147], manually rephrased and filtered - Contributions: covers a wide range of entities for VQA on Wikipedia named entities, extending KVQA; diverse entity types.

FVQA 2.0 (2023) - 9,899* questions - Images: natural images from MS-COCO and ImageNet - QA: automatically generated using question templates from KG triplets and filtered manually - KB/KG: DBpedia, ConceptNet, WebChild - Contributions: reduces language bias due to question patterns and answer distribution; improves robustness through augmentation.

Table 5: Knowledge-based VQA datasets. KB - Knowledge Base, KG - Knowledge Graph, AMT - Amazon Mechanical Turk
To grasp the concept of counterexamples, let's consider a dataset where there is an image of a man wearing glasses with the associated question, "Who is wearing the glasses?" and answer, "The Man". Hence, there is an imbalanced distribution of men wearing glasses, resulting in the model associating glasses with men. A counterexample of a woman wearing glasses with the associated QA pair can help mitigate the dataset imbalance.

Redistributing the VQA datasets has been a popular choice to address biases. Redistribution means splitting the training, validation, and test sets differently. Agrawal et al. [95] ensured different answer distributions in the training and val-test splits. The authors of [132] redistributed by identifying training counterexamples from the val-test splits. Chen et al. [149] proposed a redistribution of the F-VQA dataset [15] in order to evaluate models in zero-shot settings.

5.3.3. Counting Datasets
Counting-based challenges in VQA are closely related to the reasoning challenges. In fact, many reasoning datasets [11] employ questions that require counting different objects to evaluate visual reasoning. Trott et al. [150] proposed a counting dataset by filtering "how many" questions from traditional datasets [10, 85] and highlighted the limitations of VQA models in counting. Acharya et al. [73] extended the previous work by incorporating complex and manually annotated counting QAs with associated images. VQA models are still struggling with counting-based questions, opening opportunities for improvement.

5.4. Vis-Text Datasets
The task of VQA primarily used a single image as the visual input. With the increase in architectural complexity, researchers attempted to test the models on other vision-related tasks. The scope of VQA was broadened to any form of image including diagrams [53], charts [22], infographics [23], slides [57], etc. Throughout the literature, the term visualization is used as an umbrella term to represent any kind of visual representation including but not limited to figures, charts, graphs, plots, diagrams, infographics, digital graphics, and slides. Similar works highlighting the inability of models to interpret texts in a natural image [19, 153] also grabbed the attention of researchers.

Both the tasks of understanding visualizations and interpreting texts in natural scenes require some form of text-reading skill. While comprehending figures requires a higher level of structural understanding, works on document interpretation [21] require a higher level of text recognition. Document-processing VQA systems are likely to benefit from external modules like Optical Character Recognition (OCR). However, as the similarities between figure VQA and text VQA outweigh their dissimilarities, they have been grouped together.

5.4.1. Visualization Datasets
Siegel et al. [155] worked on parsing figures from research papers and established the groundwork for Vis-Text VQA. Kembhavi et al. [53] proposed the task of interpreting scientific diagrams for QA using parse graphs. Digital chart interpretation was emphasized in the subsequent years as Kahou et al. [151] worked with 5 diagram classes - horizontal and vertical bar, line, dotted line, and pie charts. Methani et al. [17] and Masry et al. [22] shifted the focus to real-world charts that were either extracted or generated from actual data. Chaudhry et al. [154] evaluated reasoning capabilities on figures by using GRE-based questions, drastically increasing the task complexity.

Several datasets took a specialized approach to specific types of visualization. For instance, Kafle et al. [152] studied bar charts only, for structural understanding and reasoning on a particular form of diagram. Similarly, Mathew et al. [23] worked on infographics, which are significantly harder to analyze due to the content variety and the dominance of text in the image. Comprehending infographics requires high-level understanding and reasoning over both visual and textual content in an image.

Kembhavi et al. [55] explored a unique QA task on middle-school science lessons. The evaluation MCQs at the end of a lesson were used as the questions. The dataset enables assessing the current state of machine intelligence in academics.

5.4.2. Text VQA
Although most visualizations contain a good amount of textual content, the models trained for QA on visualizations emphasize the structural understanding of the image and the relationship with the visualization entities. A separate set of VQA datasets focuses on reading and understanding text either from real scenes or digital images of texts. Singh et al. [153] and Biten et al. [19] proposed datasets with natural images of texts in realistic scenes extracted from large natural image datasets. Mathew et al. [21] considered images of documents for QA in order to extensively test the reading capabilities of VQA models. A similar task was previously proposed by Mishra et al. [18] on images of book covers. Their work relied on using an Optical Character Recognition (OCR) module and was later extended by Zeng et al. [156] in a more robust setting. Tanaka et al. [57] proposed the unique task of performing QA on slide decks. Both documents and slide decks require a high-level understanding of non-textual content like diagrams, charts, and table structures along with the textual content, removing the boundaries between visualization and text-based datasets.

5.5. Miscellaneous Datasets
As seen in fig-1, VQA has been diversified, venturing into different domains and modalities. The trending sub-domain VideoQA [3, 4] extends the classical visual input from spatial to spatiotemporal. Furthermore, VideoQA expands the scope of VQA by incorporating other modalities, e.g., audio and knowledge base. Datasets on assisting the visually impaired and on medical images were explored in section-2.1 and 2.2 respectively. Yuan et al. [74] changed the task setting of VQA by asking change-related questions on a pair of images, thereby merging VQA with Change Detection. Bansal et al. [2] extended the VQA setting to a set of images, unlike the traditional setting restricted to a single image.
Name | Year | Description | Contributions
SHAPES [135] | 2016 | Reasoning-based synthetic dataset | Evaluates understanding of spatial relations and complex questions
CLEVR [11] | 2017 | Reasoning-based synthetic dataset | Compositional questions to test complex visual and logical reasoning
IconQA [131] | 2021 | Reasoning-based synthetic dataset | Incorporates diagram understanding and basic knowledge
VCR [7] | 2019 | Large-scale reasoning-based dataset from movie scenes | Introduces the new task of producing a rationale behind an answer; generalized benchmark for evaluation of common-sense reasoning of VQA systems
GQA [12] | 2019 | Large-scale reasoning-based dataset from Visual Genome [85] | Controls the answer distribution to address language bias; multi-step reasoning-based and compositional questions combine bias reduction and reasoning challenges
RAVEN [13] | 2019 | RPM [148] problem-based reasoning dataset | Prioritizes visual reasoning over visual recognition; addresses compositional reasoning and visual memory
WHOOPS! [141] | 2023 | AI-generated reasoning-based dataset | Challenges reasoning on unconventional images
Yin-Yang [100] | 2016 | Reannotates VQA abstract [1] for binary classification | Addresses language bias, allowing models to focus on complex semantics and gain a better image understanding
VQA-CP [95] | 2018 | Redistribution of VQA datasets [1, 10] | Reduces textual bias through different answer distributions
VQA-CE [132] | 2021 | Redistribution of VQA datasets [1, 10] | Detection of multimodal shortcuts in VQA datasets; evaluation of robustness based on the aforementioned shortcuts
ZS-F-VQA [149] | 2021 | Redistribution of the F-VQA dataset [15] | Evaluation of VQA models in zero-shot settings
HowMany-QA [150] | 2018 | Subset of VQA v2 [10] and Visual Genome | Challenges VQA systems using questions that require counting different objects in an image
TallyQA [73] | 2019 | Manually annotated and imported dataset | Includes complex and novel counting-based questions

Table 6: Reasoning and Bias Reduction Datasets in VQA

5.5.1. Video Question Answering (VideoQA)
VideoQA emerged as a popular subfield of VQA with more datasets being proposed to expand the modalities. Tapaswi et al. [157] proposed QA on movies and Lei et al. [5] proposed the same task on TV series - both works process videos with an audio feed. However, a lack of audio-related questions led to Yang et al. [158] proposing a dataset to specifically address audio-visual questions. Xu et al. [3] introduced two popular VideoQA datasets generated from large-scale video description datasets. Garcia et al. [159] introduced a knowledge-based setting for VideoQA similar to KB-VQA.

Other interesting settings of VideoQA are gameplay videos by Mun et al. [160] and GIFs by Jang et al. [64]. Das et al. [66] formulated the task of embodiedQA, discussed elaborately in sec-9.2.5, as an extension of VideoQA and an intersection of VQA with Reinforcement Learning (RL) by proposing the EQA dataset. VideoQA is a promising field to bring forward new datasets and challenges. An in-depth literature review on VideoQA is beyond the scope of this paper and readers are encouraged to study the VideoQA survey by Zhong et al. [4].

6. Methods
Over the years, VQA architectures evolved from non-deep learning paradigms relying on probabilistic techniques [93] to large-scale monolithic architectures [123]. This section explores the methodologies in traditional single-image VQA by highlighting the fundamentals, advanced techniques, and evolution throughout the history of the domain.

6.1. Fundamental Techniques
VQA has been approached in a multitude of ways but the standard approach can be broken down into three separate phases: feature extraction, feature conjugation, and answer generation.

Feature extraction aims to extract meaningful information from the multi-modal visual and textual inputs. While most architectures rely on two separate encoders for the two modalities, a few contemporary architectures employ a single multimodal encoder for feature extraction. If a monolithic architecture has not been used, the next phase will be feature conjugation, which deals with the aggregation or combination of the unimodal features extracted in the previous phase. The fused output is forwarded for answer generation, which can either be a classifier [eq-6] or a natural language generator [eq-7] depending on the problem formulation.

For the VQA model, M : (V, Q) → A, we define the visual and textual encoders as enc_v and enc_q respectively. The encodings are then combined using multimodal fusion, Φ, discussed in section-6.1.3. For the visual features, v_f, and textual features, q_f, the visual and textual encodings can be represented as,

h_v = enc_v(v_f)    (2)
h_q = enc_q(q_f)    (3)
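To make the three-phase formulation concrete before the individual phases are discussed, the following minimal PyTorch-style sketch wires a generic visual encoder, textual encoder, fusion operator, and answer head together. The module and argument names (visual_encoder, text_encoder, etc.) are illustrative placeholders rather than components of any specific surveyed model.

import torch
import torch.nn as nn

class JointEmbeddingVQA(nn.Module):
    """Generic three-phase VQA model: encode (eq-2, eq-3), fuse (eq-4), answer (eq-5/eq-6)."""
    def __init__(self, visual_encoder, text_encoder, hidden_dim, num_answers):
        super().__init__()
        self.visual_encoder = visual_encoder              # enc_v: image -> h_v
        self.text_encoder = text_encoder                  # enc_q: question -> h_q
        self.classifier = nn.Linear(hidden_dim, num_answers)  # answer generator as a classifier

    def fuse(self, h_v, h_q):
        # Feature conjugation Phi; here element-wise multiplication (Hadamard product)
        return h_v * h_q

    def forward(self, image, question):
        h_v = self.visual_encoder(image)                  # eq-2
        h_q = self.text_encoder(question)                 # eq-3
        h = self.fuse(h_v, h_q)                           # eq-4
        return self.classifier(h)                         # logits over a fixed answer set

# Example with dummy linear encoders standing in for a CNN and an RNN:
model = JointEmbeddingVQA(nn.Linear(2048, 512), nn.Linear(300, 512), 512, 3000)
logits = model(torch.randn(4, 2048), torch.randn(4, 300))  # shape (4, 3000)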
Table 7: Figure and Text-based Datasets in VQA

Name | Year | #Img | #Ques | Description | Contributions
AI2D [53] | 2016 | 5k | 15k | Multiple choice (MC) manually annotated QA on scientific diagrams scraped from Google | Task involving diagram structure, elements, and relationships using parse graphs; complex visual input with higher levels of element relationship increases task difficulty
FigureQA [151] | 2017 | 100k | 1M | Template-based QA on digital charts from 5 diagram classes | Logical linking between multiple plot elements of different types of graphs; additional data for secondary objectives^a
TQA [55] | 2017 | 1k | 26k | Images from middle school science lessons with MCQs given in the lesson | Variations in questions based on text, diagrams, or both with high difficulty levels; questions with additional textual context evaluate high-level reasoning
DVQA [152] | 2018 | 3M | 3M | Template-based QA on digital bar charts | Structural understanding, retrieval of data, and reasoning on bar charts
TextVQA [153] | 2019 | 28k | 45k | Manually annotated images of texts in real scenes from natural image datasets | 3-stage crowdsourcing pipeline ensuring high-quality images with texts; variety in questions for a particular image
ST-VQA [19] | 2019 | 23k | 31.8k | Manually annotated images of texts in real scenes from natural image datasets | Addresses language bias due to strong answer dependency on texts in scenes; lower evaluation ambiguity
OCR-VQA [18] | 2019 | 207k | 1M | Template-based QA on book cover images | Intersection of Optical Character Recognition (OCR) and VQA tasks
PlotQA [17] | 2020 | 224k | 28.9M | Semi-automated QA on real-world plots | Data label variation with real-world QA pairs; questions with out-of-vocabulary words
Leaf-QA [154] | 2020 | 250k | 2M | Digital images of charts generated from public data sources with template-based QA^b | Real-world diagrams on multiple categories; structural and relational QA pairs; out-of-vocabulary answers; GRE-based questions increase complexity
DocVQA [21] | 2021 | 12k | 50k | Manually annotated images of documents from the UCSF Industry Documents Library^1 | Increased task complexity due to the inclusion of document elements like tables, charts, forms, etc.
InfographicVQA [23] | 2021 | 5.4k | 30k | Digital images of infographics scraped from the internet with manual QA annotations | Annotation verification to ensure high quality; emphasizes questions requiring basic reasoning and arithmetic
ChartQA [22] | 2022 | 21.9k^c | 32.7k^d | Hybrid annotation on digital charts crawled from online sources | Reasoning at both visual and logical levels; hybrid annotation allows training flexibility and comparison of annotation quality
SlideVQA [57] | 2023 | 2.6k | 14.5k | Manual annotation on slide decks of 20 slides from slideshare^2 | Sequential format of slides introduces new challenges for current VQA models; single/multi-hop and numerical reasoning

a: Additional data, including numerical values used for chart generation, bounding box annotations of plot elements, etc., can be used for objectives like Visual Grounding.
b: Multiple paraphrases of the question templates were generated using Google Translate and one of them was randomly selected.
c: ChartQA-H (human annotated) has 4.8k images and ChartQA-M (automatically annotated) has 17.1k images.
d: ChartQA-H has 9.6k questions and ChartQA-M has 23.1k questions.

1: https://www.industrydocuments.ucsf.edu/
2: https://www.slideshare.net/
Table 8: Miscellaneous Datasets in VQA

Task Category | Name | Year | Description
Video | MSRVTT-QA, MSVD-QA [3] | 2017 | Automatic open-ended QA from video description datasets
Multimodal Video^a | MovieQA [157] | 2016 | MCQ-based questions on movies
Multimodal Video^a | TVQA [5] | 2018 | MCQ-based questions on TV series
KB Video | KnowIT VQA [159] | 2020 | MCQ-based questions on a single TV series
Gameplay Video | MarioQA [160] | 2017 | Open-ended QA on gameplay videos of a 2D Mario game
Frame and Video | TGIF-QA [64] | 2017 | Hybrid annotation on the Tumblr GIF (TGIF) dataset [161]
Visually Impaired Assistance | VizWiz [36] | 2018 | Conversational questions on realistic images captured by visually impaired people
Medical VQA | VQA-Med [97] | 2018 | Open-ended QA on radiology images
Medical VQA | PMC-VQA [162] | 2023 | MCQ-based questions on a mixture of modalities^b
Image Set | IS-VQA [2] | 2021 | Open-ended QA on indoor and outdoor scenes
Change Detection | CD-VQA [74] | 2022 | Automatically generated QA on the SECOND dataset [163]

a: Standard VideoQA does not have an audio feed, while multimodal videos include the audio modality.
b: "Modality" refers to the types of medical images, e.g., radiology, pathology, etc.

h = Φ(h_v, h_q)    (4)

Finally, the fused output is passed to the answer generator to produce the answer, a, based on the formulation in section-6.1.4.

a = gen_a(h)    (5)

It should be noted that this is a basic formulation of the traditional VQA architectures relying on the fundamental techniques only, while subsequent models have incorporated advanced techniques like the utilization of attention and memory. The technique is often termed Joint Embedding as the generated answer is based on the joint visual and textual embedding. The following subsections will elaborate on feature extraction, feature conjugation, and answer generation.

6.1.1. Visual Feature Extraction
The visual input is high-dimensional data that is costly to process. Visual feature extraction captures the important information from the high-dimensional visual input and produces a dimensionality-reduced vector representation of the image. The vector representation should hold an abridged form of information mapped to a space where similar features in different images produce similar vectors. Thus, feature extraction ensures that the semantic components of the visual input are conveyed to the model. Furthermore, transforming the visual input into a vector enables mathematical manipulation and simplifies the process of grouping similar inputs.

In the era predating the widespread adoption of deep learning models, methods like explicit RGB vectors, SVM [164], HAAR [165, 166], HOG [167], SIFT [168], singular values [169], PCA [170], or simply hand-crafting kernels to extract characteristics [171, 172] were functional to a certain degree on specific tasks. While these methods were easily computable, they lacked strong generalization capabilities. With the onset of deep learning, neural networks [173, 174, 175] have become very popular among researchers for retrieving image features. Convolutional Neural Networks (CNNs) [176, 177] significantly outperformed regular feed-forward neural networks for image feature extraction. Over the years, CNNs boasted a higher number of layers with a larger parameter count. Models like LeNet [177], AlexNet [176], VGG [178], ResNet [179], and InceptionNet [180] are just a few examples — all fueled by the ImageNet [142] competition.

Several VQA models [24, 1] use CNN models for visual feature extraction and employ transfer learning to some degree. Transfer learning is an early concept in machine learning research [181], denoting how an ML model trained on one task can transfer the knowledge stored in its learned weights to another model being trained on a different task. This approach is noticeably more efficient than training another model from scratch. The model learns structural information and low-level features in the lower layers, and these can be transferred to other related tasks. Transfer learning is particularly useful for vision models as training such models from scratch requires a massive amount of training data, time, and computational resources, as documented by Szegedy et al. [180]. VQA models usually incorporate GoogLeNet [180], VGGNet [178], ResNet [179], and variants of R-CNN [90, 182, 183] as visual feature extractors. The CNN backbones are typically pre-trained on the ImageNet [142] or COCO [67] datasets.

The usual approach is freezing the whole pre-trained model except the last few layers and fine-tuning those layers on the target dataset [184]. The procedure significantly reduces the training time while requiring less computational resources and training data. The gradual shift from simpler architectures like GoogLeNet and VGGNet to later architectures like ResNet and R-CNN provides key insights into the requirements of image feature extraction. Both deeper architectures and object localization can drastically improve the quality of extracted visual information.
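As an illustration of the frozen-backbone transfer learning described above, the short sketch below extracts pooled image features with a torchvision ResNet whose weights are frozen. The choice of ResNet-50 and the 2048-dimensional pooled feature are illustrative defaults rather than the setting of any particular surveyed model.

import torch
import torchvision.models as models

# Load an ImageNet pre-trained ResNet-50 and drop its classification head.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()          # keep the 2048-d pooled feature
for p in backbone.parameters():
    p.requires_grad = False                # freeze the backbone (transfer learning)
backbone.eval()

with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)   # a dummy batch of RGB images
    h_v = backbone(images)                 # visual features, shape (4, 2048)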
[Figure 5 summarizes the traditional pipeline: a visual input encoded by an image encoder (VGG-Net, ResNet, GoogLeNet, or Faster R-CNN); a textual input mapped through a textual embedding (one-hot encoding, bag of words, Word2Vec, GloVe, or SkipThought) and a textual encoder (LSTM/BLSTM, GRU/BGRU, or CNN); attention (single-hop or multi-hop; visual, textual, or co-attention); multimodal fusion (vector operations EWM/EWA/VC, NN-based LSTM/GRU/CNN, or bilinear pooling MCB/MFB); and an answer generator implemented as a classifier over N classes producing a textual answer, e.g., "Giraffe" for "What is the name of the animal?".]

Figure 5: Overview of the traditional pre-transformer VQA architecture based on Joint Embedding and Attention. CNN-RNN-based encoder pairs, Multimodal Fusion, and a classification head were used.

A recent development in computer vision is the emergence of vision transformers by Dosovitskiy et al. [185], which break an image into 16×16 image patches and feed them to a Transformer architecture [28]. Liu et al. [186] later proposed the Swin Transformer, which uses a shifting window to apply self-attention, mimicking the convolution operation of CNNs. The architecture offers relatively better performance in tasks that require detailed feature extraction, e.g., semantic segmentation and object detection.

However, convolutions are still relevant as Liu et al. [187] proposed the ConvNeXt architecture that can perform on par with or better than transformer-based models in tasks including classification, object detection, and semantic segmentation. The architecture borrows certain ideas from modern transformer models and applies them to a classic CNN-based architecture. The aforementioned backbones are widely used by modern VQA models for visual feature extraction as these methods are proven to be quite effective [188, 189, 190]. Although contemporary visual feature extraction techniques may lead to better results, they require a significant amount of time and computational resources.

6.1.2. Textual Feature Extraction
Textual data from the question should be processed to create vector representations of the words or sentences. Researchers have been seeking ways to improve such vector representations, collectively called word embeddings. Unlike images containing spatial data that can be easily represented in the vector space, texts carry semantic data, which makes a vector space representation challenging.

One of the earlier approaches is to represent each word as a one-hot vector, where each of the n words in the corpus can be represented using an n-dimensional vector. However, this simple representation assumes that every word is independent of every other and disregards the similarity between word vectors. Hence, every word vector will be orthogonal in the vector space. Additionally, as the vector size depends on the vocabulary length, the representation can become quite long for larger vocabularies. Count-based methods were also popular in the early days of NLP. These methods constructed a matrix based on the frequency of word occurrences [191], which can be further broken down by computing the singular value decomposition (SVD) [192] to get the best-fit approximation of lower dimensions. However, these earlier representations are limited by their inability to capture semantic similarity between words.

If the semantic information is properly captured, words with similar meanings should be close in the vector space. Prediction-based models were employed to predict the next word given a current context. The associated probability can be represented as P(x_t | x_{t-1}, ..., x_1) where x_t represents the t-th word. However, it is rare to use all the previous words as the context, and models like n-gram use the last n words as the context. Consequently, the model learns the word representations by expressing each word in a k-dimensional space and capturing the semantic information.
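The contrast between one-hot vectors and dense, similarity-preserving representations can be sketched in a few lines of NumPy; the toy corpus and the sentence-level co-occurrence counts below are purely illustrative.

import numpy as np

corpus = ["the cat sits", "the dog sits", "a cat sleeps"]
vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}

# One-hot vectors: every pair of distinct words is orthogonal (similarity 0).
one_hot = np.eye(len(vocab))
print(one_hot[idx["cat"]] @ one_hot[idx["dog"]])    # 0.0

# Count-based alternative: a co-occurrence matrix within each sentence,
# factorized with a truncated SVD to obtain low-dimensional embeddings.
C = np.zeros((len(vocab), len(vocab)))
for s in corpus:
    words = s.split()
    for w in words:
        for c in words:
            if w != c:
                C[idx[w], idx[c]] += 1
U, S, _ = np.linalg.svd(C)
emb = U[:, :2] * S[:2]                               # 2-d embeddings
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
print(cos(emb[idx["cat"]], emb[idx["dog"]]))         # words with similar contexts end up close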
Earlier works using neural networks [193, 194] paved the way for sequential models such as Recurrent Neural Networks (RNNs) [195] for linguistic representations [196]. Methods like Word2Vec [197], CBOW, and Skip-Gram [196] were developed to learn word representations. These models were trained to predict a masked word given the context surrounding it. Through these predictions, the models learn vector representations of words that are later used to extract textual features. Further research on the RNN architecture provided us with GRUs [198] and LSTMs [199] — widely popular in VQA for question feature extraction as they are able to preserve long-range contextual information. The recent transformer architecture [28] outperforms regular LSTMs and GRUs and has been widely adopted in VQA [200, 201, 202].

6.1.3. Feature Conjugation or Multimodal Fusion
After extracting the visual and textual features from the separate streams, the model has to combine them to form a joint feature vector. Before the feature conjugation phase, the visual and textual inputs are processed more or less independently. Feature conjugation is also called fusion and is a form of multimodal fusion that deals with combining vectors from two separate modalities into a single modality. For VQA, the multimodal fusion usually aggregates the visual and textual inputs but might also fuse auditory data for sub-domains like Multimodal VideoQA.

Generally, the input streams are independent of each other before fusion. However, techniques like attention might result in the streams affecting each other. For instance, Chen et al. [203] used question-guided attention, i.e., based on the textual input, a form of attention is produced on the visual input. However, attention isn't a form of fusion. In fact, the easiest way to identify the fusion phase is to recognize the step from which the visual and textual features lose their own identity and cannot be differentiated from one another. Before feature conjugation, both the visual and textual features can be used in another network for another task unrelated to VQA. After feature conjugation, the modalities are merged and the resulting output encapsulates multimodal information of the two input streams.

Vector operations have been popular choices for feature conjugation. The operations include Vector Concatenation (VC) [204, 205, 206], Element-Wise Multiplication (EWM) [1, 101, 207], and Element-Wise Addition (EWA) [203, 15, 208]. These methodologies produce a simple yet effective joint embedding that can be further passed through the latter layers of the model. Apart from vector operations, neural network-based approaches like using a CNN or RNN-based network can also be used for fusion. Bilinear pooling has been a popular choice for multimodal fusion; prominent works include Multimodal Compact Bilinear Pooling (MCB) [209] for VQA, while MUTAN [210] addresses multimodal Tucker fusion.

6.1.4. Answer Generation
VQA can either be treated as a classification problem if the model has to pick an answer from a fixed set of answers or a generative problem if the model has to generate a natural language answer. When treated as a classifier or discriminative model, a single-word answer is usually generated. Evaluation of such answers tends to be easier using simple evaluation metrics like VQA accuracy [1]. On the other hand, generative models often produce longer answers that are tough to evaluate.

For a visual input v, a question input q, and a generated answer â, we consider a VQA model, M : (V, Q) → A, with parameters θ. Depending on the problem formulation, the model can either be treated as a classifier or a generator.

To generate the answer â from a set of answers A, the formulation of VQA as a classification task can be defined as:

â = arg max_{a∈A} p(a | v, q; θ)    (6)

For the task of answer generation, given a set of answer tokens â = {â_1, ..., â_n}, the generative formulation of VQA is defined as:

â_n = arg max_{a∈A} p(a | v, q, â_{<n}; θ)    (7)

6.2. Advanced Techniques
Although the fundamental techniques form the base of VQA, contemporary methodologies rely on several advanced techniques that can provide advantages including performance improvement, model scalability, robustness, etc. The realm of advanced VQA techniques can be overwhelming. In this subsection, we discuss a few of these advanced methods.

6.2.1. Attention
When an image or a text is given to a human, that person doesn't focus equally on the whole image or text. Depending on the content, the person might focus more on a particular region of the image or a particular portion of the text. A similar mechanism is employed in VQA following the success of attention in standard computer vision tasks like object recognition [228] and multimodal tasks like image captioning [229, 8]. Attention can be described as soft weights, where the model learns to focus more on certain parts of the input.

While there are multiple ways of implementing attention, the mechanism fundamentally relies on generating an attention vector by correlating the modalities, followed by some form of normalization. The entire process is typically referred to as a single attention layer. Depending on the architecture, models can incorporate single or multiple attention layers. Certain types of multi-layer attention may depend on the attention output of the previous layers. Attention mechanisms can be classified based on the number of attention layers as single-hop and multi-hop attention. Each category can be further classified based on attention modality as visual attention, textual attention, and co-attention. Zhang et al. [77] extensively analyzed the formulation of various forms of attention in the aforementioned categories.

Lu et al. [218] utilize image-question co-attention to join the embeddings. Word-to-region attention [230] bridges the gap between keywords detected in questions and image regions. Anderson et al. [223] introduced bottom-up top-down attention.
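The two problem formulations in eq-6 and eq-7 can be contrasted with a short sketch: a classifier picks the highest-scoring answer from a fixed answer vocabulary, while a generator decodes the answer token by token. The scoring functions below are stand-ins for a trained model, used only to make the control flow explicit.

import torch

answer_vocab = ["yes", "no", "giraffe", "2", "red"]

def classification_answer(logits):
    # eq-6: a_hat = argmax_a p(a | v, q; theta) over a fixed answer set
    return answer_vocab[int(torch.argmax(logits))]

def generative_answer(step_fn, max_len=5, eos=0):
    # eq-7: greedy decoding, conditioning each token on the tokens decoded so far
    tokens = []
    for _ in range(max_len):
        logits = step_fn(tokens)          # p(a | v, q, a_<n; theta)
        nxt = int(torch.argmax(logits))
        if nxt == eos:
            break
        tokens.append(nxt)
    return tokens

# Dummy usage with random scores standing in for a trained VQA model:
print(classification_answer(torch.randn(len(answer_vocab))))
print(generative_answer(lambda prev: torch.randn(10)))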
Name | Image Encoder | Text Encoder | Fusion Strategy | Answer Generator | Comments
Neural-Image-QA [25] | GoogLeNet | LSTM | – | LSTM | First deep learning approach
VIS+LSTM [24] | VGGNet | LSTM | – | Softmax | COCO-QA baseline
mQA [105] | GoogLeNet | LSTM | EWA | LSTM, Softmax | FM-IQA baseline
LSTM Q+I [1] | VGGNet | LSTM | EWM | Softmax | VQA baseline
ABC-CNN [203] | VGGNet | LSTM | EWA | Softmax | Introduces attention
iBOWIMG [204] | GoogLeNet | BoW | VC | Softmax | –
Full-CNN [211] | VGGNet | CNN | CNN | Softmax | –
LSTM-Att [101] | VGGNet | LSTM | EWM | LSTM, Softmax | Visual7W baseline
SMem-VQA [212] | GoogLeNet | BoW | EWA | Softmax | –
DPPNet [213] | VGGNet | GRU | DPL | Softmax | DPL - Dynamic Parameter Layer
Word + Region [214] | VGGNet | BoW | VC | Softmax | –
SAN [208] | VGGNet | CNN/LSTM | EWA | Softmax | Uses multiple attention layers
MRN [215] | VGGNet, ResNet | GRU | EWA | Softmax | –
DAN [216] | VGGNet, ResNet | BLSTM | EWM | Softmax | –
MCB [209] | ResNet | LSTM | BL | Softmax | Introduces bilinear pooling
MLB [217] | ResNet | GRU | BL | Softmax | –
HieCoAtt [218] | VGGNet, ResNet | LSTM | EWA | Softmax | Introduces co-attention
DMN+ [219] | VGGNet | BGRU | – | Softmax | –
Attr-CNN+LSTM [220] | VGGNet | LSTM | LSTM | LSTM | –
MFB [221] | ResNet | LSTM | BL | Softmax | –
MLAN [221] | ResNet | GRU | EWA | Softmax | –
MUTAN [210] | ResNet | GRU | BL | Softmax | –
SAAA [222] | ResNet | LSTM | VC | Softmax | –
Up-Down [223] | FR-CNN, ResNet | GRU | VC | Sigmoid | –
MFH [224] | ResNet | LSTM | BL | Softmax | –
DCN [225] | ResNet | BLSTM | VC | Sigmoid | –
Tips-Trick [207] | FR-CNN, ResNet | GRU | EWM | Sigmoid | Sigmoid outperforms Softmax
BAN [226] | FR-CNN | GRU | BL | Softmax | –
MCAN [227] | FR-CNN | LSTM | EWA | Softmax | –

Table 9: Architectural overview of traditional VQA methods preceding the transformer and VLP era. Some of the common techniques incorporated by these methods are Joint Embedding, Attention, Modular Network, and Bilinear Pooling (BL) Fusion.

Malinowski et al. [231] introduced "hard attention" to formulate visual and textual attention by filtering redundant visual features. The mechanism works as an effective feature selection method for visual features, showing great performance on cluttered and noisy data. Rahman et al. [232] utilize bounding boxes along with the visual input to extract visual features and compute attention over the existing attention vectors as a form of co-attention.

6.2.2. Transformer
Transformers form the foundation of contemporary VQA architectures. Vaswani et al. [28] proposed that layers of self-attention are sufficient for language encoding and can outperform state-of-the-art sequential models in NLP tasks like machine translation, language parsing, etc. The transformer architecture is fundamentally an encoder-decoder architecture based on a variant of attention called self-attention. The primary advantage of using transformers as text encoders is the parallelization of processing, which leads to significantly reduced training time. In contrast, the sequential processing of RNN variants [195, 199, 198] made training them considerably slow. Furthermore, the transformer architecture was able to capture long-range dependencies, surpassing the memory techniques used in LSTMs [199] and GRUs [198]. Similar to RNNs, the transformer architecture has several variants like BERT [233], GPT [59], RoBERTa [234], T5 [144], etc.

Transformers were not limited to the textual domain, as Dosovitskiy et al. [185] proposed the Vision Transformer (ViT), which became a popular choice for visual feature extraction. Furthermore, transformer-based models were widely employed for fusing the visual and textual modalities. Lu et al. [235] relied on Cross-Modal Transformers (CMT) using cross-attention in their fusion scheme. The architectures differed by the number of transformer layers or the type of attention used. The CLIP architecture [59] aligns vision-language modalities and has been a popular module in zero-shot VQA methods.

The underlying equations behind the transformer architecture and its derivatives have been discussed extensively in other literature. For the sake of conciseness, the associated equations will not be covered in our work. Similar to CNNs and RNNs, we will treat transformers as black boxes.
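Although the survey treats transformers as black boxes, the attention operation underlying both the question-guided attention of sec-6.2.1 and the self-attention layers discussed above can be written compactly. The sketch below computes soft attention weights over a set of image-region features conditioned on a question vector; the dimensions and the number of regions are illustrative.

import torch
import torch.nn.functional as F

def question_guided_attention(h_q, region_feats):
    """Soft (single-hop) attention: weight image regions by their relevance to the question.

    h_q:          (batch, d)     question encoding
    region_feats: (batch, k, d)  k regional visual features
    returns:      (batch, d)     attended visual feature
    """
    # Correlate the modalities: one score per region (scaled dot product).
    scores = torch.einsum("bd,bkd->bk", h_q, region_feats) / region_feats.size(-1) ** 0.5
    alpha = F.softmax(scores, dim=-1)                # normalization step
    return torch.einsum("bk,bkd->bd", alpha, region_feats)

# Dummy usage: 36 region features of size 512, e.g. as produced by a detection backbone.
attended = question_guided_attention(torch.randn(2, 512), torch.randn(2, 36, 512))
print(attended.shape)   # torch.Size([2, 512])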
6.2.3. Vision Language Pre-training (VLP)
Following the success of pre-trained models in Computer Vision (CV) and Natural Language Processing (NLP) [233, 59], work began on vision-language pre-training for tasks like Image Captioning, VQA, and Visual Reasoning. The goal of pre-training is to train a large-scale generalized model on multiple tasks using a large volume of data. The resulting pre-trained model is then fine-tuned on downstream tasks. Popular choices for pre-training tasks include Masked Language Modeling (MLM) [236, 233], Masked Vision Modeling (MVM) [237], and Vision-Language Matching (VLM) [238]. Apart from VQA, popular downstream vision-language tasks include Visual Captioning [239, 8], Visual Entailment [240], Visual Commonsense Reasoning [7], Visual Grounding [101], etc.; some of them will be discussed extensively in sec-9.2.

6.3. Taxonomy of VQA Architectures
Categorizing VQA architectures can be challenging due to the variety of approaches present in the literature. Earlier methods often adopted joint embedding architectures, where the two input streams are independently processed and fused. In contrast, modern VLP architectures incorporate transformer-based encoders, fusers, and decoders. Works [98] preceding the deep learning era will not be covered in the review. The domain of VQA can be subdivided into the traditional CNN-RNN-based, the subsequent CNN-BERT-based, and the VLP-based architectural paradigms that will be discussed in the following subsections.

6.3.1. CNN-RNN-based Architecture
Throughout the years, the foundations of VQA shifted from neural networks to transformers. Earlier deep-learning methods focused on answer generation from the fusion of visual and textual streams [24, 1]. The joint embedding scheme using two encoders is the foundation of traditional VQA architectures. In the pre-transformer era, VQA models usually relied on two separate feature extractors for the two modalities [204, 208] and incorporated some form of attention mechanism [203]. The visual and textual embeddings preserve their own identity before being fed to the multimodal fusion block. The answer generation module was generally formed of multiple fully connected layers with a classifier head to select the correct answer from the top k answers of the dataset [24, 1].

Deep learning-based methods started with CNN-RNN-based models where visual feature extraction was done with a variant of CNN [177] while textual encoding was performed using a variant of RNN [195]. CNNs are used due to the sparseness of the convolutional layers, making feature extraction more efficient for visual inputs. GoogLeNet [180], VGGNet [178], ResNet [179], etc. were popular CNN variants for visual encoding, followed by specialized object detection networks based on R-CNN [90] like Fast R-CNN [182] and Faster R-CNN [183]. The additional information provided by object detection networks resulted in richer visual features.

Recurrent Neural Networks (RNNs) [195] were popular in working with sequential data like texts. The two most popular variants of RNNs were Long Short-Term Memory networks (LSTMs) [199] and Gated Recurrent Units (GRUs) [198]. The gated memory used in both variants mitigated the vanishing gradient problem of vanilla RNNs. While LSTMs were popular choices for encoding the textual input in VQA [24, 1], GRUs gained popularity in the following years due to having fewer parameters and requiring less training time.

Following eq-2 and eq-3, we can rewrite the visual and textual encoding equations for CNN-RNN-based architectures as

h_v = CNN(v)    (8)
h_q = RNN(q)    (9)

During multimodal fusion, vector operations like Element-wise Addition (EWA), Element-wise Multiplication (EWM) or Hadamard product, and Vector Concatenation (VC) are used to join the modalities. Certain vector operations impose restrictions on the encoding dimensions. For instance, the dimensions of the visual and textual encodings must match to perform EWA and EWM. Although VC has no such restriction, it increases the dimension of the fused encoding. Following eq-4, the corresponding equations for EWA, EWM, and VC are:

Φ(h_v, h_q) = h_v + h_q    (10)
Φ(h_v, h_q) = h_v ⊙ h_q    (11)
Φ(h_v, h_q) = [h_v, h_q]    (12)

Neural network-based architectures like CNNs and LSTMs are also popular choices for fusing the multimodal input. Malinowski et al. [25] fed CNN features and the input text to an LSTM to generate the answer, where the LSTM used the image features as the previous state input. While CNNs have been less popular fusion modules, Ma et al. [211] proposed a VQA architecture using CNNs only. The corresponding fusion equations for LSTMs and CNNs are:

Φ(h_v, h_q) = LSTM(h_v, h_q)    (13)
Φ(h_v, h_q) = CNN(h_v, h_q)    (14)

Subsequent fusion strategies relied on bilinear pooling techniques like Multimodal Compact Bilinear pooling (MCB) [209], Multimodal Factorized Bilinear pooling (MFB) [221], Multimodal Low-rank Bilinear attention (MLB) by Kim et al. [217], MUTAN [210], Multi-modal Factorized High-order pooling (MFH) [224], etc. Apart from the fusion strategies, the traditional architectures relied on various forms of attention mechanisms discussed extensively in sec-6.2.1. Following sec-6.1.4, answer generation was primarily treated as a classification task using the softmax or sigmoid function [24], while many approaches used LSTMs for generative answers [25].
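The fusion operators in eq-10 to eq-12, together with a low-rank bilinear variant in the spirit of MLB [217], can be expressed in a few lines; the projection sizes below are arbitrary illustrations rather than the settings of any cited model.

import torch
import torch.nn as nn

h_v, h_q = torch.randn(4, 512), torch.randn(4, 512)

fused_ewa = h_v + h_q                      # eq-10: element-wise addition
fused_ewm = h_v * h_q                      # eq-11: element-wise multiplication (Hadamard)
fused_vc  = torch.cat([h_v, h_q], dim=-1)  # eq-12: concatenation; dimension doubles

# Low-rank bilinear fusion (MLB-style): project both modalities to a joint
# space and take the Hadamard product, approximating a full bilinear map.
U, V = nn.Linear(512, 256), nn.Linear(512, 256)
fused_bilinear = torch.tanh(U(h_v)) * torch.tanh(V(h_q))
print(fused_ewa.shape, fused_vc.shape, fused_bilinear.shape)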
[Figure 6 summarizes the generalized VLP pipeline: a visual feature extractor (CNN, Faster R-CNN, or flattened 2D patches) and a textual feature extractor (BERT, RoBERTa, ALBERT, or XLNet) produce visual and textual features that pass through self-attention blocks; a single-stream architecture applies joint self-attention over the concatenated tokens (Q_vt, K_vt, V_vt), while a dual-stream architecture applies cross-attention between the streams (Q_v with K_t, V_t and Q_t with K_v, V_v); feed-forward layers, multimodal fusion, and an answer decoder then produce the answer, e.g., "Giraffe" for "What is the name of the animal?".]

Figure 6: Generalized Vision Language Pre-training (VLP) network encapsulating both Single-Stream and Dual-Stream architectures. The Q, K, and V vectors represent the query, key, and value vectors respectively for the transformer architecture [28]. The associated subscripts v and t stand for the visual and textual inputs respectively. The architecture incorporates two feature extractors followed by transformer blocks for encoding. The output is followed by fusion and decoder modules that can be optional based on the architectural design.
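As a concrete illustration of the cross-attention used by the dual-stream branch in fig-6, the sketch below lets textual tokens attend to visual tokens using PyTorch's built-in multi-head attention; the layer composition, dimensions, and head count are placeholders, not the configuration of any surveyed VLP model.

import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One dual-stream fusion block: text queries attend over visual keys/values."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text_tokens, visual_tokens):
        # Q comes from the textual stream, K and V from the visual stream.
        attended, _ = self.cross_attn(text_tokens, visual_tokens, visual_tokens)
        x = self.norm1(text_tokens + attended)
        return self.norm2(x + self.ffn(x))

block = CrossModalBlock()
out = block(torch.randn(2, 20, 768), torch.randn(2, 197, 768))  # 20 word and 197 patch tokens
print(out.shape)   # torch.Size([2, 20, 768])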
Table 10: Architectural overview of VLP methods.

Model Name | Text Encoder | Visual Encoder | #Streams | Fusion Encoder/Scheme | Attention
ViLBERT [235] | BERT | FR-CNN | Dual | Cross-Modal Transformer | Co
VisualBERT [29] | BERT | FR-CNN^a | Single | BERT | Self
Unicoder-VL [30] | BERT | FR-CNN | Single | BERT | Self
LXMERT [241] | BERT | FR-CNN | Dual | Cross-Modal Transformer | Self + Cross
VL-BERT [242] | BERT | FR-CNN + ResNet | Single | BERT | Self
Unified-VLP [243] | UniLM | FR-CNN | Single | Unified Transformer | Self
UNITER [237] | BERT | FR-CNN | Single | BERT | Self
OSCAR [31] | BERT | FR-CNN | Single | BERT | Self
ViLT [244] | ViT | Patch Embedding | Single | ViT | Self
ALBEF [238] | BERT | ViT | Dual | Cross-Modal Transformer | Self + Cross
BLIP [123] | BERT | ViT | Dual | Cross-Modal Transformer | Self + Cross
VLMO [245] | WordPiece [246] | Patch Embedding | Single | MoME Transformer | Self
OFA [247] | Byte Pair Encoding | ResNet | Single | Unified Transformer | Self
Unified-IO [248] | SentencePiece [249] | VQ-GAN [250] | Single | T5-based [144] Transformer | Self
PaLI [124] | mT5 [251] | ViT | Single | Unified Transformer | Self

a: Based on the ResNeXt architecture [252]

Model Name | Y/N (Test-Dev) | Num (Test-Dev) | Other (Test-Dev) | All (Test-Dev) | All (Test-Std)
LSTM Q+I [24] | 80.5 | 36.8 | 43 | 57.8 | 58.2
SMem [2-Hop] [212] | 80.9 | 37.3 | 43.1 | 58 | 58.2
SAN [208] | 79.3 | 36.6 | 46.1 | 58.7 | 58.9
FDA [253] | 81.1 | 36.2 | 45.8 | 59.2 | 59.5
DMN+ [219] | 80.5 | 36.8 | 48.3 | 60.3 | 60.4
HierCoAtt [218] | 79.7 | 38.7 | 51.7 | 61.8 | 62.1
DPPnet [213] | 80.71 | 37.24 | 41.69 | 57.22 | 57.36
MRN [215] | 82.28 | 38.82 | 49.25 | 61.68 | 61.84
Deep Q+I [254] | 80.87 | 36.46 | 43.4 | 58.02 | 58.16
iBOWIMG [204] | 76.5 | 35 | 42.6 | 55.7 | 55.9
ACK [255] | 79.2 | 36.1 | 40.1 | 55.7 | 56
MCB [Ensemble-7] [209] | 83.4 | 39.8 | 58.5 | 66.7 | 66.5
MLB [Ensemble-7] [217] | 84.57 | 39.21 | 57.81 | 66.77 | 66.89
Dual-MFA [256] | 83.59 | 40.18 | 56.84 | 66.01 | 66.09
VQA-Machine [257] | 81.5 | 38.4 | 53 | 63.1 | 63.3
Neural-Image QA [25] | 78.4 | 36.4 | 46.3 | 58.4 | 58.4
NMN [135] | 81.2 | 38 | 44 | 58.6 | 58.7
DMN [258] | 81 | 38.4 | 45.2 | 59.2 | 59.4
MUTAN [Ensemble-5] [210] | 85.14 | 39.81 | 58.52 | 67.42 | 67.36
QGHC [259] | 83.54 | 38.06 | 57.1 | 65.89 | 65.9
D-NMN [260] | 81.1 | 38.6 | 45.5 | 59.4 | 59.4

Table 11: Performance analysis of traditional VQA models in the pre-transformer era. The models are evaluated on the test-dev and test-std splits of the VQA dataset [1].
6.3.2. CNN-BERT-based Architecture
Transformers revolutionized the landscape of natural language processing (NLP) and were quickly popularized for textual encoding in VQA. The VQA methodologies were remolded by a variant of the transformer architecture called Bidirectional Encoder Representations from Transformers (BERT) [233]. In the subsequent CNN-BERT-based architectures, BERT [233] became the primary choice for textual encoding along with a later variant of CNN [90, 182] for visual encoding. The language encoding equation based on eq-3 can be redefined for this architectural paradigm as

h_q = BERT(q)    (15)

Derivatives of the transformer like multi-modal transformers [245], Cross-Modal Transformers (CMTs), and BERT were employed for the fusion of the multimodal input streams. CMTs relied on bi-directional cross-attention along with self-attention. Cross-attention calculates the attention weights based on the inputs of one modality and their comparison with the inputs of the other modality.

6.3.3. VLP-based Architecture
The scope of transformers was not limited to the textual modality only, as Dosovitskiy et al. [185] showed promising results with the Vision Transformer (ViT) architecture for visual feature extraction. ViT showed scope for researchers to incorporate transformer-based vision-language extractors, introducing dual-transformer-based models and subsequently leading to the emergence of VLP architectures as discussed in section-6.2.3. Following eq-2, the visual encoding can be rewritten as,

h_v = ViT(v)    (16)

VLP architectures can be either dual-stream or single-stream depending on how the input modalities are treated. Dual-stream architectures rely on separate multimodal streams sent to independent transformer blocks, and the resulting encodings are fused at a subsequent phase. On the other hand, single-stream architectures feed the concatenated multimodal input to a single transformer block utilizing merged attention for fusion [27].

ViLT [244] uses a single ViT transformer as the primary visual and textual feature extractor, treating the whole input as a single stream, while architectures like ALBEF [238] and BLIP [123] rely on two separate streams processed separately by ViT and BERT as visual and textual encoders respectively. Single-stream architectures often exhibit monolithic architectural paradigms, i.e., there is a single unified transformer block [237]. Another crucial design choice is between the encoder-only architectural paradigm, where the encoding is directly fed to the output layer, and the encoder-decoder architectural paradigm, where the encoding is fed to a decoder module for answer generation. An interesting work worth mentioning is VLMO [245], which unifies the dual and fusion encoders into a single Mixture of Modality Experts (MoME) transformer and achieved promising results.

6.4. Performance Analysis
The performance of VQA models has evolved rapidly in recent years. The VQA dataset [1] had been a popular benchmark to evaluate the early joint embedding and attention-based models, as illustrated in table-11. The vanilla CNN-RNN-based architectures [24, 25] achieved close to 60% VQA accuracy on the test-dev and test-std splits. Neural module network-based architectures [135] achieved similar accuracy. Architectures employing bilinear pooling boosted the score beyond 65%.

In a few years, VQAv2 [10] replaced VQAv1 as the popular benchmark, and the performance of the associated VQA models is discussed in table-12. It should be noted that the table contains exclusively VLP architectures, as it is practically impossible to achieve such high levels of accuracy without pre-training. The earlier models achieved beyond 70% accuracy on the test-dev and test-std splits of the VQAv2 dataset, which is substantially higher than the non-VLP models on the VQA dataset. Furthermore, VQAv2 is more challenging than VQAv1 due to the absence of linguistic bias.

Currently, the best models achieve around 84% accuracy but severely underperform in the zero-shot setting. The VLP model PaLI [124] achieved state-of-the-art results in the traditional setting while the multimodal LLM GPT-4 [43] achieved the same in the zero-shot setting. Based on the performance results, a shift towards multimodal LLMs with specialized fine-tuning strategies for VQA can be expected in the coming years.

Model Name | Test-Dev | Test-Std
VisualBERT [29] | 70.80 | 71.00
ViLBERT [235] | 71.79 | 72.22
LXMERT [241] | 72.42 | 72.54
OSCAR [31] | 73.16 | 73.44
UNITER [237] | 73.82 | 74.02
PixelBERT [261] | 74.45 | 74.55
VILLA [262] | 74.69 | 74.87
UNIMO [263] | 75.06 | 75.27
ALBEF [238] | 75.84 | 76.04
VinVL [264] | 76.52 | 76.60
METER [265] | 77.68 | 77.64
BLIP [123] | 78.25 | 78.32
GIT [266] | 78.56 | 78.81
SimVLM [267] | 80.03 | 80.34
Florence [268] | 80.16 | 80.36
mPlug [269] | 81.27 | 81.26
GIT-2 [266] | 81.74 | 81.92
OFA [247] | 82.00 | 82.00
Flamingo [126] | 82.00 | 82.10
CoCa [270] | 82.30 | 82.30
BLIP-2 [271] | 82.30 | 82.30
One-Peace [272] | 82.60 | 82.50
VLMO [245] | 82.88 | 82.78
BeiT-3 [273] | 84.20 | 84.00
PaLI [124] | 84.30 | 84.30

Table 12: Performance analysis of VLP architectures in VQA. The models are evaluated on the test-dev and test-std splits of the VQAv2 dataset [10].
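To illustrate the encoders in eq-15 and eq-16 with off-the-shelf pre-trained components, the sketch below extracts a question embedding with BERT and an image embedding with ViT via the Hugging Face transformers library. The checkpoint names and the use of the [CLS] token as a pooled feature are common conventions rather than the exact recipe of any surveyed model.

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor

# h_q = BERT(q)  (eq-15)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
tokens = tokenizer("What is the name of the animal?", return_tensors="pt")
with torch.no_grad():
    h_q = bert(**tokens).last_hidden_state[:, 0]      # [CLS] embedding, shape (1, 768)

# h_v = ViT(v)  (eq-16)
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = AutoModel.from_pretrained("google/vit-base-patch16-224-in21k")
pixels = processor(images=np.zeros((224, 224, 3), dtype=np.uint8), return_tensors="pt")
with torch.no_grad():
    h_v = vit(**pixels).last_hidden_state[:, 0]       # [CLS] embedding, shape (1, 768)

print(h_q.shape, h_v.shape)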
7. Evaluation Metrics
VQA is often associated with the Visual Turing Test [98], which requires a well-defined and sophisticated evaluation metric identical to human evaluation. Ideally, the metric should produce a single score for each generated answer based on its similarity with a human-annotated answer. The scores can be further aggregated to produce a single score for a particular model.

Evaluation of natural language answers isn't straightforward. For instance, the answers "cat" and "feline" might be semantically similar, but the answer "feline" is rarely used by humans, making the answer undesirable. On the other hand, semantically similar words shouldn't be harshly discouraged. The answers "cat" and "cats" are two separate yet semantically similar words. If one of the words produces a high metric score, then the other word should also produce a significantly high metric score. The following subsections will discuss the evaluation of classification and generative answers.

7.1. Evaluation of Classification Answers
The standard accuracy metric has been a popular choice to evaluate classifier-based VQA models. Vanilla accuracy disregards the semantic similarities between words and only considers a question correct when the predicted answer exactly matches the ground truth. The WUPS score is a softer form of accuracy that accredits partially correct answers. VQA accuracy has been a popular metric that regards an answer as correct based on the number of answer annotations it matches. In such datasets, a single question must have multiple answer annotations. The VQA dataset [1] ensured 10 answer annotations per question and deemed an answer to be fully correct if it matches at least 3 of the 10 answer annotations. Otherwise, the answer is partially correct based on the number of matched annotations.

7.2. Evaluation of Generative Answers
Generative answers are tougher to evaluate, especially if the generated answer is long. Machine translation metrics like the BLEU score [274] are often employed to measure the similarity between the ground truth and the generated answer. Apart from assessing similarity, the models face challenges in evaluating subjective and controversial answers. Controversial and unsafe answers are usually avoided in VQA, with limited work covering these topics. Nevertheless, with the increasing popularity of multimodal LLMs [127, 43], both VQA and Visual Dialog systems need to develop robust metrics for generative answer evaluation.

8. Challenges and Opportunities
Being a multimodal task, VQA faces the classical multimodal challenges – the representation of multimodal data, the fusion of the modalities, and co-learning between the two modalities [76]. However, this section will explore some of the domain-specific challenges along with possible mitigation strategies.

8.1. Dataset Challenges
The biggest and oldest challenge in VQA dates back to its early days, when there was an unavailability of large-scale datasets that could be used as benchmarks. Researchers aimed to create a dataset that represents realistic or natural images with a diverse range of QA pairs. The seemingly straightforward challenge was first attempted by Malinowski and Fritz [93] with the DAQUAR dataset, although the dataset was restricted to indoor scenes only. The VQA dataset by Antol et al. [1] gained widespread popularity but exhibited an imbalanced class distribution. The simple task of establishing a generalized open-ended benchmark turned out to be far more difficult.

8.1.1. Generalized Open-Ended Benchmark
To get a better grasp of the difficulty in establishing a benchmark, let's consider a hypothetical dataset that, in order to be idealistic, must contain visuals of every conceivable visual class and concept. For instance, if images of zebras are absent in our hypothetical dataset, the trained model will not know what a zebra is. Similarly, the training set must cover concepts and activities like reading, walking, etc. The challenge will seem overwhelming when one comprehends the vast number of existing visual classes and concepts. The hypothetical dataset must also be linguistically sound, i.e., the associated model trained on the dataset must be able to comprehend questions containing any word as long as the question is semantically correct.

Altogether, the hypothetical VQA dataset must be sufficiently large to encapsulate all visual and linguistic entities along with their associated intricacies. Although creating such a dataset might seem practically impossible, there can be a workaround by decoupling the training and testing phases. Prior to standard VQA training, contemporary methods pre-train the models on various vision and language tasks with larger datasets. As discussed in section-6.2.3, vision language pre-training (VLP) [26, 27] enables the models to transfer knowledge from other related tasks and understand concepts beyond the limited VQA training data. However, the challenge of establishing a generalized VQA dataset is now reformulated to the challenge of establishing a standard VLP dataset.

8.1.2. Evaluation Dataset
The test split should be capable of rigorously evaluating the performance of models in various VQA sub-tasks by reflecting the obstacles in realistic scenarios. As VQA models become increasingly capable, the task of evaluation becomes increasingly difficult. Similar to designing a VQA benchmark, designing an ingenious evaluation dataset that is difficult for the state-of-the-art VLP models can be equally challenging. We should also consider that a large amount of training data might provide high accuracy on the benchmarks, but a higher accuracy cannot always be associated with a better architecture. The test split should be able to differentiate such models and evaluate other aspects of the models apart from the accuracy score.

8.1.3. VLP Dataset
VLP datasets are usually large-scale datasets of image-text pairs automatically extracted from online sources.
Table 13: Overview of various Evaluation Metrics used in VQA

Metric | Measures | Use Case
Accuracy | Exact answer match | MCQs, single answer annotation
VQA-Accuracy | No. of answers matched (at most 3) | Multiple answer annotations
Wu-Palmer Similarity (WUPS) | Semantic connotation difference | Softer form of accuracy
BiLingual Evaluation Understudy (BLEU) [274] | N-gram co-occurrences between actual and predicted answer | Long generative answers
Mean Per Type (MPT) | Arithmetic/harmonic mean over individual question types | Unbalanced question types
Average Normalized Levenshtein Similarity (ANLS) | Levenshtein distance | Reasoning evaluation; smooth error penalization
F1-Score | Harmonic mean of precision and recall | Biased datasets
Human Judgement | Subjective human opinion | Subjective answers, controversial topics
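The VQA accuracy described in sec-7.1 (an answer is fully correct if it matches at least 3 of the 10 human annotations, and partially correct otherwise) can be computed as follows; the averaging over annotation subsets used by the official evaluation script is omitted here for brevity.

def vqa_accuracy(predicted, human_answers):
    """Soft VQA accuracy: min(#matching annotations / 3, 1), as described in sec-7.1."""
    matches = sum(1 for a in human_answers if a.strip().lower() == predicted.strip().lower())
    return min(matches / 3.0, 1.0)

# Example: 10 human annotations for one question.
annotations = ["giraffe"] * 7 + ["zebra"] * 2 + ["deer"]
print(vqa_accuracy("giraffe", annotations))   # 1.0   (at least 3 matches)
print(vqa_accuracy("zebra", annotations))     # 0.67  (2 matches, partially correct)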

assurance is difficult for such large-scale datasets, it often leads detrimental in the medical domain. Hallucinations occur when
to image-text misalignment issues and data redundancy [275]. LLMs produce misinformation or erroneous responses with
VLP datasets are also primarily English-based and translations high confidence. The penalty of misinformation can be dire
of native English answers are usually erroneous [124]. Cur- in the medical domain, potentially costing the life of a patient.
rent training strategies relying on multilingual captioning [276]
use human-annotated captions to overcome annotation artifacts. 8.2.3. Miscellaneous
However, cross-lingual performance is still low compared to its Med-VQA is considerably closer to knowledge-based VQA
native English counterparts [124]. Potential research directions (KB-VQA) as Med-VQA questions often require medical
may include introducing non-English VLP datasets or improv- knowledge along with a high degree of image understanding
ing cross-lingual capabilities by enhancing cross-lingual model and answer precision. Medical data is also constantly being
architecture. updated which might make the models outdated if no form of
8.2. Med-VQA Challenges

The medical domain is a promising field for VQA with its own set of challenges that will be discussed in this section.

8.2.1. Med-VQA Dataset

Med-VQA faces challenges similar to classical VQA in creating a large-scale dataset that encompasses medical images from every category. Datasets dedicated to radiology [277] and pathology [278] contain actual X-rays, CT scans, MRIs, and textbook-extracted pathological images. However, as the medical domain is more diverse, the aforementioned datasets do not necessarily encapsulate the whole domain. Furthermore, certain factors can make the creation of medical VQA datasets more challenging than that of standard VQA. Firstly, collecting high-quality annotated medical data is difficult and expensive. Secondly, the generalization of medical data is not as straightforward as standard VQA data since images might have subtle changes that can drastically change the generated answer.

8.2.2. Multimodal Medical LLMs

Med-VQA is also experiencing the trend of using LLMs and their multimodal counterparts [279]. LLMs can be used as modules for comprehensive answer generation while MLLMs can be deployed as an end-to-end VQA system. BiomedGPT [279] and LLaVa-Med [280] are examples of multimodal LLM-based architectures dedicated to the medical domain and have shown great performance in Med-VQA and Medical Image Dialog. Nevertheless, these LLMs come with their own set of problems, primarily hallucination [281], which can be more detrimental in the medical domain. Hallucinations occur when LLMs produce misinformation or erroneous responses with high confidence. The penalty of misinformation can be dire in the medical domain, potentially costing the life of a patient.

8.2.3. Miscellaneous

Med-VQA is considerably closer to knowledge-based VQA (KB-VQA) as Med-VQA questions often require medical knowledge along with a high degree of image understanding and answer precision. Medical data is also constantly being updated, which might make the models outdated if no form of online learning is incorporated. Online learning enables machine learning models to continuously update their parameters by training on real-time data. Establishing a system based on online learning in the medical domain has its challenges, given the fact that there can be no room for misinformation and error.

8.3. Model Evaluation Challenges

The improved performance of state-of-the-art models increased the difficulty in model evaluation. A higher accuracy score does not imply a better architecture, and other factors should be considered along with accuracy. This subsection will be centralized around the evaluation practices in the current VQA literature.

8.3.1. Generalization, Robustness, and Consistency

Historically, models were evaluated based on their accuracy scores [1]. However, relying solely on accuracy doesn't ensure the evaluation of a model's generalization capabilities, consistency, and robustness [282]. Models with more parameters trained on larger datasets may have better accuracy scores but might be prone to noise and corruption effects on the input modalities. Furthermore, the models may not be able to predict beyond training data or consistently produce correct answers.

Generalization is defined as the ability of the model to extend its performance beyond the distribution of the dataset, i.e. its extensibility in a novel setting. Zero-shot settings [96] or specialized metrics [282] can be used to evaluate the generalization capabilities of a model. On the other hand, robustness is defined as the reluctance of the model to change its prediction
when introduced to artifacts, corruptions, or noise in the multimodal inputs. Consistency refers to the logical coherence of the answers, i.e. related to visual entailment.

Both robustness and consistency in the textual domain have been thoroughly evaluated in the VQA literature [283, 284] as VQA models often struggled with paraphrased questions, syntactic errors, grammatical errors, etc. Robustness also highlights the challenges of VQA models in realistic settings as seen in VizWiz [36], emphasizing the fact that real-life data and questions are not always in idealistic conditions. The evaluation of the generalization, robustness, and consistency of VQA models is regaining interest with the trends of VLP and multimodal LLMs in VQA [285].
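One simple way to probe robustness in the sense used here is to measure how often a model changes its answer when the visual input is corrupted. The sketch below assumes a hypothetical model(image, question) callable returning an answer string and uses additive Gaussian noise as the corruption; dedicated benchmarks use a much broader set of corruptions, severities, and question perturbations.

```python
import numpy as np


def prediction_flip_rate(model, samples, sigma=0.05, seed=0):
    """Fraction of samples whose answer changes under Gaussian image noise.

    `model` is a hypothetical callable: model(image, question) -> answer.
    `samples` is an iterable of (image, question) with images as float
    arrays scaled to [0, 1].
    """
    rng = np.random.default_rng(seed)
    flips, total = 0, 0
    for image, question in samples:
        clean_answer = model(image, question)
        noisy_image = np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0)
        noisy_answer = model(noisy_image, question)
        flips += int(noisy_answer != clean_answer)
        total += 1
    return flips / max(total, 1)
```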
8.3.2. Generative Answer Evaluation Metric

VQA has been traditionally defined as a classification problem that simplifies answer generation by limiting it to the top k answers of the dataset [24]. Discriminative models dominated the classification problem formulation until the recent rise of generative models [59]. Generative models can produce high-quality long answers suitable to elucidate "how" and "why" type questions. Due to the shift from the classification formulation, VQA lost the simplicity of evaluating classes [1] as free-form answers are harder to evaluate. Following the discussion in sec-7.2, evaluation of generative answers is difficult considering subjectivity and the variations in answer style.

8.4. Rational VQA

A utopian goal in VQA is to design VQA models as rational as human beings, which can be done by addressing the bias in datasets, questioning the reasoning capabilities of the models, and understanding the rationale behind answer generation.

8.4.1. Bias in VQA

A dataset can inherently have redundant patterns that we do not want our model to follow. Training a VQA model is analogous to teaching children – often when trying to teach a child to do something in a specific way, they may pick up a few "shortcuts" that will allow them to complete that particular task faster but in a different way. However, disregarding the intended way may affect their performance in a different yet similar task. A VQA model being trained from a biased dataset might perform well using shortcuts for that particular dataset or setting but, consequently, will lose generalization capabilities by showing subpar performance in other settings.

For a better illustration of dataset bias, let us consider a dataset where all the color-based questions on apples have the answer "red". A model trained on this dataset will associate the color red with apples. If the image of a green apple is shown to the model and a color-related question is asked, the model will have a high probability of predicting "red" due to the strong correlation between the textual color-based question and the textual answer. Models are more likely to find such correlations within the same modality which, in this case, is the textual modality. Such a model can be considered blind as the model is unable to utilize the visual information.

VQA models typically exhibit an association of textual questions with textual answers while ignoring the visual input [100, 10]. This tendency is termed linguistic bias, which usually occurs due to dataset imbalance. Goyal et al. [10] used counterexamples in their dataset to overcome linguistic bias. Redistributing the train, validation, and test splits has also been a popular strategy in mitigating bias [132].

Recent studies have shed light on other forms of bias, posing challenges for modern VQA architectures. Hirota et al. [72] highlighted gender and racial bias in VQA datasets, emphasizing the need to address these issues to develop safe VQA models. Models trained on a particular dataset often exhibit their own styles in answering the question that might result in cross-dataset mismatch [286]. The design focus should be towards creating datasets that are inclusive to everyone, regardless of their background.
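A quick diagnostic for this kind of linguistic bias is to score a "blind" baseline that ignores the image entirely and always predicts the most frequent training answer for each question prefix (e.g., the first few words such as "what color is the"). The following is a minimal sketch under these assumptions; a high blind-baseline accuracy indicates strong language priors in the dataset rather than good visual understanding.

```python
from collections import Counter, defaultdict


def question_prefix(question: str, n_words: int = 4) -> str:
    return " ".join(question.lower().split()[:n_words])


def blind_baseline_accuracy(train, test, n_words: int = 4) -> float:
    """Accuracy of answering from question priors alone (no image).

    `train` and `test` are iterables of (question, answer) pairs.
    """
    counts = defaultdict(Counter)
    for question, answer in train:
        counts[question_prefix(question, n_words)][answer.lower()] += 1
    correct, total = 0, 0
    for question, answer in test:
        prefix = question_prefix(question, n_words)
        predicted = counts[prefix].most_common(1)[0][0] if counts[prefix] else ""
        correct += int(predicted == answer.lower())
        total += 1
    return correct / max(total, 1)
```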
8.4.2. Model Reasoning

With the advancements in model architecture and increased model accuracy in difficult VQA tasks, one might argue whether VQA models are actually capable of logical reasoning over multimodal data or simply encapsulate and extend the patterns found in the training data. Do models have the capabilities to understand the question-answering paradigm on a visual input? One may not expect human-like modeling of the world, but with the increased generalization capacities, one can expect close to human-like reasoning. Several datasets [11, 94, 13] challenged the rationale of VQA models and their reasoning capabilities. Researchers continue to work on improving the logical consistency of the models and their ability to reason over complex problems.

8.4.3. Model Explainability

A VQA model should be able to explain or refer to particular data on how it came to a particular conclusion. A model is termed explainable if, during an inference, the user can perfectly understand the logical sequence of the input data being processed to the answer output. Large unified architectures like BLIP [123], Unified-VLP [248], etc., and multimodal LLMs like GPT-4 [43], KOSMOS-1 [127], took both the accuracy and generalization capabilities of VQA models to new heights at the cost of model explainability. Several works [287, 288] addressed model explainability in VQA while proposing explainable architectures. However, state-of-the-art models [247, 273, 245] are changing the explainability landscape of VQA as the models are viewed as a black box or a network of black-box modules.

8.5. Zero-shot VQA (ZS-VQA)

Generalization is an important concept in the field of machine learning and is key to the development of Artificial General Intelligence (AGI) [289]. Every machine learning model is expected to exhibit some form of generalization as the models are tested on a different split of the dataset while testing. While the particular samples in the test split might be different, both the training and testing splits usually come from the same data distribution.
8.5.1. ZS-VQA using Out-of-distribution Data

Let us consider an example where a machine learning model is expected to predict based on integer values within the input range [10, 30]. During training, the model has seen all the discrete sample values excluding 13 and 15. If these particular values are used while testing, we can say that the testing samples are different as they were not present in the training set. Nevertheless, their distribution is the same as that of the training set. On the contrary, if a large value like 100 is fed to the model while testing, we can conclude the model is being tested on out-of-distribution data.

Zero-shot VQA is a problem setting for VQA where the performance of the model is evaluated on data that is beyond the training data of the model. Going beyond the training data can be defined in multiple ways; one of them is using a test split which comes from a different distribution than the training split [96]. Formally, for the VQA model trained on samples from the joint probability distribution D_{V,Q}, we define the zero-shot setting for VQA as the evaluation setting of the model on a different distribution D'_{V,Q}.

A VQA dataset has visual and textual components. An ideal zero-shot setting will have both visual and textual components for evaluation from a different distribution. Teney and Hengel [96] proposed an out-of-distribution (OOD) split on the question-answer set for the Visual7W dataset [101]. The resplitting guarantees that there will be at least one unseen word in every question in the validation and test split. However, the visual concepts do not necessarily come from a different distribution, i.e. it is not guaranteed that the model will be asked questions on unseen visual concepts. Farazi et al. [290] extended the setting to handle novelty in both visual and semantic concepts.
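A resplit in the spirit of Teney and Hengel [96] can be approximated by holding out a small vocabulary of question words: any QA pair containing a held-out word goes to the zero-shot evaluation split, so that every evaluation question contains at least one word unseen during training. The following is a minimal sketch under that assumption, with `samples` taken to be a list of (image_id, question, answer) triplets; it does not enforce novelty of the visual concepts.

```python
def zero_shot_resplit(samples, held_out_words):
    """Split samples so every evaluation question contains a held-out word."""
    held_out = {w.lower() for w in held_out_words}
    train, evaluation = [], []
    for image_id, question, answer in samples:
        words = set(question.lower().replace("?", "").split())
        target = evaluation if words & held_out else train
        target.append((image_id, question, answer))
    return train, evaluation
```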
8.5.2. ZS-VQA on VLP Models

While resplitting a dataset can theoretically ensure mutually exclusive training and evaluation distributions for both modalities, it is practically difficult to achieve and comes at the cost of a reduced dataset. Notably, the zero-shot splits for Visual7W by Teney and Hengel [96] had only 25% of the number of test images in the original split. Additionally, imposing the visual concepts to come from a different distribution further decreases the size of the test split.

Instead of creating out-of-distribution splits, we shall redefine the zero-shot setting for VQA by using models that weren't specifically trained in VQA, i.e. the model has been trained on other multimodal vision-language tasks like Image Captioning [126], Image Conditioned Masked Language Modeling [291], etc. For instance, Guo et al. [75] used a frozen image encoder, a text encoder, and an LLM to achieve a zero-shot setting where the modules are frozen, i.e. they haven't been trained on the task of VQA.

8.5.3. Vision-Language Interfacing

The modular ZS-VQA architectures [111, 125] can simplify VQA as the task of interfacing between Vision and Language (VL). Images are seen as unstructured data, and VQA can be deemed as an interface for extracting meaningful information from large volumes of unstructured data that connects with the textual modality. Analogous to image captioning, VQA can be considered a more specific interface due to using questions in the input. Tiong et al. [111] employs the subtask Image Captioning within its architecture, which can be a bottleneck for more advanced ZS-VQA models. Researchers are exploring efficient ways of interfacing between vision and language beyond traditional VL tasks.
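The modular zero-shot pipelines discussed above reduce to a small amount of glue code once a captioner and a language model are available. The sketch below is not the specific architecture of [75] or [111], but a generic caption-then-prompt baseline; caption_image and generate_text are hypothetical callables standing in for any pretrained captioner and any pretrained LLM, neither of which has seen VQA training data.

```python
def zero_shot_vqa(image, question, caption_image, generate_text) -> str:
    """Answer a visual question without any VQA-specific training.

    caption_image(image) -> str   # hypothetical pretrained captioner
    generate_text(prompt) -> str  # hypothetical pretrained LLM
    """
    caption = caption_image(image)
    prompt = (
        "Image description: " + caption + "\n"
        "Question: " + question + "\n"
        "Answer briefly:"
    )
    return generate_text(prompt).strip()
```

The caption acts as the vision-language interface here, which is exactly why a weak captioner becomes the bottleneck noted above.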
9. VQA in the Bigger Picture

We shall now explore how the domain of VQA can be perceived in the world of multi-modal problems. We try to define the boundaries of VQA and delve deeper into its subdomains.

Figure 7: VQA among other multimodal problems (a taxonomy placing ImageQA and VideoQA within vision-and-language problems, alongside related tasks such as image, video, and audio captioning, knowledge-based VQA, visual reasoning, visual grounding, visual dialogue, and audio question answering).

9.1. Multimodal Question Answering (MQA)

In this literature, VQA often refers to the traditional setting where the model tries to generate an answer from a single image and a single question [1, 24]. However, as previously stated, VQA can be generalized to any form of visual input that is not limited to a single image. For instance, one of the recently evolving sub-domains of VQA, VideoQA [3, 4], takes a video as the visual input. Following fig-1, the video can be either unimodal or multimodal. The key challenge of multimodal VideoQA is the processing of auditory data by a VQA system as it introduces a new modality to our previously established bimodal task. The introduction of auditory data also brings the opportunity to extend QA beyond bimodality.

Multimodal Question Answering (MQA) can be theorized as a superset of VQA where any form of multimodal input with a question is given to generate multimodal answers as the output. Let's define a multimodal question-answering model M, such that,

M : (X_{i_1}, X_{i_2}, ..., X_{i_m}, Q_k) → (Y_{j_1}, Y_{j_2}, ..., Y_{j_n})    (17)

where both the input, X, and output, Y, are multimodal from the modalities {i_1, i_2, ..., i_m} ∈ M and {j_1, j_2, ..., j_n} ∈ N, given M, N are sets of modalities. The unimodal question Q_k comes from the modality k. The associated model can be a discriminator, a generator, or a hybrid of both. The input and output modalities
do not necessarily need to match, e.g. the input modalities of VQA are both visual and textual while the output modality is textual only.

MQA is not strictly equivalent to VQA as there exists a set of multimodal tasks that require no visual input, thereby making them exclusively part of MQA. A good example can be Audio/Spoken QA [292] where QA is done on an auditory input only. Formally, MQA can be defined as a problem where a set of inputs from various modalities is given as the context to a free-form input question in order to generate a multi-modal answer. MQA is also a subset of multimodal tasks in general, which are not limited to question answering, e.g. captioning for images [8], audio [293], and video [294] can be viewed as multimodal captioning.

There are few existing architectures that can work on MQA or most of the generalized multimodal problems. Wang et al. [272] proposed the One Peace architecture to work on any multimodal task using modality adapters. On the other hand, multimodal LLMs [43, 127, 112] have been producing state-of-the-art performance on multimodal problems. Visual ChatGPT [50] works with visual and textual inputs to produce visual or textual outputs analogous to figure-3. Maaz et al. [295] extends the setting of Visual ChatGPT to videos.

9.2. Related Vision-Language (VL) Tasks

The task of VQA can be refined as a task that aims to combine visual and textual modalities and project back to the textual modality. VQA faces challenges in both aligning and fusing the modalities [76], bearing resemblance to other VL tasks. As discussed in section-8.5.2, the pre-trained architectures for non-VQA tasks can be utilized for VQA in both traditional and zero-shot settings. Some of the VL tasks related to VQA shall be discussed in this section.

9.2.1. Visual Captioning (VC)

Visual captioning [296] is undoubtedly the closest task to VQA and has been monumental in the construction of several VQA datasets [24, 3]. VC has inspired several Joint Embedding VQA architectures [25, 24] and has been a popular downstream task for vision language pre-training [126]. The primary goal of VC is to generate a syntactically and semantically correct textual description of the visual input. The associated models require recognition and understanding of visual elements and their relationships. VC can also be performed in knowledge-based settings, benefiting from knowledge retrieval methodologies. Image captioning is the most popular form of visual captioning while Video Captioning has been gaining popularity recently.

9.2.2. Visual Commonsense Reasoning (VCR)

VQA models need to exhibit some form of visual reasoning capabilities as many datasets [11, 131] are aimed at testing such capabilities. Often relying on complex and compositional questions, Visual Commonsense Reasoning (VCR) [7] can be viewed as an extension of VQA that highlights a key limitation of VQA systems – the lack of common sense and reasoning capabilities. A commonsense-based question like "What is the color of the sun?" will often be incorrectly answered by a system specifically trained on VQA. The VCR dataset [7] has been instrumental in this domain, followed by other generalized reasoning datasets like GQA [12].

9.2.3. Visual Grounding (VG)

Visual Grounding (VG) [101] is defined as the problem where visual elements relevant to the textual question have to be located, i.e. the bounding boxes of the visual elements must be provided. The located element can be a person, action, object, etc. VG can also be considered as a specific setting of VQA. VG models must have a comprehensive understanding of visual attributes along with their location in the visual input.

The Visual Toloka Challenge [297] is an interesting setting of VG that aims to ask a question about a visual element that is implicitly mentioned in the question. The simplicity can be achieved through logical congruence, e.g. an image of a human face with the question "What is used to smell?" will have the output bounding box surrounding the nose of the human.

9.2.4. Inverse VQA (iVQA) and Visual Question Generation (VQG)

The task of Inverse VQA (iVQA) [298] deals with generating a question for an image-answer pair and can be generalized to any visual input with a textual answer. A similar task of generating textual questions based on visual input is termed Visual Question Generation (VQG) [299, 52]. The most straightforward strategy for VQG is to use a Visual Captioning model and incorporate automated techniques that convert textual captions to textual questions. However, the textual caption is usually generalized while the questions can have different paradigms.

Retrieval models are often employed to use the top k captions to produce multiple questions, along with training end-to-end generative models [299]. iVQA relies on similar approaches to answer generation but must consider the answer during QA generation. For VQG, the variations in questions along with their semantic and syntactic quality determine the quality of the model. iVQA emphasizes the relevancy of the answer and visual content with the generated question.

Both iVQA and VQG can play pivotal roles in dataset creation and data augmentation along with many real-world applications like automated education systems. Zeng et al. [300] utilized video descriptions to generate synthetic QA pairs that are later used to train video QA models. Similar techniques are employed by Changpinyo et al. [301], relying on image captions to generate QA pairs and utilizing the abundance of image-caption datasets to train large-scale VQA and Zero-Shot VQA models. VQG can be integrated as an important training module in the contemporary modular VQA architectures [111] and to fine-tune VLP-based methods on downstream tasks (Wang et al. [247], Chen et al. [124]).
gies. Image captioning is the most popular form of visual cap- ployed by Changpinyo et al. [301] relying on image captions to
tioning while Video Captioning has been gaining popularity re- generate QA pairs and utilize the abundance of image-caption
cently. datasets to train large-scale VQA and Zero-Shot VQA mod-
els. VQG can be integrated as an important training module
9.2.2. Visual Commonsense Reasoning (VCR) in the contemporary modular VQA architectures [111] and to
VQA models need to exhibit some form of visual reason- fine-tune VLP-based methods on downstream tasks Wang et al.
ing capabilities as many datasets [11, 131] are aimed at testing [247], Chen et al. [124].
such capabilities. Often relying on complex and compositional
questions, Visual Commonsense Reasoning (VCR) [7] can be 9.2.5. Embodied and Interactive QA
viewed as an extension of VQA that highlights a key limitation Das et al. [66] introduced EmbodiedQA where an agent has
of VQA systems – the lack of common sense and reasoning to navigate through a 3D environment and answer related ques-
The task shares similarities with navigation in a 360° video [302] but incorporates reinforcement learning (RL) elements as the agent has to take actions learned from the environment alongside question answering. An emerging domain closely tied to EmbodiedQA is Interactive QA (IQA) [303] where an RL agent is required to navigate through an interactive scene to answer related questions. Such tasks require an RL agent's proficiency in numerous tasks such as detection, navigation, environment manipulation, question answering, and so on.

9.2.6. Miscellaneous Tasks

Categorization of visual elements based on some relevant describing attributes, e.g. categorizing clothes into t-shirts, pants, etc., is a form of product-based VL categorization [304]. Sentiment analysis has been popular in visual and textual domains but its multimodal VL setting [305] has been less explored. The intersection of information retrieval (IR) with other modalities is cross-modal retrieval [306, 307], which aims to retrieve any form of multimodal output based on a multimodal input.

Visual Entailment [240] explores the semantic alignment of the visual input with the text. Tasks like Multimodal Machine Translation [308] aim to translate texts in visual inputs and have important use cases in realistic settings. VQA also interacts with Reinforcement Learning (RL) in tasks that require some form of action to be performed by the model. Embodied QA [66] explores such a task where an agent is randomly located in a virtual environment and asked a question for which it has to take multiple actions in the virtual environment to learn the actual answer. Following the discussion in sec-2.4, Visual Dialog is also increasing in popularity with the rise of multimodal large language models (MLLMs) [127, 112].

9.3. Sub-domains of VQA

As seen in section-9.1, the multimodal task of VQA can be extensively generalized. In this section, we shall explore different settings of VQA that are bound within the definition of VQA, i.e. a visual input is strictly required to answer the textual question. Usually, the subdomains of VQA occur due to variations in the visual modality.

Figure 8: Comparison of VQA problems with (a) a single image (Q: How many giraffes are there? A: 4), (b) a pair of images (Q: Which image has more giraffes? A: The second image.), and (c) a set of images (Q: How many giraffes are there in all the images? A: 10). The problem can be specified as a counting-based problem.

9.3.1. Single Image QA (SIQA)

The traditional setting of VQA, Single Image QA, is defined as the textual answer generation or classification problem of a textual question on a single image. SIQA has been addressed early in the VQA literature and our survey mostly covers datasets and techniques in this particular setting. Following eq-1, 17, for a VQA model, M, an image X_i, a textual question X_t, and a textual answer Y_t, we can formally define SIQA as,

M : X_i, X_t → Y_t    (18)

9.3.2. VideoQA

VideoQA has been discussed extensively in sec-5.5.1. Similar to eq-18, for a video input X_vid, the problem can be formally defined as,

M : X_vid, X_t → Y_t    (19)

VideoQA can have further variations in modality based on the audio feed of the video. Some tasks require processing multimodal videos [5], i.e. videos with an audio feed, while others don't.

9.3.3. Change Detection QA (CDQA)

In change detection [309], a pair of images is fed to a model that generates a binary image highlighting the areas where change occurs, i.e. a pixel-wise binary classification. CDQA [74] aims to perform question-answering on such inputs and generate a textual answer instead. Given a pair of images, X_{i_1}, X_{i_2}, and a question, X_t, the model is required to generate an appropriate answer, Y_t, related to the change in the images. Following eq-17, CDQA can be defined as,

M : X_{i_1}, X_{i_2}, X_t → Y_t    (20)
9.3.4. Image Set QA (ISQA)

ISQA is a generalized form of CDQA where, instead of a pair of images, a set of images along with a question is given as the input to a model to generate the textual answer as the output [2]. ISQA can take various forms, e.g. a question is asked about a specific image while the other images serve as the context. Unlike the frames of a video, the images in the set are less likely to exhibit a temporal relationship with each other. The problem can be formally defined as,

M : (X_{i_1}, X_{i_2}, ..., X_{i_n}, X_t) → Y_t    (21)

where M is our VQA model, {X_{i_1}, X_{i_2}, ..., X_{i_n}} ∈ I is a set of images, X_t is the textual question, and Y_t is the textual answer. ISQA can be reduced to CDQA, which deals with a pair of images and a change-related question, or can be reduced to any form of Image-pair VQA.
ing data is sufficient to train the VLP models and provided a
ages and a change-related question or can be reduced to any
framework for automatic generation of VQA data at volume
form of Image-pair VQA.
from captioning data. VLP often relies on a mixture of large-
scale datasets [124] and the associated models are usually ben-
9.3.5. VQA 360° efitted by scaling using a higher volume of data.
360° images are emerging as popular choices of visual in- As seen in sec-8.1.3, VLP datasets are shifting VQA models
put that expand beyond the conventional field of view found from the boundaries of substantially smaller VQA training data
in standard images. Question answering on 360° images can to generalized domains. Table-3 highlights the decline of tradi-
be challenging as it requires the model to extract spatial infor- tional datasets in VQA over the years. Current research works
mation from visual content around the camera’s optical center. are directed toward establishing difficult evaluation benchmarks
Furthermore, a 360° dataset should not only include 360° im- with a focus on generalization, zero-shot setting, and reasoning
ages but also a diverse set of questions challenging the intrin- capabilities. The trends of VQA datasets can be attributed to
sic properties of 360° images. Chou et al. [65] introduces the the enhanced capabilities of VQA models in the past few years.
task of VQA within the context of 360° images by presenting a
natural 360° image VQA dataset and a novel model to perform 10.1.2. Cross-lingual and Multi-lingual datasets
multi-level spatial reasoning. 360° image VQA can be extended Most of the VQA literature worked with English QA pairs
to 360° videos with audios [310]. and disregarded the performance of models in other languages.
However, recent works have addressed this issue by proposing
9.3.6. Miscellaneous Modalities the VQA in cross-lingual and multi-lingual settings. Pfeiffer
GIF-QA [64] is analogous to VideoQA but processes a GIF et al. [312] extended the standard GQA [12] dataset to 7 new
instead of a video. GIFs do not have any audio feed and are languages. Changpinyo et al. [313] proposed a QA transla-
analogous to image sequences. Single Image QA can have tion framework on the multilingual image captioning dataset
further variations like using images of graphs, charts, docu- Crossmodal-3600 [276]. Liu et al. [314] addressed the perfor-
ments, infographics, etc [17, 23, 53, 22]. QA on PDFs [56] mance gap between English and other languages by implement-
is an emerging topic that requires document processing and vi- ing simple modifications to the multi-lingual training setup.
sual understanding capabilities equivalent to language and vi- Further analysis of question types and languages highlighted a
sion models respectively. SlidesQA [57] proposes QA on slide zero-shot performance gap and difficulties in answering certain
decks which is similar to ISQA but requires an understanding question types in certain languages.
of intra-slide and inter-slide relationships.
10.1.3. Foundational Models
AI models are often trained on a variety of tasks with a large
10. Trends, Open Problems, and Future of VQA amount of data and are referred to as foundational models. The
unimodal Large Language Model (LLM) BERT [233] is an ex-
The rapidly evolving domain of VQA has myriad open chal- ample of a foundational model in the textual modality and one
lenges and problems that future researchers should explore. The of the first entries in the domain. State-of-the-art VQA models
potential applications of VQA can also be excellent opportuni- such as OFA [247], Pali [124], BEiT-3 [273], etc. are VLP ar-
ties for engineers and entrepreneurs. VQA has seen plenty of chitectures having multiple pre-training objectives and can be
revolutions with the advent of deep-learning-based architecture considered as foundational models.
[25, 24], transformer architecture [244, 29, 241], and currently, The key advantage of foundational models is the adaptability
LLMs and generative AI [75, 127]. The following section ex- of a pre-trained model to a variety of downstream tasks as seen
plores the trends of today’s VQA systems while highlighting in sec-6.2.3. This property ensures impressive performance
unexplored problems along with future research directions. [247, 273] that is not limited to VQA but extends to similar
vision-language domains as seen in section-9.2. Recently, Wang et al. [272] extended the modalities further by proposing a generalized foundational model framework for multiple modalities using modality adapters and a fusion encoder. Modality adapters are modules for task-specific finetuning and have been primarily used in VQA as Vision Adapters [315].

10.1.4. Generative AI

The generative approach in AI deals with models comprehending the patterns from the training data to generate new data. The Generative Pre-trained Transformer (GPT) [59] is one of the most popular generative textual models. In contrast, the discriminative approach is generally associated with non-probabilistic classifiers that are iteratively trained to distinguish between output classes, e.g. a neural network-based classifier. VQA is often modeled as a generative task as seen in 6.1.4. Recent trends show GPT-based LLMs [60, 43] being adopted as modules in VQA systems, primarily for Zero-Shot VQA (ZS-VQA) [75]. Furthermore, multimodal LLMs discussed in the next subsection are generative models showing promising results in traditional VQA, ZS-VQA, and Visual Dialogue as seen in 2.4.

10.1.5. Multimodal LLMs (MLLMs)

Researchers are aiming to extend the capabilities of LLMs to other modalities and coined the term Multimodal LLM to define these LLMs. The successor to the popular ChatGPT model, GPT-4 [43], has been marketed as multimodal, i.e. it will be able to process inputs from modalities other than language and produce output in those modalities. However, the scope of modalities beyond language is assumed to be limited to vision only. Kosmos-1 [127] and Kosmos-2 [112] are recent advancements in multimodal LLMs capable of showing satisfactory performance in a variety of unimodal and multi-modal tasks.

In recent years, several learning paradigms have evolved to train MLLMs with varying architectures [316]. Initially, MLLMs relied on pre-training and fine-tuning analogous to the Vision Language Pre-training (VLP) for VQA. Afterward, they relied on prompt engineering and instruction tuning, two widely adopted techniques in modern LLM literature. A set of MLLMs extended GPT-based backbones to visual modalities [162, 317, 295]. To bridge the gap between the modalities, MLLMs often relied on separate adapters, e.g. Llama Adapter [318, 319], used as learnable interfaces.
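Adapters of this kind are typically small bottleneck networks inserted into an otherwise frozen backbone so that only a few parameters are tuned per task. The PyTorch sketch below shows a generic residual bottleneck adapter; it is a simplified illustration rather than the specific design of LLaMA-Adapter [318, 319] or the vision adapters in [315].

```python
import torch
from torch import nn


class BottleneckAdapter(nn.Module):
    """A small residual adapter: down-project, non-linearity, up-project."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)  # start as an identity mapping so the
        nn.init.zeros_(self.up.bias)    # frozen backbone's behavior is preserved

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```

In such setups only the adapter parameters are optimized during fine-tuning, while the backbone weights remain frozen.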
10.2. Open Problems

There are lots of potential areas in VQA that haven't been explored yet. Both the visual and linguistic domains can have variations derived from the generalized problem definition of multimodal question answering. In this section, we shall explore various novel problem statements in VQA.

10.2.1. Non-English VLP Datasets and Models

Linguistically, VQA has experienced a variety of questions ranging from simple binary questions [100] to complex reasoning-based questions [11, 13]. However, most of these models are natively trained on English texts as there is a lack of non-English VLP datasets. An interesting VQA setting that is often overlooked is the cross-lingual setting [312]. However, there is a lack of standardized work on VQA for languages except English and Chinese. There is potential for creating non-English VLP and VQA datasets.

10.2.2. Visual Robustness Evaluation

VQA has been widely adopted in different domains but lacks significant work in domain evaluation. Few works exist [282] on the overall generalization and vision-language robustness of VQA models. Integration of unimodal and multimodal LLMs in VQA systems also introduced new problems like hallucinations [281] for which there is no VQA framework. Although VQA is not restricted to generative AI, object hallucination has been a common phenomenon in vision-language classifiers [320]. Evaluation of VQA models has always been a challenging task and we still have a class of evaluation-related open problems that aim at developing reliable VQA systems.
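Object hallucination can be probed with simple binary questions about objects that are, and are not, present in the image, in the spirit of polling-based evaluations [320]. The sketch below is a minimal diagnostic under two assumptions: a hypothetical model(image, question) callable that returns "yes" or "no", and ground-truth object lists for each image.

```python
def hallucination_rate(model, samples, distractor_objects):
    """Fraction of absent-object probes answered 'yes'.

    `samples`: iterable of (image, present_objects) pairs.
    `model`: hypothetical callable, model(image, question) -> "yes" / "no".
    """
    hallucinated, total = 0, 0
    for image, present_objects in samples:
        for obj in distractor_objects:
            if obj in present_objects:
                continue  # only probe objects that are truly absent
            answer = model(image, f"Is there a {obj} in the image?")
            hallucinated += int(answer.strip().lower() == "yes")
            total += 1
    return hallucinated / max(total, 1)
```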
10.2.3. Multi-Image Counting QA

While the traditional single-image and single-question setting has been widely adopted in many domains, other settings like image-pair and image-set question answering saw little to no work [2, 74]. Counting-based questions as seen in [73] are limited to a single image while multi-image counting is still an open problem. Fig-8 proposes the task of Image Set QA restricted to counting-based questions only.

10.2.4. Green Computing in VQA

Environmental sustainability is a key issue that is addressed by green computing [321]. Ahmad et al. [322] evaluated AI systems in sustainable energy industries while advocating for a greener AI system. Generative AI and LLMs integrated in VQA models are at the frontier of modern AI systems. Researchers have yet to establish significant works for more efficient and greener VQA systems that can ensure sustainability in the long run.

10.3. What's next for VQA?

The future of Visual Question Answering (VQA) holds significant promise as researchers and developers continue to innovate in this domain. The integration of VQA into real-world applications, such as healthcare, autonomous vehicles, and e-commerce, will become more prevalent, demonstrating its practical utility. The incorporation of generative models, such as GPT-4 [43], into VQA systems will also contribute to improved performance and adaptability. Furthermore, addressing challenges related to biases, fairness, and interpretability in VQA will be crucial for ethical and responsible deployment. Apart from the enhanced performance, VQA can expect to see broader applicability and a growing emphasis on ethical considerations, which collectively hold the potential to revolutionize the way humans and machines interact with visual information.
11. Conclusions

Our work went through the datasets and methods sketched in the setting of traditional VQA surveys and then delved deeper into the modern techniques in the context of vision language pre-training (VLP). However, we couldn't expand the discussion on two emerging subdomains of VQA - Zero-Shot VQA (ZS-VQA) and VideoQA. Both of these domains are popular fields of interest for current researchers and should be prioritized by future researchers as well. Additional discussion on the intricacies of transformer-based architecture should also be beneficial in introducing architectural novelty. Nevertheless, we believe that the provided directions will be fruitful in sculpting the domain of VQA in the coming years.

References

[1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, D. Parikh, Vqa: Visual question answering, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 2425–2433.
[2] A. Bansal, Y. Zhang, R. Chellappa, Visual question answering on image sets, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16, Springer, 2020, pp. 51–67.
[3] D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, Y. Zhuang, Video question answering via gradually refined attention over appearance and motion, in: Proceedings of the 25th ACM international conference on Multimedia, 2017, pp. 1645–1653.
[4] Y. Zhong, J. Xiao, W. Ji, Y. Li, W. Deng, T.-S. Chua, Video question answering: Datasets, algorithms and challenges, arXiv preprint arXiv:2203.01225 (2022).
[5] J. Lei, L. Yu, M. Bansal, T. L. Berg, Tvqa: Localized, compositional video question answering, arXiv preprint arXiv:1809.01696 (2018).
[6] V. Mezaris, I. Kompatsiaris, M. G. Strintzis, An ontology approach to object-based image retrieval, in: Proceedings 2003 International Conference on Image Processing (Cat. No. 03CH37429), volume 2, IEEE, 2003, pp. II–511.
[7] R. Zellers, Y. Bisk, A. Farhadi, Y. Choi, From recognition to cognition: Visual commonsense reasoning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6720–6731.
[8] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, H. Laga, A comprehensive survey of deep learning for image captioning, ACM Computing Surveys (CsUR) 51 (2019) 1–36.
[9] A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, D. Batra, Visual dialog, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 326–335.
[10] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, D. Parikh, Making the v in vqa matter: Elevating the role of image understanding in visual question answering, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6904–6913.
[11] J. Johnson, B. Hariharan, L. Van Der Maaten, L. Fei-Fei, C. Lawrence Zitnick, R. Girshick, Clevr: A diagnostic dataset for compositional language and elementary visual reasoning, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2901–2910.
[12] D. A. Hudson, C. D. Manning, Gqa: A new dataset for real-world visual reasoning and compositional question answering, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6700–6709.
[13] C. Zhang, F. Gao, B. Jia, Y. Zhu, S.-C. Zhu, Raven: A dataset for relational and analogical visual reasoning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5317–5327.
[14] P. Wang, Q. Wu, C. Shen, A. v. d. Hengel, A. Dick, Explicit knowledge-based reasoning for visual question answering, arXiv preprint arXiv:1511.02570 (2015).
[15] P. Wang, Q. Wu, C. Shen, A. Dick, A. Van Den Hengel, Fvqa: Fact-based visual question answering, IEEE transactions on pattern analysis and machine intelligence 40 (2017) 2413–2427.
[16] D. Schwenk, A. Khandelwal, C. Clark, K. Marino, R. Mottaghi, A-okvqa: A benchmark for visual question answering using world knowledge, in: European Conference on Computer Vision, Springer, 2022, pp. 146–162.
[17] N. Methani, P. Ganguly, M. M. Khapra, P. Kumar, Plotqa: Reasoning over scientific plots, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 1527–1536.
[18] A. Mishra, S. Shekhar, A. K. Singh, A. Chakraborty, Ocr-vqa: Visual question answering by reading text in images, in: 2019 international conference on document analysis and recognition (ICDAR), IEEE, 2019, pp. 947–952.
[19] A. F. Biten, R. Tito, A. Mafla, L. Gomez, M. Rusinol, E. Valveny, C. Jawahar, D. Karatzas, Scene text visual question answering, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 4291–4301.
[20] Z. Lin, D. Zhang, Q. Tao, D. Shi, G. Haffari, Q. Wu, M. He, Z. Ge, Medical visual question answering: A survey, Artificial Intelligence in Medicine (2023) 102611.
[21] M. Mathew, D. Karatzas, C. Jawahar, Docvqa: A dataset for vqa on document images, in: Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp. 2200–2209.
[22] A. Masry, D. X. Long, J. Q. Tan, S. Joty, E. Hoque, Chartqa: A benchmark for question answering about charts with visual and logical reasoning, arXiv preprint arXiv:2203.10244 (2022).
[23] M. Mathew, V. Bagal, R. Tito, D. Karatzas, E. Valveny, C. Jawahar, Infographicvqa, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 1697–1706.
[24] M. Ren, R. Kiros, R. Zemel, Exploring models and data for image question answering, Advances in neural information processing systems 28 (2015).
[25] M. Malinowski, M. Rohrbach, M. Fritz, Ask your neurons: A neural-based approach to answering questions about images, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1–9.
[26] Z. Gan, L. Li, C. Li, L. Wang, Z. Liu, J. Gao, et al., Vision-language pre-training: Basics, recent advances, and future trends, Foundations and Trends® in Computer Graphics and Vision 14 (2022) 163–352.
[27] F.-L. Chen, D.-Z. Zhang, M.-L. Han, X.-Y. Chen, J. Shi, S. Xu, B. Xu, Vlp: A survey on vision-language pre-training, Machine Intelligence Research 20 (2023) 38–56.
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
[29] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, K.-W. Chang, Visualbert: A simple and performant baseline for vision and language, arXiv preprint arXiv:1908.03557 (2019).
[30] G. Li, N. Duan, Y. Fang, M. Gong, D. Jiang, Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training, in: Proceedings of the AAAI conference on artificial intelligence, volume 34, 2020, pp. 11336–11344.
[31] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, et al., Oscar: Object-semantics aligned pre-training for vision-language tasks, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, Springer, 2020, pp. 121–137.
[32] K. Kafle, C. Kanan, Visual question answering: Datasets, algorithms, and future challenges, Computer Vision and Image Understanding 163 (2017) 3–20.
[33] Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, A. Van Den Hengel, Visual question answering: A survey of methods and datasets, Computer Vision and Image Understanding 163 (2017) 21–40.
[34] J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, S. White, et al., Vizwiz: nearly real-time answers to visual questions, in: Proceedings of the 23rd annual ACM symposium on User interface software and technology, 2010, pp. 333–342.
[35] S. Barra, C. Bisogni, M. De Marsico, S. Ricciardi, Visual question answering: Which investigated applications?, Pattern Recognition Letters 151 (2021) 325–331.
[36] D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, J. P. Bigham, Vizwiz grand challenge: Answering visual questions from blind people, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3608–3617.
[37] D. Gurari, Q. Li, C. Lin, Y. Zhao, A. Guo, A. Stangl, J. P. Bigham, Vizwiz-priv: A dataset for recognizing the presence and purpose of private visual information in images taken by blind people, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 939–948.
[38] Y.-Y. Tseng, A. Bell, D. Gurari, Vizwiz-fewshot: Locating objects in images taken by people with visual impairments, in: European Conference on Computer Vision, Springer, 2022, pp. 575–591.
[39] M. A. Burton, E. Brady, R. Brewer, C. Neylan, J. P. Bigham, A. Hurst, Crowdsourcing subjective fashion advice using vizwiz: challenges and opportunities, in: Proceedings of the 14th international ACM SIGACCESS conference on Computers and accessibility, 2012, pp. 135–142.
[40] E. Brady, M. R. Morris, Y. Zhong, S. White, J. P. Bigham, Visual challenges in the everyday lives of blind people, in: Proceedings of the SIGCHI conference on human factors in computing systems, 2013, pp. 2117–2126.
[41] W. S. Lasecki, P. Thiha, Y. Zhong, E. Brady, J. P. Bigham, Answering visual questions with conversational crowd assistants, in: Proceedings of the 15th International ACM SIGACCESS Conference on Computers and Accessibility, 2013, pp. 1–8.
[42] D. Gurari, K. Grauman, Crowdverge: Predicting if people will agree on the answer to a visual question, in: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 2017, pp. 3511–3522.
[43] OpenAI, Gpt-4 technical report, 2023. arXiv:2303.08774.
[44] D. Gurari, Y. Zhao, M. Zhang, N. Bhattacharya, Captioning images taken by people who are blind, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16, Springer, 2020, pp. 417–434.
[45] C. Chen, S. Anjum, D. Gurari, Grounding answers for visual questions asked by visually impaired people, arXiv preprint arXiv:2202.01993 (2022).
[46] M. P. Salyers, K. A. Bonfils, L. Luther, R. L. Firmin, D. A. White, E. L. Adams, A. L. Rollins, The relationship between professional burnout and quality and safety in healthcare: a meta-analysis, Journal of general internal medicine 32 (2017) 475–482.
[47] B. He, M. Xia, X. Yu, P. Jian, H. Meng, Z. Chen, An educational robot system of visual question answering for preschoolers, in: 2017 2nd International Conference on Robotics and Automation Engineering (ICRAE), 2017, pp. 441–445. doi:10.1109/ICRAE.2017.8291426.
[48] S. Anwar, N. A. Bascou, M. Menekse, A. Kardgar, A systematic review of studies on educational robotics, Journal of Pre-College Engineering Education Research (J-PEER) 9 (2019) 2.
[49] J. Sophia, T. Jacob, Edubot-a chatbot for education in covid-19 pandemic and vqabot comparison, in: 2021 Second International Conference on Electronics and Sustainable Communication Systems (ICESC), 2021, pp. 1707–1714. doi:10.1109/ICESC51422.2021.9532611.
[50] C. Wu, S. Yin, W. Qi, X. Wang, Z. Tang, N. Duan, Visual chatgpt: Talking, drawing and editing with visual foundation models, arXiv preprint arXiv:2303.04671 (2023).
[51] S. Suresh, V. Nagaraj Rao, G. Srinivasa, Gamification of a visual question answer system, in: 2018 IEEE Tenth International Conference on Technology for Education (T4E), 2018, pp. 41–44. doi:10.1109/T4E.2018.00016.
[52] N. Vedd, Z. Wang, M. Rei, Y. Miao, L. Specia, Guiding visual question generation, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 1640–1654. URL: https://aclanthology.org/2022.naacl-main.118. doi:10.18653/v1/2022.naacl-main.118.
[53] A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, A. Farhadi, A diagram is worth a dozen images, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, Springer, 2016, pp. 235–251.
[54] P. Bongini, F. Becattini, A. D. Bagdanov, A. Del Bimbo, Visual question answering for cultural heritage, in: IOP Conference Series: Materials Science and Engineering, volume 949, IOP Publishing, 2020, p. 012074.
[55] A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, H. Hajishirzi, Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension, in: Proceedings of the IEEE Conference on Computer Vision and Pattern recognition, 2017, pp. 4999–5007.
[56] Y. Ding, S. Luo, H. Chung, S. C. Han, Pdf-vqa: A new dataset for real-world vqa on pdf documents, arXiv preprint arXiv:2304.06447 (2023).
[57] R. Tanaka, K. Nishida, K. Nishida, T. Hasegawa, I. Saito, K. Saito, Slidevqa: A dataset for document visual question answering on multiple images, arXiv preprint arXiv:2301.04883 (2023).
[58] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al., On the opportunities and risks of foundation models, arXiv preprint arXiv:2108.07258 (2021).
[59] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., Improving language understanding by generative pre-training, Technical Report, OpenAI (2018).
[60] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901.
[61] A. S. Toor, H. Wechsler, M. Nappi, Biometric surveillance using visual question answering, Pattern Recognition Letters 126 (2019) 111–118.
[62] A. Sarkar, M. Rahnemoonfar, Vqa-aid: Visual question answering for post-disaster damage assessment and analysis, in: 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, IEEE, 2021, pp. 8660–8663.
[63] A. Sarkar, T. Chowdhury, R. Murphy, A. Gangopadhyay, M. Rahnemoonfar, Sam-vqa: Supervised attention-based visual question answering model for post-disaster damage assessment on remote sensing imagery, IEEE Transactions on Geoscience and Remote Sensing (2023).
[64] Y. Jang, Y. Song, Y. Yu, Y. Kim, G. Kim, Tgif-qa: Toward spatio-temporal reasoning in visual question answering, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2758–2766.
[65] S.-H. Chou, W.-L. Chao, W.-S. Lai, M. Sun, M.-H. Yang, Visual question answering on 360deg images, in: Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2020, pp. 1607–1616.
[66] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, D. Batra, Embodied question answering, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1–10.
[67] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: European conference on computer vision, Springer, 2014, pp. 740–755.
[68] J. Xu, T. Mei, T. Yao, Y. Rui, Msr-vtt: A large video description dataset for bridging video and language, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5288–5296.
[69] S. Mori, H. Nishida, H. Yamada, Optical character recognition, John Wiley & Sons, Inc., 1999.
[70] A. Suhr, M. Lewis, J. Yeh, Y. Artzi, A corpus of natural language for visual reasoning, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2017, pp. 217–223.
[71] R. Shrestha, K. Kafle, C. Kanan, A negative case analysis of visual grounding methods for vqa, arXiv preprint arXiv:2004.05704 (2020).
[72] Y. Hirota, Y. Nakashima, N. Garcia, Gender and racial bias in visual question answering datasets, in: Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022, pp. 1280–1292.
[73] M. Acharya, K. Kafle, C. Kanan, Tallyqa: Answering complex counting questions, in: Proceedings of the AAAI conference on artificial intelligence, volume 33-01, 2019, pp. 8076–8084.
[74] Z. Yuan, L. Mou, Z. Xiong, X. X. Zhu, Change detection meets visual question answering, IEEE Transactions on Geoscience and Remote Sensing 60 (2022) 1–13.
[75] J. Guo, J. Li, D. Li, A. M. H. Tiong, B. Li, D. Tao, S. Hoi, From images to textual prompts: Zero-shot visual question answering with frozen large language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 10867–10877.
[76] T. Baltrušaitis, C. Ahuja, L.-P. Morency, Multimodal machine learning: A survey and taxonomy, IEEE transactions on pattern analysis and machine intelligence 41 (2018) 423–443.
[77] D. Zhang, R. Cao, S. Wu, Information fusion in visual question answering: A survey, Information Fusion 52 (2019) 268–280.
[78] S. Lu, M. Liu, L. Yin, Z. Yin, X. Liu, W. Zheng, The multi-modal fusion in visual question answering: a review of attention mechanisms, PeerJ Computer Science 9 (2023) e1400.
[79] K. Kafle, R. Shrestha, C. Kanan, Challenges and prospects in vision and language research, Frontiers in Artificial Intelligence 2 (2019) 28.
[80] D. Yuan, Language bias in visual question answering: A survey and taxonomy, arXiv preprint arXiv:2111.08531 (2021).
[81] A. A. Yusuf, F. Chong, M. Xianling, An analysis of graph convolutional networks and recent datasets for visual question answering, Artificial Intelligence Review 55 (2022) 6277–6300.
[82] A. Mogadala, M. Kalimuthu, D. Klakow, Trends in integration of vision and language research: A survey of tasks, datasets, and methods, Journal of Artificial Intelligence Research 71 (2021) 1183–1317.
[83] Y. Fu, T. Xiang, Y.-G. Jiang, X. Xue, L. Sigal, S. Gong, Recent advances in zero-shot recognition: Toward data-efficient understanding of visual content, IEEE Signal Processing Magazine 35 (2018) 112–125.
[84] J. Chen, Y. Geng, Z. Chen, I. Horrocks, J. Z. Pan, H. Chen, Knowledge-aware zero-shot learning: Survey and perspective, arXiv preprint arXiv:2103.00070 (2021).
[85] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al., Visual genome: Connecting language and vision using crowdsourced dense image annotations, International journal of computer vision 123 (2017) 32–73.
[86] A. K. Gupta, Survey of visual question answering: Datasets and techniques, arXiv preprint arXiv:1705.03865 (2017).
[87] D. Teney, Q. Wu, A. van den Hengel, Visual question answering: A tutorial, IEEE Signal Processing Magazine 34 (2017) 63–75.
[88] S. Hassantabar, Visual question answering: Datasets, methods, challenges and opportunities, 2018.
[89] S. Manmadhan, B. C. Kovoor, Visual question answering: a state-of-the-art review, Artificial Intelligence Review 53 (2020) 5705–5745.
[90] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
[91] H. Sharma, A. S. Jalal, A survey of methods, datasets and evaluation metrics for visual question answering, Image and Vision Computing 116 (2021) 104327.
[92] Y. Srivastava, V. Murali, S. R. Dubey, S. Mukherjee, Visual question answering using deep learning: A survey and performance analysis, in: Computer Vision and Image Processing: 5th International Conference, CVIP 2020, Prayagraj, India, December 4-6, 2020, Revised Selected Papers, Part II 5, Springer, 2021, pp. 75–86.
[93] M. Malinowski, M. Fritz, A multi-world approach to question answering about real-world scenes based on uncertain input, Advances in neural information processing systems 27 (2014).
[94] S. Pandhre, S. Sodhani, Survey of recent advances in visual question answering, arXiv preprint arXiv:1709.08203 (2017).
[95] A. Agrawal, D. Batra, D. Parikh, A. Kembhavi, Don't just assume; look and answer: Overcoming priors for visual question answering, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4971–4980.
[96] D. Teney, A. v. d. Hengel, Zero-shot visual question answering, arXiv preprint arXiv:1611.05546 (2016).
[97] S. A. Hasan, Y. Ling, O. Farri, J. Liu, H. Müller, M. Lungren, Overview
ings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5014–5022.
[101] Y. Zhu, O. Groth, M. Bernstein, L. Fei-Fei, Visual7w: Grounded question answering in images, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4995–5004.
[102] C. Deng, Q. Wu, Q. Wu, F. Hu, F. Lyu, M. Tan, Visual grounding via accumulated attention, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7746–7755.
[103] R. Kuhn, E. Neveu, Political journalism: New challenges, new practices, Routledge, 2003.
[104] L. Yu, E. Park, A. C. Berg, T. L. Berg, Visual madlibs: Fill in the blank image generation and question answering, arXiv preprint arXiv:1506.00278 (2015).
[105] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, W. Xu, Are you talking to a machine? dataset and methods for multilingual image question, Advances in neural information processing systems 28 (2015).
[106] M. H. Rafi, S. Islam, S. H. I. Labib, S. S. Hasan, F. M. Shah, S. Ahmed, A deep learning-based bengali visual question answering system, in: 2022 25th International Conference on Computer and Information Technology (ICCIT), IEEE, 2022, pp. 114–119.
[107] A. Chandrasekar, A. Shimpi, D. Naik, Indic visual question answering, in: 2022 IEEE International Conference on Signal Processing and Communications (SPCOM), IEEE, 2022, pp. 1–5.
[108] S. M. kamel, S. I. Hassan, L. Elrefaei, Vaqa: Visual arabic question answering, Arabian Journal for Science and Engineering (2023) 1–21.
[109] K. Kafle, C. Kanan, An analysis of visual question answering algorithms, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 1965–1973.
[110] K. Marino, M. Rastegari, A. Farhadi, R. Mottaghi, Ok-vqa: A visual question answering benchmark requiring external knowledge, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3195–3204.
[111] A. M. H. Tiong, J. Li, B. Li, S. Savarese, S. C. Hoi, Plug-and-play vqa: Zero-shot vqa by conjoining large pretrained models with zero training, arXiv preprint arXiv:2210.08773 (2022).
[112] Z. Peng, W. Wang, L. Dong, Y. Hao, S. Huang, S. Ma, F. Wei, Kosmos-2: Grounding multimodal large language models to the world, arXiv preprint arXiv:2306.14824 (2023).
[113] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, Z. Ives, Dbpedia: A nucleus for a web of open data, in: international semantic web conference, Springer, 2007, pp. 722–735.
[114] N. Tandon, G. Melo, G. Weikum, Acquiring comparative commonsense knowledge from the web, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014, pp. 154–162.
[115] H. Liu, P. Singh, Conceptnet—a practical commonsense reasoning toolkit, BT technology journal 22 (2004) 211–226.
[116] P. Lu, L. Ji, W. Zhang, N. Duan, M. Zhou, J. Wang, R-vqa: learning visual relation facts with semantic attention for visual question answering, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1880–1889.
[117] W. Lin, Z. Wang, B. Byrne, Fvqa 2.0: Introducing adversarial samples into fact-based visual question answering, arXiv preprint arXiv:2303.10699 (2023).
[118] A. Jain, M. Kothyari, V. Kumar, P. Jyothi, G. Ramakrishnan, S. Chakrabarti, Select, substitute, search: A new benchmark for knowledge-augmented visual question answering, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 2491–2498.
[119] N. Silberman, D. Hoiem, P. Kohli, R. Fergus, Indoor segmentation and support inference from rgbd images, in: European conference on computer vision, Springer, 2012, pp. 746–760.
[120] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland,
of imageclef 2018 medical domain visual question answering task, Pro- D. Borth, L.-J. Li, Yfcc100m: The new data in multimedia research,
ceedings of CLEF 2018 Working Notes (2018). Communications of the ACM 59 (2016) 64–73.
[98] M. Malinowski, M. Fritz, Towards a visual turing challenge, arXiv [121] V. Ordonez, G. Kulkarni, T. Berg, Im2text: Describing images using
preprint arXiv:1410.8027 (2014). 1 million captioned photographs, Advances in neural information pro-
[99] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, C. L. cessing systems 24 (2011).
Zitnick, Microsoft coco captions: Data collection and evaluation server, [122] P. Young, A. Lai, M. Hodosh, J. Hockenmaier, From image descriptions
arXiv preprint arXiv:1504.00325 (2015). to visual denotations: New similarity metrics for semantic inference over
[100] P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, D. Parikh, Yin and event descriptions, Transactions of the Association for Computational
yang: Balancing and answering binary visual questions, in: Proceed- Linguistics 2 (2014) 67–78.

35
[123] J. Li, D. Li, C. Xiong, S. Hoi, Blip: Bootstrapping language-image [143] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset,
pre-training for unified vision-language understanding and generation, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, et al., The open images
in: International Conference on Machine Learning, PMLR, 2022, pp. dataset v4: Unified image classification, object detection, and visual re-
12888–12900. lationship detection at scale, International Journal of Computer Vision
[124] X. Chen, X. Wang, S. Changpinyo, A. Piergiovanni, P. Padlewski, 128 (2020) 1956–1981.
D. Salz, S. Goodman, A. Grycner, B. Mustafa, L. Beyer, et al., Pali: [144] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena,
A jointly-scaled multilingual language-image model, arXiv preprint Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with
arXiv:2209.06794 (2022). a unified text-to-text transformer, The Journal of Machine Learning Re-
[125] H. Song, L. Dong, W.-N. Zhang, T. Liu, F. Wei, Clip models are few- search 21 (2020) 5485–5551.
shot learners: Empirical studies on vqa and visual entailment, arXiv [145] W. Commons, Wikimedia commons, Retrieved June 2 (2012).
preprint arXiv:2203.07190 (2022). [146] M. Joshi, E. Choi, D. S. Weld, L. Zettlemoyer, Triviaqa: A large scale
[126] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, distantly supervised challenge dataset for reading comprehension, arXiv
A. Mensch, K. Millican, M. Reynolds, et al., Flamingo: a visual lan- preprint arXiv:1705.03551 (2017).
guage model for few-shot learning, Advances in Neural Information [147] F. Petroni, A. Piktus, A. Fan, P. Lewis, M. Yazdani, N. De Cao,
Processing Systems 35 (2022) 23716–23736. J. Thorne, Y. Jernite, V. Karpukhin, J. Maillard, et al., Kilt: a
[127] S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal, S. Ma, T. Lv, L. Cui, benchmark for knowledge intensive language tasks, arXiv preprint
O. K. Mohammed, Q. Liu, et al., Language is not all you need: Align- arXiv:2009.02252 (2020).
ing perception with language models, arXiv preprint arXiv:2302.14045 [148] J. C. Raven, J. Court, Raven’s progressive matrices, Western Psycholog-
(2023). ical Services Los Angeles, CA, 1938.
[128] S. Shah, A. Mishra, N. Yadati, P. P. Talukdar, Kvqa: Knowledge-aware [149] Z. Chen, J. Chen, Y. Geng, J. Z. Pan, Z. Yuan, H. Chen, Zero-shot visual
visual question answering, in: Proceedings of the AAAI Conference on question answering using knowledge graph, in: The Semantic Web–
Artificial Intelligence, volume 33-01, 2019, pp. 8876–8884. ISWC 2021: 20th International Semantic Web Conference, ISWC 2021,
[129] P. Lerner, O. Ferret, C. Guinaudeau, H. Le Borgne, R. Besançon, J. G. Virtual Event, October 24–28, 2021, Proceedings 20, Springer, 2021,
Moreno, J. Lovón Melgarejo, Viquae, a dataset for knowledge-based pp. 146–162.
visual question answering about named entities, in: Proceedings of the [150] A. Trott, C. Xiong, R. Socher, Interpretable counting for visual question
45th International ACM SIGIR Conference on Research and Develop- answering, arXiv preprint arXiv:1712.08697 (2017).
ment in Information Retrieval, 2022, pp. 3108–3120. [151] S. E. Kahou, V. Michalski, A. Atkinson, Á. Kádár, A. Trischler, Y. Ben-
[130] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledge- gio, Figureqa: An annotated figure dataset for visual reasoning, arXiv
base, Communications of the ACM 57 (2014) 78–85. preprint arXiv:1710.07300 (2017).
[131] P. Lu, L. Qiu, J. Chen, T. Xia, Y. Zhao, W. Zhang, Z. Yu, X. Liang, S.-C. [152] K. Kafle, B. Price, S. Cohen, C. Kanan, Dvqa: Understanding data visu-
Zhu, Iconqa: A new benchmark for abstract diagram understanding and alizations via question answering, in: Proceedings of the IEEE confer-
visual language reasoning, arXiv preprint arXiv:2110.13214 (2021). ence on computer vision and pattern recognition, 2018, pp. 5648–5656.
[132] C. Dancette, R. Cadene, D. Teney, M. Cord, Beyond question-based [153] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh,
biases: Assessing multimodal shortcut learning in visual question an- M. Rohrbach, Towards vqa models that can read, in: Proceedings of the
swering, in: Proceedings of the IEEE/CVF International Conference on IEEE/CVF Conference on Computer Vision and Pattern Recognition,
Computer Vision, 2021, pp. 1574–1583. 2019, pp. 8317–8326.
[133] J. Ma, P. Wang, D. Kong, Z. Wang, J. Liu, H. Pei, J. Zhao, Robust visual [154] R. Chaudhry, S. Shekhar, U. Gupta, P. Maneriker, P. Bansal, A. Joshi,
question answering: Datasets, methods, and future challenges, arXiv Leaf-qa: Locate, encode & attend for figure question answering, in:
preprint arXiv:2307.11471 (2023). Proceedings of the IEEE/CVF Winter Conference on Applications of
[134] D. Gao, R. Wang, S. Shan, X. Chen, Cric: A vqa dataset for compo- Computer Vision, 2020, pp. 3512–3521.
sitional reasoning on vision and commonsense, IEEE Transactions on [155] N. Siegel, Z. Horvitz, R. Levin, S. Divvala, A. Farhadi, Figureseer: Pars-
Pattern Analysis and Machine Intelligence 45 (2022) 5561–5578. ing result-figures in research papers, in: Computer Vision–ECCV 2016:
[135] J. Andreas, M. Rohrbach, T. Darrell, D. Klein, Neural module networks, 14th European Conference, Amsterdam, The Netherlands, October 11–
in: Proceedings of the IEEE conference on computer vision and pattern 14, 2016, Proceedings, Part VII 14, Springer, 2016, pp. 664–680.
recognition, 2016, pp. 39–48. [156] G. Zeng, Y. Zhang, Y. Zhou, X. Yang, Beyond ocr+ vqa: Involving ocr
[136] R. Liu, C. Liu, Y. Bai, A. L. Yuille, Clevr-ref+: Diagnosing visual into the flow for robust and accurate textvqa, in: Proceedings of the 29th
reasoning with referring expressions, in: Proceedings of the IEEE/CVF ACM International Conference on Multimedia, 2021, pp. 376–385.
conference on computer vision and pattern recognition, 2019, pp. 4185– [157] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, S. Fidler,
4194. Movieqa: Understanding stories in movies through question-answering,
[137] S. Kottur, J. M. Moura, D. Parikh, D. Batra, M. Rohrbach, Clevr-dialog: in: Proceedings of the IEEE conference on computer vision and pattern
A diagnostic dataset for multi-round reasoning in visual dialog, arXiv recognition, 2016, pp. 4631–4640.
preprint arXiv:1903.03166 (2019). [158] P. Yang, X. Wang, X. Duan, H. Chen, R. Hou, C. Jin, W. Zhu, Avqa: A
[138] L. Arras, A. Osman, W. Samek, Clevr-xai: A benchmark dataset for dataset for audio-visual question answering on videos, in: Proceedings
the ground truth evaluation of neural network explanations, Information of the 30th ACM International Conference on Multimedia, 2022, pp.
Fusion 81 (2022) 14–40. 3480–3491.
[139] L. Salewski, A. S. Koepke, H. P. Lensch, Z. Akata, Clevr-x: A visual [159] N. Garcia, M. Otani, C. Chu, Y. Nakashima, Knowit vqa: Answering
reasoning dataset for natural language explanations, in: International knowledge-based questions about videos, in: Proceedings of the AAAI
Workshop on Extending Explainable AI Beyond Deep Models and Clas- conference on artificial intelligence, volume 34, 2020, pp. 10826–10834.
sifiers, Springer, 2020, pp. 69–88. [160] J. Mun, P. Hongsuck Seo, I. Jung, B. Han, Marioqa: Answering ques-
[140] Z. Li, X. Wang, E. Stengel-Eskin, A. Kortylewski, W. Ma, tions by watching gameplay videos, in: Proceedings of the IEEE Inter-
B. Van Durme, A. L. Yuille, Super-clevr: A virtual benchmark to di- national Conference on Computer Vision, 2017, pp. 2867–2875.
agnose domain robustness in visual reasoning, in: Proceedings of the [161] Y. Li, Y. Song, L. Cao, J. Tetreault, L. Goldberg, A. Jaimes, J. Luo,
IEEE/CVF Conference on Computer Vision and Pattern Recognition, Tgif: A new dataset and benchmark on animated gif description, in:
2023, pp. 14963–14973. Proceedings of the IEEE Conference on Computer Vision and Pattern
[141] N. Bitton-Guetta, Y. Bitton, J. Hessel, L. Schmidt, Y. Elovici, Recognition, 2016, pp. 4641–4650.
G. Stanovsky, R. Schwartz, Breaking common sense: Whoops! a vision- [162] X. Zhang, C. Wu, Z. Zhao, W. Lin, Y. Zhang, Y. Wang, W. Xie, Pmc-
and-language benchmark of synthetic and compositional images, arXiv vqa: Visual instruction tuning for medical visual question answering,
preprint arXiv:2303.07274 (2023). arXiv preprint arXiv:2305.10415 (2023).
[142] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A [163] K. Yang, G.-S. Xia, Z. Liu, B. Du, W. Yang, M. Pelillo, L. Zhang, Asym-
large-scale hierarchical image database, in: 2009 IEEE conference on metric siamese networks for semantic change detection in aerial images,
computer vision and pattern recognition, Ieee, 2009, pp. 248–255. IEEE Transactions on Geoscience and Remote Sensing 60 (2021) 1–18.

36
[164] C. Cortes, V. Vapnik, Support-vector networks, Machine learning 20 [188] Y. Hirota, N. Garcia, M. Otani, C. Chu, Y. Nakashima, I. Taniguchi,
(1995) 273–297. T. Onoye, A picture may be worth a hundred words for visual question
[165] A. Haar, Zur theorie der orthogonalen funktionensysteme, Georg- answering, arXiv preprint arXiv:2106.13445 (2021).
August-Universitat, Gottingen., 1909. [189] H. Xue, Y. Huang, B. Liu, H. Peng, J. Fu, H. Li, J. Luo, Probing inter-
[166] P. Viola, M. Jones, Rapid object detection using a boosted cascade of modality: Visual parsing with self-attention for vision-and-language
simple features, in: Proceedings of the 2001 IEEE computer society pre-training, Advances in Neural Information Processing Systems 34
conference on computer vision and pattern recognition. CVPR 2001, (2021).
volume 1, Ieee, 2001, pp. I–I. [190] G. Luo, Y. Zhou, X. Sun, Y. Wang, L. Cao, Y. Wu, F. Huang, R. Ji, To-
[167] N. Dalal, B. Triggs, Histograms of oriented gradients for human detec- wards lightweight transformer via group-wise transformation for vision-
tion, in: 2005 IEEE computer society conference on computer vision and-language tasks, IEEE Transactions on Image Processing (2022).
and pattern recognition (CVPR’05), volume 1, Ieee, 2005, pp. 886–893. [191] G. A. Miller, W. G. Charles, Contextual correlates of semantic similarity,
[168] D. G. Lowe, Object recognition from local scale-invariant features, in: Language and cognitive processes 6 (1991) 1–28.
Proceedings of the seventh IEEE international conference on computer [192] C. Eckart, G. Young, The approximation of one matrix by another of
vision, volume 2, Ieee, 1999, pp. 1150–1157. lower rank, Psychometrika 1 (1936) 211–218.
[169] Z.-Q. Hong, Algebraic feature extraction of image for recognition, Pat- [193] W. Xu, A. Rudnicky, Can artificial neural networks learn language mod-
tern recognition 24 (1991) 211–219. els?, Sixth International Conference on Spoken Language Processing
[170] A. Hyvarinen, E. Oja, P. Hoyer, J. Hurri, Image feature extraction by (2000) 202–205.
sparse coding and independent component analysis, in: Proceedings. [194] Y. Bengio, R. Ducharme, P. Vincent, A neural probabilistic language
Fourteenth International Conference on Pattern Recognition (Cat. No. model, Advances in Neural Information Processing Systems 13 (2000).
98EX170), volume 2, IEEE, 1998, pp. 1268–1273. [195] D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al., Learning internal
[171] K. Fukushima, S. Miyake, Neocognitron: A self-organizing neural net- representations by error propagation, 1985.
work model for a mechanism of visual pattern recognition, in: Compe- [196] T. Mikolov, W.-t. Yih, G. Zweig, Linguistic regularities in continuous
tition and cooperation in neural nets, Springer, 1982, pp. 267–285. space word representations, in: Proceedings of the 2013 conference of
[172] D. Ciregan, U. Meier, J. Schmidhuber, Multi-column deep neural net- the north american chapter of the association for computational linguis-
works for image classification, in: 2012 IEEE conference on computer tics: Human language technologies, 2013, pp. 746–751.
vision and pattern recognition, IEEE, 2012, pp. 3642–3649. [197] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word
[173] D. A. Pomerleau, Alvinn: An autonomous land vehicle in a neural net- representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
work, Advances in neural information processing systems 1 (1988). [198] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of
[174] A. Sarlashkar, M. Bodruzzaman, M. Malkani, Feature extraction using gated recurrent neural networks on sequence modeling, arXiv preprint
wavelet transform for neural network based image classification, in: arXiv:1412.3555 (2014).
Proceedings of Thirtieth Southeastern Symposium on System Theory, [199] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural com-
IEEE, 1998, pp. 412–416. putation 9 (1997) 1735–1780.
[175] B. Lerner, H. Guterman, M. Aladjem, I. h. Dinstein, A comparative [200] Z. Yang, N. Garcia, C. Chu, M. Otani, Y. Nakashima, H. Takemura, A
study of neural network based feature extraction paradigms, Pattern comparative study of language transformers for video question answer-
Recognition Letters 20 (1999) 7–14. ing, Neurocomputing 445 (2021) 121–133.
[176] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with [201] A. F. Biten, R. Litman, Y. Xie, S. Appalaraju, R. Manmatha,
deep convolutional neural networks, Advances in neural information Latr: Layout-aware transformer for scene-text vqa, arXiv preprint
processing systems 25 (2012). arXiv:2112.12494 (2021).
[177] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning [202] Z. Yang, Y. Lu, J. Wang, X. Yin, D. Florencio, L. Wang, C. Zhang,
applied to document recognition, Proceedings of the IEEE 86 (1998) L. Zhang, J. Luo, Tap: Text-aware pre-training for text-vqa and text-
2278–2324. caption, in: Proceedings of the IEEE/CVF Conference on Computer
[178] K. Simonyan, A. Zisserman, Very deep convolutional networks for Vision and Pattern Recognition, 2021, pp. 8751–8761.
large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014). [203] K. Chen, J. Wang, L.-C. Chen, H. Gao, W. Xu, R. Nevatia, Abc-cnn:
[179] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image An attention based convolutional neural network for visual question an-
recognition, in: Proceedings of the IEEE conference on computer vision swering, arXiv preprint arXiv:1511.05960 (2015).
and pattern recognition, 2016, pp. 770–778. [204] B. Zhou, Y. Tian, S. Sukhbaatar, A. Szlam, R. Fergus, Simple baseline
[180] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Er- for visual question answering, arXiv preprint arXiv:1512.02167 (2015).
han, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, [205] A. Jabri, A. Joulin, L. van der Maaten, Revisiting visual question
in: Proceedings of the IEEE conference on computer vision and pattern answering baselines, in: B. Leibe, J. Matas, N. Sebe, M. Welling
recognition, 2015, pp. 1–9. (Eds.), Computer Vision – ECCV 2016, Springer International Publish-
[181] S. Bozinovski, A. Fulgosi, The influence of pattern similarity and trans- ing, Cham, 2016, pp. 727–739.
fer learning upon training of a base perceptron b2, in: Proceedings of [206] J.-H. Huang, C. D. Dao, M. Alfadly, B. Ghanem, A novel framework for
Symposium Informatica, volume 3, 1976, pp. 121–126. robustness analysis of visual qa models, in: Proceedings of the AAAI
[182] R. Girshick, Fast r-cnn, in: Proceedings of the IEEE international con- Conference on Artificial Intelligence, volume 33-01, 2019, pp. 8449–
ference on computer vision, 2015, pp. 1440–1448. 8456.
[183] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time [207] D. Teney, P. Anderson, X. He, A. Van Den Hengel, Tips and tricks for
object detection with region proposal networks, Advances in neural in- visual question answering: Learnings from the 2017 challenge, in: Pro-
formation processing systems 28 (2015). ceedings of the IEEE conference on computer vision and pattern recog-
[184] K. Kafle, C. Kanan, Answer-type prediction for visual question answer- nition, 2018, pp. 4223–4232.
ing, in: Proceedings of the IEEE conference on computer vision and [208] Z. Yang, X. He, J. Gao, L. Deng, A. Smola, Stacked attention networks
pattern recognition, 2016, pp. 4976–4984. for image question answering, in: Proceedings of the IEEE conference
[185] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, on computer vision and pattern recognition, 2016, pp. 21–29.
T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., [209] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, M. Rohrbach,
An image is worth 16x16 words: Transformers for image recognition at Multimodal compact bilinear pooling for visual question answering
scale, arXiv preprint arXiv:2010.11929 (2020). and visual grounding, CoRR abs/1606.01847 (2016). URL: http:
[186] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, B. Guo, Swin //arxiv.org/abs/1606.01847. arXiv:1606.01847.
transformer: Hierarchical vision transformer using shifted windows, in: [210] H. Ben-Younes, R. Cadene, M. Cord, N. Thome, Mutan: Multimodal
Proceedings of the IEEE/CVF International Conference on Computer tucker fusion for visual question answering, in: Proceedings of the IEEE
Vision, 2021, pp. 10012–10022. international conference on computer vision, 2017, pp. 2612–2620.
[187] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, S. Xie, A con- [211] L. Ma, Z. Lu, H. Li, Learning to answer questions from image using
vnet for the 2020s, arXiv preprint arXiv:2201.03545 (2022). convolutional neural network, in: Proceedings of the AAAI Conference

37
on Artificial Intelligence, volume 30 of AAAI’16, AAAI Press, 2016, p. [233] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training
3567–3573. of deep bidirectional transformers for language understanding, arXiv
[212] H. Xu, K. Saenko, Ask, attend and answer: Exploring question-guided preprint arXiv:1810.04805 (2018).
spatial attention for visual question answering, in: Computer Vision– [234] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis,
ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pre-
October 11–14, 2016, Proceedings, Part VII 14, Springer, 2016, pp. training approach, arXiv preprint arXiv:1907.11692 (2019).
451–466. [235] J. Lu, D. Batra, D. Parikh, S. Lee, Vilbert: Pretraining task-agnostic
[213] H. Noh, P. H. Seo, B. Han, Image question answering using convolu- visiolinguistic representations for vision-and-language tasks, Advances
tional neural network with dynamic parameter prediction, in: Proceed- in neural information processing systems 32 (2019).
ings of the IEEE conference on computer vision and pattern recognition, [236] W. L. Taylor, “cloze procedure”: A new tool for measuring readability,
2016, pp. 30–38. Journalism quarterly 30 (1953) 415–433.
[214] K. J. Shih, S. Singh, D. Hoiem, Where to look: Focus regions for vi- [237] Y.-C. Chen, L. Li, L. Yu, A. El Kholy, F. Ahmed, Z. Gan, Y. Cheng,
sual question answering, in: Proceedings of the IEEE conference on J. Liu, Uniter: Universal image-text representation learning, in: Euro-
computer vision and pattern recognition, 2016, pp. 4613–4621. pean conference on computer vision, Springer, 2020, pp. 104–120.
[215] J.-H. Kim, S.-W. Lee, D. Kwak, M.-O. Heo, J. Kim, J.-W. Ha, B.-T. [238] J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, S. C. H. Hoi, Align
Zhang, Multimodal residual learning for visual qa, Advances in neural before fuse: Vision and language representation learning with momen-
information processing systems 29 (2016). tum distillation, Advances in neural information processing systems 34
[216] H. Nam, J.-W. Ha, J. Kim, Dual attention networks for multimodal (2021) 9694–9705.
reasoning and matching, in: Proceedings of the IEEE conference on [239] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: A neural
computer vision and pattern recognition, 2017, pp. 299–307. image caption generator, in: Proceedings of the IEEE conference on
[217] J.-H. Kim, K.-W. On, W. Lim, J. Kim, J.-W. Ha, B.-T. Zhang, Hadamard computer vision and pattern recognition, 2015, pp. 3156–3164.
product for low-rank bilinear pooling, arXiv preprint arXiv:1610.04325 [240] N. Xie, F. Lai, D. Doran, A. Kadav, Visual entailment: A novel task
(2016). for fine-grained image understanding, arXiv preprint arXiv:1901.06706
[218] J. Lu, J. Yang, D. Batra, D. Parikh, Hierarchical question-image co- (2019).
attention for visual question answering, Advances in neural information [241] H. Tan, M. Bansal, Lxmert: Learning cross-modality encoder represen-
processing systems 29 (2016). tations from transformers, arXiv preprint arXiv:1908.07490 (2019).
[219] C. Xiong, S. Merity, R. Socher, Dynamic memory networks for visual [242] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, J. Dai, Vl-bert: Pre-
and textual question answering, in: International conference on machine training of generic visual-linguistic representations, arXiv preprint
learning, PMLR, 2016, pp. 2397–2406. arXiv:1908.08530 (2019).
[220] Q. Wu, C. Shen, P. Wang, A. Dick, A. Van Den Hengel, Image cap- [243] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. Corso, J. Gao, Unified vision-
tioning and visual question answering based on attributes and external language pre-training for image captioning and vqa, in: Proceedings of
knowledge, IEEE transactions on pattern analysis and machine intelli- the AAAI conference on artificial intelligence, volume 34-07, 2020, pp.
gence 40 (2017) 1367–1381. 13041–13049.
[221] D. Yu, J. Fu, T. Mei, Y. Rui, Multi-level attention networks for visual [244] W. Kim, B. Son, I. Kim, Vilt: Vision-and-language transformer with-
question answering, in: Proceedings of the IEEE conference on com- out convolution or region supervision, in: International Conference on
puter vision and pattern recognition, 2017, pp. 4709–4717. Machine Learning, PMLR, 2021, pp. 5583–5594.
[222] V. Kazemi, A. Elqursh, Show, ask, attend, and answer: A strong baseline [245] H. Bao, W. Wang, L. Dong, Q. Liu, O. K. Mohammed, K. Aggarwal,
for visual question answering, arXiv preprint arXiv:1704.03162 (2017). S. Som, S. Piao, F. Wei, Vlmo: Unified vision-language pre-training
[223] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, with mixture-of-modality-experts, Advances in Neural Information Pro-
L. Zhang, Bottom-up and top-down attention for image captioning and cessing Systems 35 (2022) 32897–32912.
visual question answering, in: Proceedings of the IEEE conference on [246] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey,
computer vision and pattern recognition, 2018, pp. 6077–6086. M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., Google’s neural ma-
[224] Z. Yu, J. Yu, C. Xiang, J. Fan, D. Tao, Beyond bilinear: Generalized chine translation system: Bridging the gap between human and machine
multimodal factorized high-order pooling for visual question answering, translation, arXiv preprint arXiv:1609.08144 (2016).
IEEE transactions on neural networks and learning systems 29 (2018) [247] P. Wang, A. Yang, R. Men, J. Lin, S. Bai, Z. Li, J. Ma, C. Zhou, J. Zhou,
5947–5959. H. Yang, Ofa: Unifying architectures, tasks, and modalities through
[225] D.-K. Nguyen, T. Okatani, Improved fusion of visual and language rep- a simple sequence-to-sequence learning framework, in: International
resentations by dense symmetric co-attention for visual question answer- Conference on Machine Learning, PMLR, 2022, pp. 23318–23340.
ing, in: Proceedings of the IEEE conference on computer vision and [248] J. Lu, C. Clark, R. Zellers, R. Mottaghi, A. Kembhavi, Unified-io: A uni-
pattern recognition, 2018, pp. 6087–6096. fied model for vision, language, and multi-modal tasks, arXiv preprint
[226] J.-H. Kim, J. Jun, B.-T. Zhang, Bilinear attention networks, Advances arXiv:2206.08916 (2022).
in neural information processing systems 31 (2018). [249] T. Kudo, J. Richardson, Sentencepiece: A simple and language inde-
[227] Z. Yu, J. Yu, Y. Cui, D. Tao, Q. Tian, Deep modular co-attention net- pendent subword tokenizer and detokenizer for neural text processing,
works for visual question answering, in: Proceedings of the IEEE/CVF arXiv preprint arXiv:1808.06226 (2018).
conference on computer vision and pattern recognition, 2019, pp. 6281– [250] P. Esser, R. Rombach, B. Ommer, Taming transformers for high-
6290. resolution image synthesis, in: Proceedings of the IEEE/CVF con-
[228] J. Ba, V. Mnih, K. Kavukcuoglu, Multiple object recognition with visual ference on computer vision and pattern recognition, 2021, pp. 12873–
attention, arXiv preprint arXiv:1412.7755 (2014). 12883.
[229] J. Jin, K. Fu, R. Cui, F. Sha, C. Zhang, Aligning where to see and what to [251] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant,
tell: image caption with region-based attention and scene factorization, A. Barua, C. Raffel, mt5: A massively multilingual pre-trained text-to-
arXiv preprint arXiv:1506.06272 (2015). text transformer, arXiv preprint arXiv:2010.11934 (2020).
[230] L. Peng, Y. Yang, Y. Bin, N. Xie, F. Shen, Y. Ji, X. Xu, Word-to-region [252] S. Xie, R. Girshick, P. Dollár, Z. Tu, K. He, Aggregated residual trans-
attention network for visual question answering, Multimedia Tools and formations for deep neural networks, in: Proceedings of the IEEE con-
Applications 78 (2019) 3843–3858. ference on computer vision and pattern recognition, 2017, pp. 1492–
[231] M. Malinowski, C. Doersch, A. Santoro, P. Battaglia, Learning visual 1500.
question answering by bootstrapping hard attention, in: Proceedings of [253] I. Ilievski, S. Yan, J. Feng, A focused dynamic attention model for visual
the European Conference on Computer Vision (ECCV), 2018, pp. 3–20. question answering, arXiv preprint arXiv:1604.01485 (2016).
[232] T. Rahman, S.-H. Chou, L. Sigal, G. Carenini, An improved attention [254] J. Lu, X. Lin, D. Batra, D. Parikh, Deeper lstm and normalized cnn
for visual question answering, in: Proceedings of the IEEE/CVF Con- visual question answering model, GitHub repository 6 (2015).
ference on Computer Vision and Pattern Recognition, 2021, pp. 1653– [255] Q. Wu, P. Wang, C. Shen, A. Dick, A. Van Den Hengel, Ask me any-
1662. thing: Free-form visual question answering based on knowledge from

38
external sources, in: Proceedings of the IEEE conference on computer A massively multilingual multimodal evaluation dataset, arXiv preprint
vision and pattern recognition, 2016, pp. 4622–4630. arXiv:2205.12522 (2022).
[256] P. Lu, H. Li, W. Zhang, J. Wang, X. Wang, Co-attending free-form re- [277] J. J. Lau, S. Gayen, A. Ben Abacha, D. Demner-Fushman, A dataset
gions and detections with multi-modal multiplicative feature embedding of clinically generated visual questions and answers about radiology im-
for visual question answering, in: Proceedings of the AAAI conference ages, Scientific data 5 (2018) 1–10.
on artificial intelligence, volume 32-1, 2018, pp. 7218—-7225. [278] X. He, Y. Zhang, L. Mou, E. Xing, P. Xie, Pathvqa: 30000+ questions
[257] P. Wang, Q. Wu, C. Shen, A. van den Hengel, The vqa-machine: Learn- for medical visual question answering, 2020. arXiv:2003.10286.
ing how to use existing vision algorithms to answer new questions, in: [279] K. Zhang, J. Yu, Z. Yan, Y. Liu, E. Adhikarla, S. Fu, X. Chen, C. Chen,
Proceedings of the IEEE Conference on Computer Vision and Pattern Y. Zhou, X. Li, et al., Biomedgpt: A unified and generalist biomedical
Recognition, 2017, pp. 1173–1182. generative pre-trained transformer for vision, language, and multimodal
[258] A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, tasks, arXiv preprint arXiv:2305.17100 (2023).
V. Zhong, R. Paulus, R. Socher, Ask me anything: Dynamic memory [280] C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann,
networks for natural language processing, in: International conference H. Poon, J. Gao, Llava-med: Training a large language-and-vision as-
on machine learning, PMLR, 2016, pp. 1378–1387. sistant for biomedicine in one day, arXiv preprint arXiv:2306.00890
[259] P. Gao, H. Li, S. Li, P. Lu, Y. Li, S. C. Hoi, X. Wang, Question-guided (2023).
hybrid convolution for visual question answering, in: Proceedings of the [281] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, J.-R. Wen, Evaluating
European Conference on Computer Vision (ECCV), 2018, pp. 469–485. object hallucination in large vision-language models, arXiv preprint
[260] J. Andreas, M. Rohrbach, T. Darrell, D. Klein, Learning to arXiv:2305.10355 (2023).
compose neural networks for question answering, arXiv preprint [282] T. Gupta, R. Marten, A. Kembhavi, D. Hoiem, Grit: General robust
arXiv:1601.01705 (2016). image task benchmark, arXiv preprint arXiv:2204.13653 (2022).
[261] Z. Huang, Z. Zeng, B. Liu, D. Fu, J. Fu, Pixel-bert: Aligning im- [283] J.-H. Huang, M. Alfadly, B. Ghanem, M. Worring, Assessing the robust-
age pixels with text by deep multi-modal transformers, arXiv preprint ness of visual question answering, ArXiv abs/1912.01452 (2019). URL:
arXiv:2004.00849 (2020). https://api.semanticscholar.org/CorpusID:208548469.
[262] Z. Gan, Y.-C. Chen, L. Li, C. Zhu, Y. Cheng, J. Liu, Large-scale ad- [284] C. E. Jimenez, O. Russakovsky, K. Narasimhan, Carets: A con-
versarial training for vision-and-language representation learning, Ad- sistency and robustness evaluative test suite for vqa, arXiv preprint
vances in Neural Information Processing Systems 33 (2020) 6616–6628. arXiv:2203.07613 (2022).
[263] W. Li, C. Gao, G. Niu, X. Xiao, H. Liu, J. Liu, H. Wu, H. Wang, Unimo: [285] Y. Zhao, T. Pang, C. Du, X. Yang, C. Li, N.-M. Cheung, M. Lin, On
Towards unified-modal understanding and generation via cross-modal evaluating adversarial robustness of large vision-language models, arXiv
contrastive learning, arXiv preprint arXiv:2012.15409 (2020). preprint arXiv:2305.16934 (2023).
[264] P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, J. Gao, [286] W.-L. Chao, H. Hu, F. Sha, Cross-dataset adaptation for visual question
Vinvl: Revisiting visual representations in vision-language models, in: answering, in: Proceedings of the IEEE Conference on Computer Vision
Proceedings of the IEEE/CVF conference on computer vision and pat- and Pattern Recognition, 2018, pp. 5716–5725.
tern recognition, 2021, pp. 5579–5588. [287] Q. Li, J. Fu, D. Yu, T. Mei, J. Luo, Tell-and-answer: Towards ex-
[265] Z.-Y. Dou, Y. Xu, Z. Gan, J. Wang, S. Wang, L. Wang, C. Zhu, P. Zhang, plainable visual question answering using attributes and captions, arXiv
L. Yuan, N. Peng, et al., An empirical study of training end-to-end preprint arXiv:1801.09041 (2018).
vision-and-language transformers, in: Proceedings of the IEEE/CVF [288] Y. Goyal, A. Mohapatra, D. Parikh, D. Batra, Towards transparent ai
Conference on Computer Vision and Pattern Recognition, 2022, pp. systems: Interpreting visual question answering models, arXiv preprint
18166–18176. arXiv:1608.08974 (2016).
[266] J. Wang, Z. Yang, X. Hu, L. Li, K. Lin, Z. Gan, Z. Liu, C. Liu, L. Wang, [289] B. Goertzel, C. Pennachin, Artificial general intelligence, volume 2,
Git: A generative image-to-text transformer for vision and language, Springer, 2007.
arXiv preprint arXiv:2205.14100 (2022). [290] M. R. Farazi, S. H. Khan, N. Barnes, From known to the unknown:
[267] Z. Wang, J. Yu, A. W. Yu, Z. Dai, Y. Tsvetkov, Y. Cao, Simvlm: Simple Transferring knowledge to answer questions about novel visual and se-
visual language model pretraining with weak supervision, arXiv preprint mantic concepts, Image and Vision Computing 103 (2020) 103985.
arXiv:2108.10904 (2021). [291] W. Jin, Y. Cheng, Y. Shen, W. Chen, X. Ren, A good prompt is worth
[268] L. Yuan, D. Chen, Y.-L. Chen, N. Codella, X. Dai, J. Gao, H. Hu, millions of parameters: Low-resource prompt-based learning for vision-
X. Huang, B. Li, C. Li, et al., Florence: A new foundation model for language models, arXiv preprint arXiv:2110.08484 (2021).
computer vision, arXiv preprint arXiv:2111.11432 (2021). [292] Y.-S. Chuang, C.-L. Liu, H.-Y. Lee, L.-s. Lee, Speechbert: An audio-
[269] C. Li, H. Xu, J. Tian, W. Wang, M. Yan, B. Bi, J. Ye, H. Chen, G. Xu, and-text jointly learned language model for end-to-end spoken question
Z. Cao, et al., mplug: Effective and efficient vision-language learning by answering, arXiv preprint arXiv:1910.11559 (2019).
cross-modal skip-connections, arXiv preprint arXiv:2205.12005 (2022). [293] K. Drossos, S. Lipping, T. Virtanen, Clotho: An audio captioning
[270] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, Y. Wu, dataset, in: ICASSP 2020-2020 IEEE International Conference on
Coca: Contrastive captioners are image-text foundation models, arXiv Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp.
preprint arXiv:2205.01917 (2022). 736–740.
[271] J. Li, D. Li, S. Savarese, S. Hoi, Blip-2: Bootstrapping language-image [294] V. Iashin, E. Rahtu, Multi-modal dense video captioning, in: Proceed-
pre-training with frozen image encoders and large language models, ings of the IEEE/CVF conference on computer vision and pattern recog-
arXiv preprint arXiv:2301.12597 (2023). nition workshops, 2020, pp. 958–959.
[272] P. Wang, S. Wang, J. Lin, S. Bai, X. Zhou, J. Zhou, X. Wang, C. Zhou, [295] M. Maaz, H. Rasheed, S. Khan, F. S. Khan, Video-chatgpt: Towards de-
One-peace: Exploring one general representation model toward unlim- tailed video understanding via large vision and language models, arXiv
ited modalities, arXiv preprint arXiv:2305.11172 (2023). preprint arXiv:2306.05424 (2023).
[273] W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal, [296] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, L. Deng, Se-
O. K. Mohammed, S. Singhal, S. Som, et al., Image as a foreign lan- mantic compositional networks for visual captioning, in: Proceedings of
guage: Beit pretraining for all vision and vision-language tasks, arXiv the IEEE conference on computer vision and pattern recognition, 2017,
preprint arXiv:2208.10442 (2022). pp. 5630–5639.
[274] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for au- [297] S. Gao, Z. Chen, G. Chen, W. Wang, T. Lu, Champion solution for
tomatic evaluation of machine translation, in: Proceedings of the 40th the wsdm2023 toloka vqa challenge, arXiv preprint arXiv:2301.09045
annual meeting of the Association for Computational Linguistics, 2002, (2023).
pp. 311–318. [298] F. Liu, T. Xiang, T. M. Hospedales, W. Yang, C. Sun, ivqa: Inverse
[275] A. J. Wang, K. Q. Lin, D. J. Zhang, S. W. Lei, M. Z. Shou, Too visual question answering, in: Proceedings of the IEEE Conference on
large; data reduction for vision-language pre-training, arXiv preprint Computer Vision and Pattern Recognition, 2018, pp. 8611–8619.
arXiv:2305.20087 (2023). [299] N. Mostafazadeh, I. Misra, J. Devlin, M. Mitchell, X. He, L. Vander-
[276] A. V. Thapliyal, J. Pont-Tuset, X. Chen, R. Soricut, Crossmodal-3600: wende, Generating natural questions about an image, arXiv preprint

39
arXiv:1603.06059 (2016). [321] P. Kurp, Green computing, Communications of the ACM 51 (2008)
[300] K.-H. Zeng, T.-H. Chen, C.-Y. Chuang, Y.-H. Liao, J. C. Niebles, 11–13.
M. Sun, Leveraging video descriptions to learn video question answer- [322] T. Ahmad, D. Zhang, C. Huang, H. Zhang, N. Dai, Y. Song, H. Chen,
ing, in: Proceedings of the AAAI Conference on Artificial Intelligence, Artificial intelligence in sustainable energy industry: Status quo, chal-
volume 31-1, 2017, pp. 4334—-4340. lenges and opportunities, Journal of Cleaner Production 289 (2021)
[301] S. Changpinyo, D. Kukliansky, I. Szpektor, X. Chen, N. Ding, R. Sori- 125834.
cut, All you may need for vqa are image captions, arXiv preprint
arXiv:2205.01883 (2022).
[302] H.-N. Hu, Y.-C. Lin, M.-Y. Liu, H.-T. Cheng, Y.-J. Chang, M. Sun, Deep
360 pilot: Learning a deep agent for piloting through 360deg sports
videos, in: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2017, pp. 3451–3460.
[303] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, A. Farhadi,
Iqa: Visual question answering in interactive environments, in: Proceed-
ings of the IEEE conference on computer vision and pattern recognition,
2018, pp. 4089–4098.
[304] M. Zhuge, D. Gao, D.-P. Fan, L. Jin, B. Chen, H. Zhou, M. Qiu, L. Shao,
Kaleido-bert: Vision-language pre-training on fashion domain, in: Pro-
ceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition, 2021, pp. 12647–12657.
[305] D. Ghosal, M. S. Akhtar, D. Chauhan, S. Poria, A. Ekbal, P. Bhat-
tacharyya, Contextual inter-modal attention for multi-modal sentiment
analysis, in: proceedings of the 2018 conference on empirical methods
in natural language processing, 2018, pp. 3454–3466.
[306] K. Wang, Q. Yin, W. Wang, S. Wu, L. Wang, A comprehensive survey
on cross-modal retrieval, arXiv preprint arXiv:1607.06215 (2016).
[307] F. Chen, X. Chen, J. Shi, D. Zhang, J. Chang, Q. Tian, Hivlp: Hierar-
chical vision-language pre-training for fast image-text retrieval, arXiv
preprint arXiv:2205.12105 (2022).
[308] L. Specia, S. Frank, K. Sima’An, D. Elliott, A shared task on multimodal
machine translation and crosslingual image description, in: Proceedings
of the First Conference on Machine Translation: Volume 2, Shared Task
Papers, 2016, pp. 543–553.
[309] W. Shi, M. Zhang, R. Zhang, S. Chen, Z. Zhan, Change detection based
on artificial intelligence: State-of-the-art and challenges, Remote Sens-
ing 12 (2020) 1688.
[310] H. Yun, Y. Yu, W. Yang, K. Lee, G. Kim, Pano-avqa: Grounded audio-
visual question answering on 360deg videos, in: Proceedings of the
IEEE/CVF International Conference on Computer Vision, 2021, pp.
2031–2041.
[311] P. Sharma, N. Ding, S. Goodman, R. Soricut, Conceptual captions: A
cleaned, hypernymed, image alt-text dataset for automatic image cap-
tioning, in: Proceedings of the 56th Annual Meeting of the Associa-
tion for Computational Linguistics (Volume 1: Long Papers), 2018, pp.
2556–2565.
[312] J. Pfeiffer, G. Geigle, A. Kamath, J.-M. O. Steitz, S. Roth, I. Vulić,
I. Gurevych, xgqa: Cross-lingual visual question answering, arXiv
preprint arXiv:2109.06082 (2021).
[313] S. Changpinyo, L. Xue, I. Szpektor, A. V. Thapliyal, J. Amelot, X. Chen,
R. Soricut, Towards multi-lingual visual question answering, arXiv
preprint arXiv:2209.05401 (2022).
[314] C. Liu, J. Pfeiffer, A. Korhonen, I. Vulic, I. Gurevych, Delving
deeper into cross-lingual visual question answering, arXiv preprint
arXiv:2202.07630 (2022).
[315] Z. Chen, Y. Duan, W. Wang, J. He, T. Lu, J. Dai, Y. Qiao, Vision trans-
former adapter for dense predictions, arXiv preprint arXiv:2205.08534
(2022).
[316] S. Yin, C. Fu, S. Zhao, K. Li, X. Sun, T. Xu, E. Chen, A survey on
multimodal large language models, arXiv preprint arXiv:2306.13549
(2023).
[317] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, arXiv preprint
arXiv:2304.08485 (2023).
[318] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, P. Gao, Y. Qiao,
Llama-adapter: Efficient fine-tuning of language models with zero-init
attention, arXiv preprint arXiv:2303.16199 (2023).
[319] P. Gao, J. Han, R. Zhang, Z. Lin, S. Geng, A. Zhou, W. Zhang, P. Lu,
C. He, X. Yue, et al., Llama-adapter v2: Parameter-efficient visual in-
struction model, arXiv preprint arXiv:2304.15010 (2023).
[320] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, K. Saenko, Ob-
ject hallucination in image captioning, arXiv preprint arXiv:1809.02156
(2018).

40
