Joint Answering and Explanation for Visual Commonsense Reasoning
IEEE Transactions on Image Processing, Vol. 32, 2023
Abstract— Visual Commonsense Reasoning (VCR), deemed a challenging extension of Visual Question Answering (VQA), pursues a higher level of visual comprehension. VCR includes two complementary processes: question answering over a given image and rationale inference that explains the chosen answer. Over the years, a variety of VCR methods have pushed forward advancements on the benchmark dataset. Despite the significance of these methods, they often treat the two processes separately and hence decompose VCR into two unrelated VQA instances. As a result, the pivotal connection between question answering and rationale inference is broken, rendering existing efforts less faithful to visual reasoning. To study this issue, we perform in-depth empirical explorations in terms of both language shortcuts and generalization capability. Based on our findings, we then propose a plug-and-play knowledge-distillation-enhanced framework to couple the question answering and rationale inference processes. The key contribution lies in the introduction of a new branch, which serves as a relay to bridge the two processes. Given that our framework is model-agnostic, we apply it to existing popular baselines and validate its effectiveness on the benchmark dataset. As demonstrated in the experimental results, when equipped with our method, these baselines all achieve consistent and significant performance improvements, clearly verifying the viability of coupling the two processes.

Index Terms— Visual commonsense reasoning, language shortcut, knowledge distillation.

I. INTRODUCTION

… prediction. In light of this, the task of Visual Commonsense Reasoning (VCR) [3] has recently been introduced to bridge this gap. Beyond answering cognition-level questions (Q→A) as canonical VQA does, VCR further requires predicting a rationale for the right answer option (QA→R), as shown in Figure 1.

VCR is more challenging than VQA in roughly two aspects: 1) On the data side, the images in the VCR dataset describe more sophisticated scenes (e.g., social interactions or mental states). Therefore, the collected questions are rather difficult and often demand high-level visual reasoning capabilities (e.g., why or how). 2) On the task side, it is difficult to holistically predict both the right answer and the rationale. VCR models should first predict the right answer, based on which the acceptable rationale can be further inferred from a few candidates. As question answering about such complex scenes has already proved non-trivial, simultaneously inferring the right rationale in VCR leads to even more difficulties.

In order to address the above research challenges, several prior efforts have been dedicated to VCR over the past few years. The initial attempts contribute to designing specific architectures to approach VCR, such as R2C [3] and CCN [4]. In addition, recent BERT-based pre-training approaches, such as ViLBERT [5] and ERNIE-ViL [6], are reckoned as a better …
… architectures [3], [4], [26], [27]. For instance, R2C [3] adopts a three-step fashion: grounding texts with respect to the involved objects, contextualizing the answer with the corresponding question and objects, and reasoning over the shared representation. Inspired by the neuron connectivity of brains, CCN [4] dynamically models the visual neuron connectivity, which is contextualized by the queries and responses. HGL [26] leverages dual heterogeneous graphs, i.e., a vision-to-answer graph and a question-to-answer graph, to seamlessly bridge vision and language. Follow-up approaches explore more fine-grained cues, such as ECMR [27], which incorporates syntactic information into visual reasoning [28], and MLCC [29], which uses counterfactual thinking to generate informative samples.

Thereafter, BERT-based pre-training approaches have been extensively explored in the vision-and-language domain. Most of these methods employ a pretrain-then-finetune scheme and achieve significant performance improvements on VQA benchmarks including VCR [5], [30], [31]. The models are often first pre-trained on large-scale vision-and-language datasets (such as Conceptual Captions [7]), and then fine-tuned on the downstream VCR task. For instance, some models are pre-trained on image-text datasets to align the visual-linguistic clues through single-stream architectures (e.g., VL-BERT [31], UNITER [30], 12-in-1 [32]). MERLOT RESERVE [33] introduces more modalities, i.e., sound and video, into cross-modal pre-training, and significantly outperforms other methods.

However, one limitation still prevents these methods from further advancement, namely that the Q→A and QA→R processes are tackled independently. Such a strategy turns answer prediction and rationale inference into two independent VQA tasks. In this work, we propose to address this issue by combining the two processes.

C. Knowledge Distillation

The past few years have witnessed the noticeable development of KD [34], [35]. As an effective model compression tool, KD transfers knowledge from a cumbersome large network (the teacher) to a more compact network (the student). Based on its application scope, previous KD methods can be roughly categorized into two groups: logit-based and feature-based. The logit-based methods [34], [36] encourage the student to imitate the output of a teacher model. For example, the vanilla KD utilizes the softened logits from a pre-trained teacher network as extra supervision to instruct the student [34]. In contrast, feature-based methods [35], [37], [38] attempt to transfer knowledge via the intermediate features of the two networks. FitNet [39] directly aligns the embeddings of each input. Attention Transfer [40] extends FitNet from embeddings to attention matrices at different levels of feature maps. To close the performance gap between the teacher and student, RCO [41] presents route-constrained hint learning, which supervises the student with the outputs of hint layers from the teacher. Besides, FSP [42] estimates the Gram matrix across layers to reflect the data flow of how the teacher network learns.

III. PITFALL OF EXISTING VCR METHODS

VCR is challenging because the model is required to not only answer a question but also reason about the answer prediction. Accordingly, the answering and reasoning processes are complementary and inseparable from each other. However, existing methods often treat Q→A and QA→R separately, resulting in sub-optimal visual reasoning, as revealed in the following two aspects.

A. Language Shortcut Recognition

Besides the base Q→A model, representative approaches mostly adopt another independent model for QA→R. Since there is no connection between these two processes, we hereby raise a question: what kinds of clues do these methods employ, excluding the Q→A reasoning information? With this concern, as a first step, we recognize that the overlapping words between right answers and rationales dominate the rationale inference. For instance, in Figure 1, the correct rationale largely overlaps with the right answer, e.g., the '[person1]' and '[person2]' tags and the word 'speaking'. This may lead the models to predict the rationale based on these shortcuts rather than performing visual explanation for Q→A [17], [43]. The evidence is elaborated as follows.

1) QA→R Performance w/o Q: We input only the correct answer as the query to predict rationales, with other settings untouched, and show the results in Table I. One can observe that the three models degrade only slightly when the question input is removed. As VCR is a question-driven task, visual reasoning becomes meaningless under such input removal conditions.

TABLE I. Rationale Prediction Performance Comparison With and Without Questions.

2) Attention Distribution Over Queries: As attention plays an essential part in current VCR models, we then design experiments to explore the shortcut from this perspective. In this experiment, we consider only QA→R. Given the attention map W ∈ R^{(l_q + l_a) × l_r} for a query-rationale pair, wherein the query is the original question appended with the right answer, and l_q, l_a, and l_r respectively denote the lengths of the question, answer, and rationale, we calculate the attention contribution from the answer side only. Specifically, we find that the median attention value obtained from the answers of three methods (HGL, R2C, and CCN) on the validation set is 0.72, 0.78, and 0.86, respectively, indicating that these models pay more attention to the answers than to the holistic question-answer inputs. In this way, they mainly rely on the answer information to predict the right rationale, while the questions are somewhat ignored. Two examples produced by R2C are illustrated in Figure 2.
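To make the above measurement concrete, the snippet below is a minimal sketch (not the authors' released code) of how the answer-side attention contribution can be computed from one attention map W of shape (l_q + l_a) × l_r; normalizing each rationale column and aggregating examples by the median are assumptions chosen to be consistent with the reported statistics.

```python
import numpy as np

def answer_attention_share(attn: np.ndarray, len_q: int, len_a: int) -> float:
    """Fraction of attention mass that rationale tokens place on answer tokens.

    attn: attention map of shape (len_q + len_a, len_r), where rows index
          query tokens (the question followed by the right answer) and
          columns index rationale tokens.
    """
    assert attn.shape[0] == len_q + len_a
    # Normalize each rationale token's attention over the query tokens,
    # so every column sums to one (skip all-zero columns).
    col_sum = attn.sum(axis=0, keepdims=True)
    col_sum[col_sum == 0] = 1.0
    attn = attn / col_sum
    # Attention mass contributed by the answer rows only, averaged over
    # the rationale tokens of this example.
    answer_mass = attn[len_q:, :].sum(axis=0)  # shape: (len_r,)
    return float(answer_mass.mean())

# Hypothetical usage over a validation set: collect one share per example
# and report the median, mirroring the 0.72 / 0.78 / 0.86 values above.
# shares = [answer_attention_share(W, lq, la) for (W, lq, la) in val_attention_maps]
# print(np.median(shares))
```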
Fig. 2. Illustration of the attention distribution of QA→R from R2C. The orange line splits the question and the answer within the query.

B. Generalization on Out-of-Domain Data

In addition to the shortcut problem, we also study the generalization capability of existing methods on out-of-domain data. To implement this, we first rewrite some sentences (including questions, answers, and rationales) while maintaining their semantics. The Paraphrase-Online tool (https://www.paraphrase-online.com/) is leveraged to achieve this, whereby we substitute some verbs/nouns with synonyms. We show some examples in Table II. In the next step, we apply the above methods to the rewritten data and evaluate their performance. We also test the model performance with further BERT pre-training on the rewritten textual data. From the results in Table III, we find that all models degrade drastically on these new instances, implying that they largely overfit to the in-domain data while lacking generalization ability on the out-of-domain ones.

TABLE II. Rewritten Examples From Our Experiments.

TABLE III. Model Performance Comparison Over Two Versions of the Validation Set. Origin and Rewritten Denote That the Models Are Tested on the Original and Rewritten VCR Validation Set, Respectively. LP Represents Further Language Pre-Training on the Rewritten Textual Data.

Based on these two findings, we notice that current models fail to conduct visual reasoning based on the answering clues. They instead leverage superficial correlations to infer the rationale, or simply overfit to the in-domain data. To approach …

IV. METHOD

The above probing tests demonstrate that separately treating the two processes in VCR leads to unsatisfactory outcomes. To overcome this, we propose an ARC framework to couple Q→A and QA→R together. In this work, we introduce another new branch, namely QR→A, as a bridge to achieve this goal.

A. Preliminary

Before delving into the details of our ARC, we first outline the three processes involved in our framework and their corresponding learning functions.

1) Process Definition: Our framework contains three processes: two of them are formally defined by the original VCR [3] (i.e., Q→A and QA→R) and the other is a new process introduced by this work (QR→A). Note that all three processes are formulated in a multiple-choice format. That is, given an image I and a query related to the image, the model is required to select the right option from several candidate responses.

Q→A: The query is a question Q, and the candidate responses are a set of answer choices A. The objective is to select the right answer A+,

A^+ = \arg\max_{A_i \in \mathcal{A}} f^A(A_i \mid Q, I),  (1)

where f^A denotes the Q→A model.

QA→R: The query is the concatenation of the question Q and the right answer A+. A set of rationales R constitutes the candidate responses, and the model is expected to choose the right rationale R+,

R^+ = \arg\max_{R_i \in \mathcal{R}} f^R(R_i \mid Q, I, A^+),  (2)

where f^R denotes the QA→R model.

One can see that it is difficult to directly connect these two, since the involved parameters are not shared and the input to QA→R includes the ground-truth answer rather than the predicted one. In view of this, to bridge these two, we introduce QR→A as a proxy.
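For clarity, the following schematic expresses the three processes as a generic multiple-choice selection over candidate responses. The `score` callback, the token concatenation, and the function names are illustrative placeholders rather than the actual ARC implementation.

```python
from typing import Callable, List, Sequence

# score(image, query_tokens, response_tokens) -> float is assumed to wrap any
# VCR backbone (e.g., an R2C-like fusion module plus classifier); it is only a
# placeholder here.
ScoreFn = Callable[[object, Sequence[str], Sequence[str]], float]

def select(image, query: Sequence[str], candidates: List[Sequence[str]],
           score: ScoreFn) -> int:
    """Generic multiple-choice selection: arg max over candidate responses."""
    scores = [score(image, query, cand) for cand in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])

def q_to_a(image, question, answers, score: ScoreFn) -> int:
    # Q->A: the query is the question alone.
    return select(image, question, answers, score)

def qa_to_r(image, question, right_answer, rationales, score: ScoreFn) -> int:
    # QA->R: the query concatenates the question and the right answer.
    return select(image, list(question) + list(right_answer), rationales, score)

def qr_to_a(image, question, right_rationale, answers, score: ScoreFn) -> int:
    # QR->A (the bridging branch): the query concatenates the question and
    # the right rationale, and the candidates are the answer choices.
    return select(image, list(question) + list(right_rationale), answers, score)
```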
Fig. 3. The overall architecture of our proposed framework. There are three processes, i.e., Q→A, QR→A, and QA→R. We bridge these processes with two KD modules: KD-A is leveraged to align the predicted logits between QR→A and Q→A, and KD-R aims to maintain semantic consistency between QR→A and QA→R via a feature-level knowledge distillation operation.

QR→A: The query is the concatenation of the question Q and the right rationale R+, and the responses are the answer choices. The objective is

A^+ = \arg\max_{A_i \in \mathcal{A}} f^C(A_i \mid Q, I, R^+),  (3)

where f^C denotes the QR→A model. On the one hand, QR→A shares a consistent objective with Q→A. On the other hand, its reasoning should be similar to that of QA→R. These two factors make QR→A a good proxy for connecting Q→A and QA→R.

2) Training Pipeline: In general, for the feature extractors, a VCR model often uses a pre-trained CNN to obtain the visual features from the input image I, and an RNN-based or Transformer-based model to extract the textual features of the query and responses. Thereafter, a multi-modal fusion module is employed to obtain the joint representation, followed by a classifier that predicts the logit ỹ_i for response i.

To achieve the objectives in Equations 1 and 2, previous methods often separately optimize the following two cross-entropy losses,

\mathcal{L}^A = -\sum_{i=1}^{|\mathcal{A}|} y_i^A \log \frac{\exp \tilde{y}_i^A}{\sum_j \exp \tilde{y}_j^A}, \qquad \mathcal{L}^R = -\sum_{i=1}^{|\mathcal{R}|} y_i^R \log \frac{\exp \tilde{y}_i^R}{\sum_j \exp \tilde{y}_j^R},  (4)

where y_i^A and y_i^R denote the ground-truth labels of answer A_i and rationale R_i, respectively.

In existing VCR methods, the models f^A and f^R often share an identical architecture and are trained separately. As a result, the two processes are reduced to two independent VQA tasks, leaving no connection between answering and reasoning. In the following, we present our model-agnostic framework to couple the two successive processes, i.e., Q→A and QA→R.

B. Proposed Method

In order to effectively integrate the above three processes, we propose a dual-level knowledge distillation framework, illustrated in Figure 3. Specifically, after extracting the fused multi-modal features with the aforementioned backbones, two parallel KD modules are introduced to bridge QR→A with the other two processes.

Our first KD module aligns the predicted scores of QR→A and Q→A. The two processes share the same objective, and QR→A offers ample information for picking the right answer. We therefore take the predictions from QR→A as the teacher and the predictions from Q→A as the student. Our second KD module aligns the feature learning between QR→A and QA→R. Despite their different objectives, the reasoning of these two processes is actually similar: answering with the given right rationale is simply the reverse of reasoning given the right answer, which allows the two processes to share akin features.
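Before detailing KD-A, a minimal sketch of how the second, feature-level module (KD-R) could be realized is given below. The extracted text only states that the fused features of QR→A and QA→R should stay semantically consistent, so the mean-squared-error distance and the detachment of the QR→A side as the teacher are assumptions, not the paper's stated formulation.

```python
import torch
import torch.nn.functional as F

def kd_r_loss(feat_qra: torch.Tensor, feat_qar: torch.Tensor) -> torch.Tensor:
    """Feature-level distillation between the QR->A and QA->R branches (KD-R).

    feat_qra: fused multi-modal representation from the QR->A branch,
              treated here as the teacher side and therefore detached.
    feat_qar: fused multi-modal representation from the QA->R branch.
    The MSE distance is an assumed choice; the paper only requires the two
    branches to keep semantically consistent (akin) features.
    """
    return F.mse_loss(feat_qar, feat_qra.detach())
```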
1) Knowledge Distillation Between QR→A and Q→A (KD-A): Since the rationale is predicted to explain the right answer, we empirically leverage the other direction, i.e., incorporating the correct rationale with the given question for answer prediction. Intuitively, when combined with the correct rationale, the answering confidence should be enhanced. In light of this, we simply take the output logits from QR→A as the teacher and use them to guide the learning of Q→A.

Specifically, in this KD module, the knowledge is encoded and transferred in the form of softened scores: it aligns the output probability p^A (from the student f^A) with p^C (from the teacher f^C). To avoid the over-confidence problem [36], a relaxation temperature T > 1 is introduced to soften the logits ỹ^C from f^C. The same relaxation is applied to the output logits ỹ^A of f^A,

p_i^C = \frac{\exp(\tilde{y}_i^C / T)}{\sum_{j=1}^{|\mathcal{A}|} \exp(\tilde{y}_j^C / T)}, \qquad p_i^A = \frac{\exp(\tilde{y}_i^A / T)}{\sum_{j=1}^{|\mathcal{A}|} \exp(\tilde{y}_j^A / T)}.  (5)

Algorithm 1: Training Procedure of ARC.
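Following Eq. (5), the sketch below illustrates one way the KD-A objective can be written with the temperature-softened distributions. The extracted text stops after Eq. (5), so the KL-divergence form and the T² scaling (standard practice for logit distillation [34]) are assumptions rather than the paper's exact loss, and `lambda_a` / `lambda_r` in the commented total objective are hypothetical trade-off weights.

```python
import torch
import torch.nn.functional as F

def kd_a_loss(student_logits_qa: torch.Tensor,
              teacher_logits_qra: torch.Tensor,
              temperature: float = 2.0) -> torch.Tensor:
    """Logit-level distillation from QR->A (teacher) to Q->A (student), KD-A.

    Both tensors have shape (batch_size, |A|). The softened probabilities
    follow Eq. (5): a softmax over the answer candidates with T > 1.
    """
    log_p_student = F.log_softmax(student_logits_qa / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits_qra.detach() / temperature, dim=-1)
    # KL(teacher || student); the T^2 factor keeps gradient magnitudes on a
    # scale comparable to the cross-entropy terms of Eq. (4) -- an assumed choice.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Hypothetical overall objective combining Eq. (4) with the two KD modules:
# total_loss = loss_a + loss_r + lambda_a * kd_a_loss(...) + lambda_r * kd_r_loss(...)
```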
TABLE IV. Performance Comparison on VCR Validation and Testing Sets.

TABLE V. Ablation Study of the Proposed Method on the VCR Validation Set. KD-R and KD-A Denote the KD Between QR→A and QA→R, and Between QR→A and Q→A, Respectively.
Fig. 4. Qualitative results from R2C and our method. The predicted probability of each option is shown on the right.

Fig. 5. Attention distributions from the QA→R models of the baselines and ours.
Fig. 6. Validation set instance distribution and model performance according to question types.
D. Model Performance on Out-of-Domain Data

As previously shown in Table III, all existing models perform much less favorably on the out-of-domain data. To examine whether coupling answering and reasoning can improve the generalization of these models, we applied our method to these data and report the results in Table VI. In addition, we also pre-trained the BERT model on the rewritten textual data to test whether language pre-training brings further benefits. As can be observed, with our proposed ARC, all three models obtain some performance improvements. For instance, our method gains 3.6%, 2.5%, and 3.3% absolute improvements over R2C in the w/o language pre-training setting on the three accuracy metrics, respectively. It is evident that our method demonstrates better generalization capability on these skewed data.

E. Results w.r.t. Question Types

Figure 6 illustrates the question types (extracted by the corresponding matching words) in the validation set. We then show the performance of R2C and our method with respect to question types. In a nutshell, our method achieves consistent improvements on almost all categories. In particular, compared with binary questions like is and do, our method shows more advantage on the more challenging where, how, and what questions. However, both methods struggle with how questions, as these demand high-level visual understanding and are therefore difficult to address.

VII. CONCLUSION AND FUTURE WORK

Existing VCR models perform the answering and explaining processes in a separate manner, leading to poor generalization and undesirable language shortcuts between answers and rationales. This paper first discusses the disadvantages of the separate training strategy, followed by a novel knowledge distillation framework to couple the two processes. Our framework consists of two KD modules, i.e., KD-A and KD-R, where the former is leveraged to align the predicted logits between Q→A and QR→A, and the latter aims to maintain semantic consistency between QA→R and QR→A with feature-level knowledge distillation. We apply this framework to several state-of-the-art baselines and study its effectiveness on the VCR benchmark dataset. With the quantitative and qualitative experimental results, the viability of jointly training Q→A and QR→A is explicitly verified.

Regarding future directions, since this work demonstrates the potential of process coupling for enhancing visual understanding, studying solutions for jointly training more model components, such as the attention module, is promising.

REFERENCES

[1] S. Antol et al., "VQA: Visual question answering," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2425–2433.
[2] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei, "Visual7W: Grounded question answering in images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4995–5004.
[3] R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi, "From recognition to cognition: Visual commonsense reasoning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 6713–6724.
[4] A. Wu, L. Zhu, Y. Han, and Y. Yang, "Connective cognition network for directional visual commonsense reasoning," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 5670–5680.
[5] J. Lu, D. Batra, D. Parikh, and S. Lee, "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 13–23.
[6] F. Yu et al., "ERNIE-ViL: Knowledge enhanced vision-language representations through scene graphs," in Proc. AAAI Conf. Artif. Intell., 2021, pp. 3208–3216.
[7] P. Sharma, N. Ding, S. Goodman, and R. Soricut, "Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning," in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics (Long Papers), vol. 1, 2018, pp. 2556–2565.
[8] M. Malinowski, M. Rohrbach, and M. Fritz, "Ask your neurons: A neural-based approach to answering questions about images," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1–9.
[9] Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. van den Hengel, "Visual question answering: A survey of methods and datasets," Comput. Vis. Image Understand., vol. 163, pp. 21–40, Oct. 2017.
[10] P. Anderson et al., "Bottom-up and top-down attention for image captioning and visual question answering," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6077–6086.
[11] W. Guo, Y. Zhang, J. Yang, and X. Yuan, "Re-attention for visual question answering," IEEE Trans. Image Process., vol. 30, pp. 6730–6743, 2021.
[12] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, "Neural module networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 39–48.
[13] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick, "CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1988–1997.
[14] P. Wang, Q. Wu, C. Shen, A. Dick, and A. van den Hengel, "FVQA: Fact-based visual question answering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 10, pp. 2413–2427, Oct. 2018.
[15] K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi, "OK-VQA: A visual question answering benchmark requiring external knowledge," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 3190–3199.
[16] R. R. Selvaraju et al., "Taking a HINT: Leveraging explanations to make vision and language models more grounded," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 2591–2600.
[17] Y. Guo, L. Nie, Z. Cheng, F. Ji, J. Zhang, and A. Del Bimbo, "AdaVQA: Overcoming language priors with adapted margin cosine loss," in Proc. 13th Int. Joint Conf. Artif. Intell., Aug. 2021, pp. 708–714.
[18] A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi, "Don't just assume; look and answer: Overcoming priors for visual question answering," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4971–4980.
[19] Q. Li, Q. Tao, S. R. Joty, J. Cai, and J. Luo, "VQA-E: Explaining, elaborating, and enhancing your answers for visual questions," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2018, pp. 570–586.
[20] B. N. Patro and V. P. Namboodiri, "Explanation vs attention: A two-player game to obtain attention for VQA," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 11848–11855.
[21] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 618–626.
[22] B. Patro, M. Lunayach, S. Patel, and V. Namboodiri, "U-CAM: Visual explanation using uncertainty based class activation maps," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 7443–7452.
[23] A. Das, H. Agrawal, L. Zitnick, D. Parikh, and D. Batra, "Human attention in visual question answering: Do humans and deep networks look at the same regions?" Comput. Vis. Image Understand., vol. 163, pp. 90–100, Oct. 2017.
[24] D. H. Park et al., "Multimodal explanations: Justifying decisions and pointing to the evidence," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8779–8788.
[25] J. Wu and R. Mooney, "Faithful multimodal explanation for visual question answering," in Proc. ACL Workshop BlackboxNLP: Analyzing Interpreting Neural Netw. (NLP), 2019, pp. 103–112.
[26] W. Yu, J. Zhou, W. Yu, X. Liang, and N. Xiao, "Heterogeneous graph learning for visual commonsense reasoning," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 2765–2775.
[27] X. Zhang, F. Zhang, and C. Xu, "Explicit cross-modal representation learning for visual commonsense reasoning," IEEE Trans. Multimedia, vol. 24, pp. 2986–2997, 2022.
[28] J. Zhu and H. Wang, "Multiscale conditional relationship graph network for referring relationships in images," IEEE Trans. Cognit. Develop. Syst., vol. 14, no. 2, pp. 752–760, Jun. 2022.
[29] X. Zhang, F. Zhang, and C. Xu, "Multi-level counterfactual contrast for visual commonsense reasoning," in Proc. 29th ACM Int. Conf. Multimedia, Oct. 2021, pp. 1793–1802.
[30] Y. Chen et al., "UNITER: Universal image-text representation learning," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2020, pp. 104–120.
[31] W. Su et al., "VL-BERT: Pre-training of generic visual-linguistic representations," in Proc. Int. Conf. Learn. Represent. (ICLR), 2020, pp. 1–16.
[32] J. Lu, V. Goswami, M. Rohrbach, D. Parikh, and S. Lee, "12-in-1: Multi-task vision and language representation learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10434–10443.
[33] R. Zellers et al., "MERLOT RESERVE: Neural script knowledge through vision and language and sound," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 16354–16366.
[34] G. E. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," 2015, arXiv:1503.02531.
[35] Y. Shang, B. Duan, Z. Zong, L. Nie, and Y. Yan, "Lipschitz continuity guided knowledge distillation," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 10655–10664.
[36] J. Ba and R. Caruana, "Do deep nets really need to be deep?" in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2654–2662.
[37] P. Liu, W. Liu, H. Ma, Z. Jiang, and M. Seok, "KTAN: Knowledge transfer adversarial network," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2020, pp. 1–7.
[38] Z. Shen, Z. He, and X. Xue, "MEAL: Multi-model ensemble via adversarial learning," in Proc. AAAI Conf. Artif. Intell., 2019, pp. 4886–4893.
[39] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "FitNets: Hints for thin deep nets," in Proc. Int. Conf. Learn. Represent. (ICLR), 2015, pp. 1–13.
[40] S. Zagoruyko and N. Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," in Proc. Int. Conf. Learn. Represent. (ICLR), 2017, pp. 1–13.
[41] X. Jin et al., "Knowledge distillation via route constrained optimization," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 1345–1354.
[42] J. Yim, D. Joo, J. Bae, and J. Kim, "A gift from knowledge distillation: Fast optimization, network minimization and transfer learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7130–7138.
[43] Y. Guo, L. Nie, Z. Cheng, Q. Tian, and M. Zhang, "Loss re-scaling VQA: Revisiting the language prior problem from a class-imbalance view," IEEE Trans. Image Process., vol. 31, pp. 227–238, 2022.
[44] J. Lin, U. Jain, and A. G. Schwing, "TAB-VCR: Tags and attributes based VCR baselines," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 15589–15602.
[45] A. Rohrbach et al., "Movie description," Int. J. Comput. Vis., vol. 123, no. 1, pp. 94–120, 2017.
[46] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[47] A. Jabri, A. Joulin, and L. van der Maaten, "Revisiting visual question answering baselines," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 727–739.
[48] J. Kim, K. W. On, W. Lim, J. Kim, J. Ha, and B. Zhang, "Hadamard product for low-rank bilinear pooling," in Proc. Int. Conf. Learn. Represent. (ICLR), 2017, pp. 1–14.
[49] H. Ben-Younes, R. Cadene, M. Cord, and N. Thome, "MUTAN: Multimodal Tucker fusion for visual question answering," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2631–2639.
[50] L. H. Li, M. Yatskar, D. Yin, C. Hsieh, and K. Chang, "VisualBERT: A simple and performant baseline for vision and language," 2019, arXiv:1908.03557.

Zhenyang Li received the B.Eng. degree from Shandong University and the master's degree from the University of Chinese Academy of Sciences. He is currently pursuing the Ph.D. degree with the School of Computer Science and Technology, Shandong University, supervised by Prof. Liqiang Nie. His research interests include multi-modal computing, especially visual question answering.

Yangyang Guo (Member, IEEE) is currently a Research Fellow with the National University of Singapore. He has authored or coauthored several articles in top journals, such as IEEE Transactions on Image Processing, IEEE Transactions on Multimedia, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Neural Networks and Learning Systems, and ACM TOIS. He was recognized as an Outstanding Reviewer for IEEE Transactions on Multimedia and WSDM 2022. He is a regular reviewer for journals including IEEE Transactions on Image Processing, IEEE Transactions on Multimedia, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Circuits and Systems for Video Technology, ACM TOIS, and ACM ToMM.

Kejie Wang is currently pursuing the B.Eng. degree in computer science with Shandong University. His research interests include visual question answering and computer vision.
Yinwei Wei (Member, IEEE) received the M.S. degree from Tianjin University and the Ph.D. degree from Shandong University. He is currently a Research Fellow with NExT, National University of Singapore. Several of his works have been published in top forums, such as ACM MM, IEEE Transactions on Multimedia, and IEEE Transactions on Image Processing. His research interests include multimedia computing and recommendation. He has served as a PC member for several conferences, such as MM, AAAI, and IJCAI, and as a reviewer for IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE Transactions on Image Processing, and IEEE Transactions on Multimedia.

… He is a member of the ICME Steering Committee. He has received many awards, such as the ACM MM and SIGIR Best Paper Honorable Mention in 2019, the SIGMM Rising Star in 2020, the TR35 China 2020, the DAMO Academy Young Fellow in 2020, and the SIGIR Best Student Paper in 2021. Meanwhile, he is a regular Area Chair of ACM MM, NeurIPS, IJCAI, and AAAI. He is an Associate Editor of IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Multimedia, IEEE Transactions on Circuits and Systems for Video Technology, ACM ToMM, and Information Sciences.