Joint Answering and Explanation for Visual Commonsense Reasoning
IEEE Transactions on Image Processing, Vol. 32, 2023
Abstract— Visual Commonsense Reasoning (VCR), deemed a challenging extension of Visual Question Answering (VQA), pursues a higher level of visual comprehension. VCR includes two complementary processes: question answering over a given image and rationale inference that explains the chosen answer. Over the years, a variety of VCR methods have pushed forward advancements on the benchmark dataset. Despite the significance of these methods, they often treat the two processes separately and hence decompose VCR into two unrelated VQA instances. As a result, the pivotal connection between question answering and rationale inference is broken, rendering existing efforts less faithful to visual reasoning. To study this issue, we perform in-depth empirical explorations in terms of both language shortcuts and generalization capability. Based on our findings, we then propose a plug-and-play knowledge-distillation-enhanced framework to couple the question answering and rationale inference processes. The key contribution lies in the introduction of a new branch, which serves as a relay to bridge the two processes. Given that our framework is model-agnostic, we apply it to existing popular baselines and validate its effectiveness on the benchmark dataset. As demonstrated in the experimental results, when equipped with our method, these baselines all achieve consistent and significant performance improvements, clearly verifying the viability of coupling the two processes.

Index Terms— Visual commonsense reasoning, language shortcut, knowledge distillation.

I. INTRODUCTION

… prediction. In light of this, the task of Visual Commonsense Reasoning (VCR) [3] has recently been introduced to bridge this gap. Beyond answering cognition-level questions (Q→A) as canonical VQA does, VCR further requires predicting a rationale for the right answer option (QA→R), as shown in Figure 1.

VCR is more challenging than VQA in roughly two aspects: 1) On the data side, the images in the VCR dataset describe more sophisticated scenes (e.g., social interactions or mental states). Therefore, the collected questions are rather difficult and often demand high-level visual reasoning capabilities (e.g., why or how). 2) On the task side, it is difficult to holistically predict both the right answer and the rationale. VCR models should first predict the right answer, based on which the acceptable rationale can be further inferred from a few candidates. As question answering about such complex scenes has already proved non-trivial, simultaneously inferring the right rationale in VCR leads to even more difficulties.

In order to address the above research challenges, several prior efforts have been dedicated to VCR over the past few years. The initial attempts contribute to designing specific architectures to approach VCR, such as R2C [3] and CCN [4]. In addition, recent BERT-based pre-training approaches, such as ViLBERT [5] and ERNIE-ViL [6], are reckoned as a better …
… architectures [3], [4], [26], [27]. For instance, R2C [3] adopts a three-step fashion: grounding texts with respect to the involved objects, contextualizing the answer with the corresponding question and objects, and reasoning over the shared representation. Inspired by the neuron connectivity of brains, CCN [4] dynamically models the visual neuron connectivity, which is contextualized by the queries and responses. HGL [26] leverages dual heterogeneous graphs, i.e., a vision-to-answer graph and a question-to-answer graph, to seamlessly bridge vision and language. Follow-up approaches explore more fine-grained cues, such as ECMR [27], which incorporates syntactic information into visual reasoning [28], and MLCC [29], which uses counterfactual thinking to generate informative samples.

Thereafter, BERT-based pre-training approaches have been extensively explored in the vision-and-language domain. Most of these methods employ a pretrain-then-finetune scheme and achieve significant performance improvements on VQA benchmarks including VCR [5], [30], [31]. The models are often first pre-trained on large-scale vision-and-language datasets (such as Conceptual Captions [7]), and then fine-tuned on the downstream VCR task. For instance, some models are pre-trained on image-text datasets to align the visual-linguistic clues through single-stream architectures (e.g., VL-BERT [31], UNITER [30], 12-in-1 [32]). MERLOT RESERVE [33] introduces more modalities, i.e., sound and video, into cross-modal pre-training, and significantly outperforms other methods.

However, one limitation still prevents these methods from further advancement, namely that the Q→A and QA→R processes are tackled independently. Such a strategy turns answer prediction and rationale inference into two independent VQA tasks. In this work, we propose to address this issue by combining the two processes.

C. Knowledge Distillation

The past few years have witnessed the noticeable development of KD [34], [35]. As an effective model compression tool, KD transfers knowledge from a cumbersome large network (the teacher) to a more compact network (the student). Based on its application scope, previous KD methods can be roughly categorized into two groups: logit-based and feature-based. The logit-based methods [34], [36] encourage the student to imitate the output of a teacher model. For example, the vanilla KD utilizes the softened logits from a pre-trained teacher network as extra supervision to instruct the student [34]. In contrast, feature-based methods [35], [37], [38] attempt to transfer knowledge via the intermediate features of the two networks. FitNet [39] directly aligns the embeddings of each input. Attention Transfer [40] extends FitNet from embeddings to attention matrices at different levels of feature maps. To close the performance gap between the teacher and student, RCO [41] presents route-constrained hint learning, which supervises the student with the outputs of hint layers from the teacher. Besides, FSP [42] estimates the Gram matrix across layers to reflect the data flow of how the teacher network learns.

III. PITFALL OF EXISTING VCR METHODS

VCR is challenging because the model is required to not only answer a question but also reason about the answer prediction. Accordingly, the answering and reasoning processes are complementary and inseparable from each other. However, existing methods often treat Q→A and QA→R separately, resulting in sub-optimal visual reasoning, as revealed in the following two aspects.

A. Language Shortcut Recognition

Besides the base Q→A model, representative approaches mostly adopt another independent model for QA→R. Since there is no connection between these two processes, we hereby raise a question: what kinds of clues do these methods employ, excluding the Q→A reasoning information? With this concern, as a first step, we recognize that the overlapping words between right answers and rationales dominate the rationale inference. For instance, in Figure 1, the correct rationale largely overlaps with the right answer, e.g., the '[person1]' and '[person2]' tags and the word 'speaking'. This may lead the models to predict the rationale based on these shortcuts rather than performing visual explanation for Q→A [17], [43]. The evidence is elaborated as follows.

1) QA→R Performance w/o Q: We input only the correct answer as the query to predict rationales, with other settings untouched, and show the results in Table I. One can observe that the three models degrade only slightly when the question input is removed. As VCR is a question-driven task, visual reasoning becomes meaningless under such input removal conditions.

TABLE I. Rationale Prediction Performance Comparison With and Without Questions.

2) Attention Distribution Over Queries: As attention plays an essential part in current VCR models, we then design experiments to explore the shortcut from this perspective. In this experiment, we consider only QA→R. Given the attention map W ∈ R^{(l_q + l_a) × l_r} for a query-rationale pair, wherein the query is the original question appended with the right answer, and l_q, l_a, and l_r respectively denote the lengths of the question, answer, and rationale, we calculate the attention contribution from the answer side only. Specifically, we find that the median attention value obtained from the answers of three methods (HGL, R2C, and CCN) on the validation set is 0.72, 0.78, and 0.86, respectively, indicating that these models pay more attention to the answers than to the holistic question-answer inputs. In this way, they mainly rely on the answer information to predict the right rationale, while the questions are somewhat ignored. Two examples produced by R2C are illustrated in Figure 2.
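To make the above measurement concrete, the snippet below is a minimal sketch (not the authors' released code) of how the answer-side attention contribution can be computed from one attention map W of shape (l_q + l_a) × l_r; normalizing each rationale column and aggregating examples by the median are assumptions chosen to be consistent with the reported statistics.

```python
import numpy as np

def answer_attention_share(attn: np.ndarray, len_q: int, len_a: int) -> float:
    """Fraction of attention mass that rationale tokens place on answer tokens.

    attn: attention map of shape (len_q + len_a, len_r), where rows index
          query tokens (the question followed by the right answer) and
          columns index rationale tokens.
    """
    assert attn.shape[0] == len_q + len_a
    # Normalize each rationale token's attention over the query tokens,
    # so every column sums to one (skip all-zero columns).
    col_sum = attn.sum(axis=0, keepdims=True)
    col_sum[col_sum == 0] = 1.0
    attn = attn / col_sum
    # Attention mass contributed by the answer rows only, averaged over
    # the rationale tokens of this example.
    answer_mass = attn[len_q:, :].sum(axis=0)  # shape: (len_r,)
    return float(answer_mass.mean())

# Hypothetical usage over a validation set: collect one share per example
# and report the median, mirroring the 0.72 / 0.78 / 0.86 values above.
# shares = [answer_attention_share(W, lq, la) for (W, lq, la) in val_attention_maps]
# print(np.median(shares))
```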
Fig. 2. Illustration of the attention distribution of QA→R from R2C. The orange line splits the question and the answer within the query.

B. Generalization on Out-of-Domain Data

In addition to the shortcut problem, we also study the generalization capability of existing methods on out-of-domain data. To implement this, we first rewrite some sentences (including questions, answers, and rationales) while maintaining their semantics. The Paraphrase-Online tool (https://www.paraphrase-online.com/) is leveraged to achieve this, whereby we substitute some verbs/nouns with synonyms. We show some examples in Table II. In the next step, we apply the above methods to the rewritten data and evaluate their performance. We also test the model performance with further BERT pre-training on the rewritten textual data. From the results in Table III, we find that all models degrade drastically on these new instances, implying that they largely overfit to the in-domain data while lacking generalization ability on the out-of-domain ones.

TABLE II. Rewritten Examples From Our Experiments.

TABLE III. Model Performance Comparison Over Two Versions of the Validation Set. Origin and Rewritten Denote That the Models Are Tested on the Original and Rewritten VCR Validation Set, Respectively. LP Represents Further Language Pre-Training on the Rewritten Textual Data.

Based on these two findings, we notice that current models fail to conduct visual reasoning based on the answering clues. They instead leverage superficial correlations to infer the rationale, or simply overfit to the in-domain data. To approach …

IV. METHOD

The above probing tests demonstrate that separately treating the two processes in VCR leads to unsatisfactory outcomes. To overcome this, we propose an ARC framework to couple Q→A and QA→R together. In this work, we introduce another new branch, namely QR→A, as a bridge to achieve this goal.

A. Preliminary

Before delving into the details of our ARC, we first outline the three processes involved in our framework and their corresponding learning functions.

1) Process Definition: Our framework contains three processes: two of them are formally defined by the original VCR [3] (i.e., Q→A and QA→R) and the other is a new process introduced by this work (QR→A). Note that all three processes are formulated in a multiple-choice format. That is, given an image I and a query related to the image, the model is required to select the right option from several candidate responses.

Q→A: The query is a question Q, and the candidate responses are a set of answer choices A. The objective is to select the right answer A+,

A^+ = \arg\max_{A_i \in \mathcal{A}} f^A(A_i \mid Q, I),  (1)

where f^A denotes the Q→A model.

QA→R: The query is the concatenation of the question Q and the right answer A+. A set of rationales R constitutes the candidate responses, and the model is expected to choose the right rationale R+,

R^+ = \arg\max_{R_i \in \mathcal{R}} f^R(R_i \mid Q, I, A^+),  (2)

where f^R denotes the QA→R model.

One can see that it is difficult to directly connect these two, since the involved parameters are not shared and the input to QA→R includes the ground-truth answer rather than the predicted one. In view of this, to bridge these two, we introduce QR→A as a proxy.
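For clarity, the following schematic expresses the three processes as a generic multiple-choice selection over candidate responses. The `score` callback, the token concatenation, and the function names are illustrative placeholders rather than the actual ARC implementation.

```python
from typing import Callable, List, Sequence

# score(image, query_tokens, response_tokens) -> float is assumed to wrap any
# VCR backbone (e.g., an R2C-like fusion module plus classifier); it is only a
# placeholder here.
ScoreFn = Callable[[object, Sequence[str], Sequence[str]], float]

def select(image, query: Sequence[str], candidates: List[Sequence[str]],
           score: ScoreFn) -> int:
    """Generic multiple-choice selection: arg max over candidate responses."""
    scores = [score(image, query, cand) for cand in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])

def q_to_a(image, question, answers, score: ScoreFn) -> int:
    # Q->A: the query is the question alone.
    return select(image, question, answers, score)

def qa_to_r(image, question, right_answer, rationales, score: ScoreFn) -> int:
    # QA->R: the query concatenates the question and the right answer.
    return select(image, list(question) + list(right_answer), rationales, score)

def qr_to_a(image, question, right_rationale, answers, score: ScoreFn) -> int:
    # QR->A (the bridging branch): the query concatenates the question and
    # the right rationale, and the candidates are the answer choices.
    return select(image, list(question) + list(right_rationale), answers, score)
```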
Fig. 3. The overall architecture of our proposed framework. There are three processes, i.e., Q→A, QR→A, and QA→R. We bridge these processes with two KD modules: KD-A is leveraged to align the predicted logits between QR→A and Q→A, and KD-R aims to maintain semantic consistency between QR→A and QA→R via a feature-level knowledge distillation operation.

QR→A: The query is the concatenation of the question Q and the right rationale R+, and the responses are the answer choices. The objective is

A^+ = \arg\max_{A_i \in \mathcal{A}} f^C(A_i \mid Q, I, R^+),  (3)

where f^C denotes the QR→A model. On the one hand, QR→A shares a consistent objective with Q→A. On the other hand, its reasoning should be similar to that of QA→R. These two factors make QR→A a good proxy for connecting Q→A and QA→R.

2) Training Pipeline: In general, for the feature extractors, a VCR model often uses a pre-trained CNN to obtain the visual features from the input image I, and an RNN-based or Transformer-based model to extract the textual features of the query and responses. Thereafter, a multi-modal fusion module is employed to obtain the joint representation, followed by a classifier that predicts the logit ỹ_i for response i.

To achieve the objectives in Equations 1 and 2, previous methods often separately optimize the following two cross-entropy losses,

\mathcal{L}^A = -\sum_{i=1}^{|\mathcal{A}|} y_i^A \log \frac{\exp \tilde{y}_i^A}{\sum_j \exp \tilde{y}_j^A}, \qquad \mathcal{L}^R = -\sum_{i=1}^{|\mathcal{R}|} y_i^R \log \frac{\exp \tilde{y}_i^R}{\sum_j \exp \tilde{y}_j^R},  (4)

where y_i^A and y_i^R denote the ground-truth labels of answer A_i and rationale R_i, respectively.

In existing VCR methods, the models f^A and f^R often share an identical architecture and are trained separately. As a result, the two processes are reduced to two independent VQA tasks, leaving no connection between answering and reasoning. In the following, we present our model-agnostic framework to couple the two successive processes, i.e., Q→A and QA→R.

B. Proposed Method

In order to effectively integrate the above three processes, we propose a dual-level knowledge distillation framework, illustrated in Figure 3. Specifically, after extracting the fused multi-modal features with the aforementioned backbones, two parallel KD modules are introduced to bridge QR→A with the other two processes.

Our first KD module aligns the predicted scores of QR→A and Q→A. The two processes share the same objective, and QR→A offers ample information for picking the right answer. We therefore take the predictions from QR→A as the teacher and the predictions from Q→A as the student. Our second KD module aligns the feature learning between QR→A and QA→R. Despite their different objectives, the reasoning of these two processes is actually similar: answering with the given right rationale is simply the reverse of reasoning given the right answer, which allows the two processes to share akin features.
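Before detailing KD-A, a minimal sketch of how the second, feature-level module (KD-R) could be realized is given below. The extracted text only states that the fused features of QR→A and QA→R should stay semantically consistent, so the mean-squared-error distance and the detachment of the QR→A side as the teacher are assumptions, not the paper's stated formulation.

```python
import torch
import torch.nn.functional as F

def kd_r_loss(feat_qra: torch.Tensor, feat_qar: torch.Tensor) -> torch.Tensor:
    """Feature-level distillation between the QR->A and QA->R branches (KD-R).

    feat_qra: fused multi-modal representation from the QR->A branch,
              treated here as the teacher side and therefore detached.
    feat_qar: fused multi-modal representation from the QA->R branch.
    The MSE distance is an assumed choice; the paper only requires the two
    branches to keep semantically consistent (akin) features.
    """
    return F.mse_loss(feat_qar, feat_qra.detach())
```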
1) Knowledge Distillation Between QR→A and Q→A (KD-A): Since the rationale is predicted to explain the right answer, we empirically leverage the other direction, i.e., incorporating the correct rationale with the given question for answer prediction. Intuitively, when combined with the correct rationale, the answering confidence should be enhanced. In light of this, we simply take the output logits from QR→A as the teacher and use them to guide the learning of Q→A.

Specifically, in this KD module, the knowledge is encoded and transferred in the form of softened scores: it aligns the output probability p^A (from the student f^A) with p^C (from the teacher f^C). To avoid the over-confidence problem [36], a relaxation temperature T > 1 is introduced to soften the logits ỹ^C from f^C. The same relaxation is applied to the output logits ỹ^A of f^A,

p_i^C = \frac{\exp(\tilde{y}_i^C / T)}{\sum_{j=1}^{|\mathcal{A}|} \exp(\tilde{y}_j^C / T)}, \qquad p_i^A = \frac{\exp(\tilde{y}_i^A / T)}{\sum_{j=1}^{|\mathcal{A}|} \exp(\tilde{y}_j^A / T)}.  (5)

Algorithm 1: Training Procedure of ARC.
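Following Eq. (5), the sketch below illustrates one way the KD-A objective can be written with the temperature-softened distributions. The extracted text stops after Eq. (5), so the KL-divergence form and the T² scaling (standard practice for logit distillation [34]) are assumptions rather than the paper's exact loss, and `lambda_a` / `lambda_r` in the commented total objective are hypothetical trade-off weights.

```python
import torch
import torch.nn.functional as F

def kd_a_loss(student_logits_qa: torch.Tensor,
              teacher_logits_qra: torch.Tensor,
              temperature: float = 2.0) -> torch.Tensor:
    """Logit-level distillation from QR->A (teacher) to Q->A (student), KD-A.

    Both tensors have shape (batch_size, |A|). The softened probabilities
    follow Eq. (5): a softmax over the answer candidates with T > 1.
    """
    log_p_student = F.log_softmax(student_logits_qa / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits_qra.detach() / temperature, dim=-1)
    # KL(teacher || student); the T^2 factor keeps gradient magnitudes on a
    # scale comparable to the cross-entropy terms of Eq. (4) -- an assumed choice.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Hypothetical overall objective combining Eq. (4) with the two KD modules:
# total_loss = loss_a + loss_r + lambda_a * kd_a_loss(...) + lambda_r * kd_r_loss(...)
```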
TABLE IV. Performance Comparison on VCR Validation and Testing Sets.

TABLE V. Ablation Study of the Proposed Method on the VCR Validation Set. KD-R and KD-A Denote the KD Between QR→A and QA→R, and Between QR→A and Q→A, Respectively.
Fig. 4. Qualitative results from R2C and our method. The predicted probability of each option is shown on the right.

Fig. 5. Attention distributions from the QA→R models of the baselines and ours.
Fig. 6. Validation set instance distribution and model performance according to question types.
D. Model Performance on Out-of-Domain Data

As previously shown in Table III, all existing models perform much less favorably on the out-of-domain data. To examine whether coupling answering and reasoning can improve the generalization of these models, we applied our method to these data and report the results in Table VI. In addition, we also pre-trained the BERT model on the rewritten textual data to test whether language pre-training brings further benefits. As can be observed, with our proposed ARC, all three models obtain some performance improvements. For instance, our method gains 3.6%, 2.5%, and 3.3% absolute improvements over R2C in the w/o language pre-training setting on the three accuracy metrics, respectively. It is evident that our method demonstrates better generalization capability on these skewed data.

E. Results w.r.t. Question Types

Figure 6 illustrates the question types (extracted by the corresponding matching words) in the validation set. We then show the performance of R2C and our method with respect to question types. In a nutshell, our method achieves consistent improvements on almost all categories. In particular, compared with binary questions like is and do, our method shows more advantage on the more challenging where, how, and what questions. However, both methods struggle with how questions, as these demand high-level visual understanding and are therefore difficult to address.

VII. CONCLUSION AND FUTURE WORK

Existing VCR models perform the answering and explaining processes in a separate manner, leading to poor generalization and undesirable language shortcuts between answers and rationales. This paper first discusses the disadvantages of the separate training strategy, followed by a novel knowledge distillation framework to couple the two processes. Our framework consists of two KD modules, i.e., KD-A and KD-R, where the former is leveraged to align the predicted logits between Q→A and QR→A, and the latter aims to maintain semantic consistency between QA→R and QR→A with feature-level knowledge distillation. We apply this framework to several state-of-the-art baselines and study its effectiveness on the VCR benchmark dataset. With the quantitative and qualitative experimental results, the viability of jointly training Q→A and QR→A is explicitly verified.

Regarding future directions, since this work demonstrates the potential of process coupling for enhancing visual understanding, studying solutions for jointly training more model components, such as the attention module, is promising.

REFERENCES

[1] S. Antol et al., "VQA: Visual question answering," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2425–2433.
[2] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei, "Visual7W: Grounded question answering in images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4995–5004.
[3] R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi, "From recognition to cognition: Visual commonsense reasoning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 6713–6724.
[4] A. Wu, L. Zhu, Y. Han, and Y. Yang, "Connective cognition network for directional visual commonsense reasoning," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 5670–5680.
[5] J. Lu, D. Batra, D. Parikh, and S. Lee, "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 13–23.
[6] F. Yu et al., "ERNIE-ViL: Knowledge enhanced vision-language representations through scene graphs," in Proc. AAAI Conf. Artif. Intell., 2021, pp. 3208–3216.
[7] P. Sharma, N. Ding, S. Goodman, and R. Soricut, "Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning," in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics (Long Papers), vol. 1, 2018, pp. 2556–2565.
[8] M. Malinowski, M. Rohrbach, and M. Fritz, "Ask your neurons: A neural-based approach to answering questions about images," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1–9.
[9] Q. Wu, D. Teney, P. Wang, C. Shen, A. Dick, and A. van den Hengel, "Visual question answering: A survey of methods and datasets," Comput. Vis. Image Understand., vol. 163, pp. 21–40, Oct. 2017.
[10] P. Anderson et al., "Bottom-up and top-down attention for image captioning and visual question answering," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 6077–6086.
[11] W. Guo, Y. Zhang, J. Yang, and X. Yuan, "Re-attention for visual question answering," IEEE Trans. Image Process., vol. 30, pp. 6730–6743, 2021.
[12] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, "Neural module networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 39–48.
[13] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick, "CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1988–1997.
[14] P. Wang, Q. Wu, C. Shen, A. Dick, and A. van den Hengel, "FVQA: Fact-based visual question answering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 10, pp. 2413–2427, Oct. 2018.
[15] K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi, "OK-VQA: A visual question answering benchmark requiring external knowledge," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 3190–3199.
[16] R. R. Selvaraju et al., "Taking a HINT: Leveraging explanations to make vision and language models more grounded," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 2591–2600.
[17] Y. Guo, L. Nie, Z. Cheng, F. Ji, J. Zhang, and A. Del Bimbo, "AdaVQA: Overcoming language priors with adapted margin cosine loss," in Proc. 13th Int. Joint Conf. Artif. Intell., Aug. 2021, pp. 708–714.
[18] A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi, "Don't just assume; look and answer: Overcoming priors for visual question answering," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4971–4980.
[19] Q. Li, Q. Tao, S. R. Joty, J. Cai, and J. Luo, "VQA-E: Explaining, elaborating, and enhancing your answers for visual questions," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2018, pp. 570–586.
[20] B. N. Patro and V. P. Namboodiri, "Explanation vs attention: A two-player game to obtain attention for VQA," in Proc. AAAI Conf. Artif. Intell., 2020, pp. 11848–11855.
[21] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, "Grad-CAM: Visual explanations from deep networks via gradient-based localization," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 618–626.
[22] B. Patro, M. Lunayach, S. Patel, and V. Namboodiri, "U-CAM: Visual explanation using uncertainty based class activation maps," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 7443–7452.
[23] A. Das, H. Agrawal, L. Zitnick, D. Parikh, and D. Batra, "Human attention in visual question answering: Do humans and deep networks look at the same regions?" Comput. Vis. Image Understand., vol. 163, pp. 90–100, Oct. 2017.
[24] D. H. Park et al., "Multimodal explanations: Justifying decisions and pointing to the evidence," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8779–8788.
[25] J. Wu and R. Mooney, "Faithful multimodal explanation for visual question answering," in Proc. ACL Workshop BlackboxNLP: Analyzing Interpreting Neural Netw. (NLP), 2019, pp. 103–112.
[26] W. Yu, J. Zhou, W. Yu, X. Liang, and N. Xiao, "Heterogeneous graph learning for visual commonsense reasoning," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 2765–2775.
[27] X. Zhang, F. Zhang, and C. Xu, "Explicit cross-modal representation learning for visual commonsense reasoning," IEEE Trans. Multimedia, vol. 24, pp. 2986–2997, 2022.
[28] J. Zhu and H. Wang, "Multiscale conditional relationship graph network for referring relationships in images," IEEE Trans. Cognit. Develop. Syst., vol. 14, no. 2, pp. 752–760, Jun. 2022.
[29] X. Zhang, F. Zhang, and C. Xu, "Multi-level counterfactual contrast for visual commonsense reasoning," in Proc. 29th ACM Int. Conf. Multimedia, Oct. 2021, pp. 1793–1802.
[30] Y. Chen et al., "UNITER: Universal image-text representation learning," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2020, pp. 104–120.
[31] W. Su et al., "VL-BERT: Pre-training of generic visual-linguistic representations," in Proc. Int. Conf. Learn. Represent. (ICLR), 2020, pp. 1–16.
[32] J. Lu, V. Goswami, M. Rohrbach, D. Parikh, and S. Lee, "12-in-1: Multi-task vision and language representation learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10434–10443.
[33] R. Zellers et al., "MERLOT RESERVE: Neural script knowledge through vision and language and sound," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 16354–16366.
[34] G. E. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," 2015, arXiv:1503.02531.
[35] Y. Shang, B. Duan, Z. Zong, L. Nie, and Y. Yan, "Lipschitz continuity guided knowledge distillation," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 10655–10664.
[36] J. Ba and R. Caruana, "Do deep nets really need to be deep?" in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2654–2662.
[37] P. Liu, W. Liu, H. Ma, Z. Jiang, and M. Seok, "KTAN: Knowledge transfer adversarial network," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2020, pp. 1–7.
[38] Z. Shen, Z. He, and X. Xue, "MEAL: Multi-model ensemble via adversarial learning," in Proc. AAAI Conf. Artif. Intell., 2019, pp. 4886–4893.
[39] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, "FitNets: Hints for thin deep nets," in Proc. Int. Conf. Learn. Represent. (ICLR), 2015, pp. 1–13.
[40] S. Zagoruyko and N. Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," in Proc. Int. Conf. Learn. Represent. (ICLR), 2017, pp. 1–13.
[41] X. Jin et al., "Knowledge distillation via route constrained optimization," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 1345–1354.
[42] J. Yim, D. Joo, J. Bae, and J. Kim, "A gift from knowledge distillation: Fast optimization, network minimization and transfer learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7130–7138.
[43] Y. Guo, L. Nie, Z. Cheng, Q. Tian, and M. Zhang, "Loss re-scaling VQA: Revisiting the language prior problem from a class-imbalance view," IEEE Trans. Image Process., vol. 31, pp. 227–238, 2022.
[44] J. Lin, U. Jain, and A. G. Schwing, "TAB-VCR: Tags and attributes based VCR baselines," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 15589–15602.
[45] A. Rohrbach et al., "Movie description," Int. J. Comput. Vis., vol. 123, no. 1, pp. 94–120, 2017.
[46] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[47] A. Jabri, A. Joulin, and L. van der Maaten, "Revisiting visual question answering baselines," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 727–739.
[48] J. Kim, K. W. On, W. Lim, J. Kim, J. Ha, and B. Zhang, "Hadamard product for low-rank bilinear pooling," in Proc. Int. Conf. Learn. Represent. (ICLR), 2017, pp. 1–14.
[49] H. Ben-Younes, R. Cadene, M. Cord, and N. Thome, "MUTAN: Multimodal Tucker fusion for visual question answering," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2631–2639.
[50] L. H. Li, M. Yatskar, D. Yin, C. Hsieh, and K. Chang, "VisualBERT: A simple and performant baseline for vision and language," 2019, arXiv:1908.03557.

Zhenyang Li received the B.Eng. degree from Shandong University and the master's degree from the University of Chinese Academy of Sciences. He is currently pursuing the Ph.D. degree with the School of Computer Science and Technology, Shandong University, supervised by Prof. Liqiang Nie. His research interests include multi-modal computing, especially visual question answering.

Yangyang Guo (Member, IEEE) is currently a Research Fellow with the National University of Singapore. He has authored or coauthored several articles in top journals, such as IEEE Transactions on Image Processing, IEEE Transactions on Multimedia, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Neural Networks and Learning Systems, and ACM TOIS. He was recognized as an Outstanding Reviewer for IEEE Transactions on Multimedia and WSDM 2022. He is a regular reviewer for journals including IEEE Transactions on Image Processing, IEEE Transactions on Multimedia, IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Circuits and Systems for Video Technology, ACM TOIS, and ACM ToMM.

Kejie Wang is currently pursuing the B.Eng. degree in computer science with Shandong University. His research interests include visual question answering and computer vision.
Yinwei Wei (Member, IEEE) received the M.S. degree from Tianjin University and the Ph.D. degree from Shandong University. He is currently a Research Fellow with NExT, National University of Singapore. Several of his works have been published in top forums, such as ACM MM, IEEE Transactions on Multimedia, and IEEE Transactions on Image Processing. His research interests include multimedia computing and recommendation. He has served as a PC member for several conferences, such as MM, AAAI, and IJCAI, and as a reviewer for IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE Transactions on Image Processing, and IEEE Transactions on Multimedia.

… He is a member of the ICME Steering Committee. He has received many awards, such as the ACM MM and SIGIR Best Paper Honorable Mention in 2019, the SIGMM Rising Star in 2020, the TR35 China 2020, the DAMO Academy Young Fellow in 2020, and the SIGIR Best Student Paper in 2021. Meanwhile, he is a regular Area Chair of ACM MM, NeurIPS, IJCAI, and AAAI. He is an Associate Editor of IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Multimedia, IEEE Transactions on Circuits and Systems for Video Technology, ACM ToMM, and Information Sciences.