
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 26, 2024

Multi-Modal Structure-Embedding Graph Transformer for Visual Commonsense Reasoning

Jian Zhu, Hanli Wang, Senior Member, IEEE, and Bin He, Member, IEEE

Abstract—Visual commonsense reasoning (VCR) is a challenging reasoning task that aims to not only answer the question based on a given image but also provide a rationale justifying the choice. Graph-based networks are appropriate to represent and extract the correlation between image and language for reasoning, where how to construct and learn graphs based on such multi-modal Euclidean data is a fundamental problem. Most existing graph-based methods view visual regions and linguistic words as identical graph nodes, ignoring the inherent characteristics of multi-modal data. In addition, these approaches typically only have one graph-learning layer, and the performance declines as the model goes deeper. To address these issues, a novel method named Multi-modal Structure-embedding Graph Transformer (MSGT) is proposed. Specifically, an answer-vision graph and an answer-question graph are constructed to represent and model intra-modal and inter-modal correlations in VCR simultaneously, where additional multi-modal structure representations are initialized and embedded according to visual region distances and linguistic word orders for more reasonable graph representation. Then, a structure-injecting graph transformer is designed to inject embedded structure priors into the semantic correlation matrix for the evolution of node features and structure representations, which can stack more layers to make the model deeper and extract more powerful features with instructive priors. To adaptively fuse graph features, a scored pooling mechanism is further developed to select valuable clues for reasoning from the learnt node features. Experiments demonstrate the superiority of the proposed MSGT framework compared with state-of-the-art methods on the VCR benchmark dataset.

Index Terms—Visual commonsense reasoning, multi-modal structure embedding, graph transformer, scored pooling.

Manuscript received 21 February 2023; revised 6 April 2023 and 17 May 2023; accepted 19 May 2023. Date of publication 24 May 2023; date of current version 18 January 2024. This work was supported in part by the National Natural Science Foundation of China under Grants 61976159 and 62133011, in part by the Shanghai Innovation Action Project of Science and Technology under Grant 20511100700, and in part by the Shanghai Municipal Science and Technology Major Project under Grant 2021SHZDZX0100. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Xinxiao Wu. (Corresponding author: Hanli Wang.) Jian Zhu and Hanli Wang are with the Department of Computer Science and Technology, Key Laboratory of Embedded System and Service Computing (Ministry of Education), Tongji University, Shanghai 200092, China (e-mail: [email protected]; [email protected]). Bin He is with the Frontiers Science Center for Intelligent Autonomous Systems, Shanghai 201210, China (e-mail: [email protected]). The source code of this work can be found at https://mic.tongji.edu.cn. Digital Object Identifier 10.1109/TMM.2023.3279691

I. INTRODUCTION

In recent years, visual and language tasks have attracted increasing attention from both academia and industry, such as referring expressions [1], visual question answering (VQA) [2], [3], visual grounding [4], [5], and image and video captioning [6], [7]. Most of these works still belong to recognition-level tasks that just provide proper predictions without convincing reasons such as likely intents and goals. For instance, referring expressions aims to localize the entities described by the language in an image, and inter-modal associations can be utilized for better recognition [1]; VQA is an n-way classification task to select a correct choice from answers (typically short phrases such as fresh, frozen) based on a given image and question. For further study, researchers have gradually focused on cognition-level reasoning tasks, and the visual commonsense reasoning (VCR) [8] task has been proposed. Different from VQA, VCR usually contains two four-way multiple-choice subtasks with the purpose of not only answering the question but also providing a rationale justifying the choice. In addition, the choices in VCR are represented with more complex visual and linguistic expressions instead of simple short phrases.

To tackle the VCR task, various models using holistic attention [8], [9], [10] and graph-based networks [11], [12], [13] have been proposed. For instance, Zellers et al. [8] utilized an attention mechanism to contextualize the answer with the given image and question for reasoning. Wen et al. [9] introduced an extra knowledge base to train attention modules and transferred the knowledge to attention modules for VCR. However, these methods dealt with questions and answers as sequential data, and it is difficult to model the correlation between words far apart at the mercy of word order. On the other hand, graph-based representations can address the above issue via constructing edges between nodes in the graph to correlate the words or objects with each other. Wu et al. [11] developed visual neuron connectivity to fully model the correlation of visual content and fused the sentence representation with that of visual neurons for a graph convolutional network (GCN) to infer answers or rationales. Zhang et al. [12] constructed graphs for visual objects and sentences respectively, and then integrated them for cross-modal reasoning. Different from the homogeneous graphs used above, heterogeneous graphs [14], [15] can simultaneously represent data from different domains, and Yu et al. [13] introduced the heterogeneous graph into VCR to build cross-modal correlations. However, the heterogeneous graph in [13] was just constructed based on the answer representations attended over visual objects or questions, which is not capable of modeling intra-modal and inter-modal correlations sufficiently.


Fig. 1. Illustration of the proposed MSGT architecture. After feature encoding of the input visual and linguistic information by pre-trained networks, the framework consists of three parts: (1) multi-modal structure embedding, (2) structure-injecting graph transformer for contextualization and reasoning, and (3) scored pooling classification. For conciseness, the graphs in the figure are simplified diagrams; the actual graphs are more complex.

In general, the aforementioned methods adopting graph-based networks have several drawbacks. Firstly, the graphs learnt by these methods view visual regions and linguistic words as identical nodes, focusing on semantic similarities but ignoring the inherent characteristics of multi-modal data. Secondly, these methods [12], [13] only utilize one layer of graph convolutional network [16], [17], and the performance declines as the number of layers increases due to over-smoothing of graph learning. These drawbacks hinder the model from learning multi-modal graphs effectively for the VCR task. To address these issues, a novel method named Multi-modal Structure-embedding Graph Transformer (MSGT) is proposed, which mainly consists of three modules: a multi-modal structure embedding (MSE) module, a structure-injecting graph transformer (SGT) module and a scored pooling (SP) classification module. Specifically, an answer-vision graph and an answer-question graph are constructed to represent and model correlations in VCR simultaneously, where additional multi-modal structure representations are initialized and embedded according to visual region distances and linguistic word orders. In this way, the proposed framework can model both intra-modal and inter-modal correlations in heterogeneous graphs with the inherent characteristics of multi-modal data considered. To utilize multi-layer networks to learn graph node representations with multi-modal structure priors, we design SGT to inject embedded structure priors into the semantic correlation matrix for the evolution of graph node features and structure representations. Different from traditional Transformer models, the inputs of the SGT layers contain heterogeneous graph structure priors instead of positional encodings for sequence learning. These graph structure priors are more appropriate to represent correlations between multi-modal data, and they can evolve layer by layer. In this way, over-smoothing of graph learning can also be relieved, since the framework needs to consider discriminative structure information of tokens in cross-modal data. At the final classification stage, scored pooling is designed to adaptively fuse graph features for a multi-layer perceptron (MLP) to make the prediction.

The major contributions of this work are summarized as follows. Firstly, a novel graph construction method via multi-modal structure embedding is proposed, which considers the inherent characteristics of multi-modal data in VCR and represents the correlation more reasonably compared with viewing multi-modal graph nodes as identical ones. Secondly, a multi-layer structure-injecting graph transformer is designed to combine additional embedded structure priors with the calculated semantic similarity for guiding graph node aggregation and graph structure evolution, enabling graphs to evolve and transmit valuable information between layers. Thirdly, a scored pooling mechanism is devised to fuse graph features adaptively and choose valuable clues for reasoning at both token-level and feature-level. The experiments verify the effectiveness of the proposed framework in comparison with the state-of-the-art methods on the VCR benchmark dataset. The rest of this article is organized as follows. The related works are discussed in Section II. The proposed MSGT framework is detailed in Section III. Section IV presents the experimental results. Finally, we conclude this work in Section V.

II. RELATED WORK

A. Visual Commonsense Reasoning

Visual commonsense reasoning (VCR) is a cognition-level reasoning task, which chooses an answer according to the question and image, and then provides a rationale justifying the choice. Various frameworks have been proposed and have promoted the research progress of the VCR task. At the earliest, Zellers et al. [8] adopted LSTM with an attention mechanism to contextualize answers with respect to the question and image context in a sequential manner for reasoning. Later, Lin et al. [10] utilized an extra detector to capture object attributes to enhance visual features in VCR. Wen et al. [9] transferred commonsense knowledge learnt from an additional linguistic knowledge base to the VCR task. To better represent the association between visual and linguistic data, several graph-learning-based methods [11], [12], [13] were proposed for VCR and obtained better performances. Wu et al. [11] developed visual neuron connectivity to build a visual graph along with the meanings of questions and answers for GCN to learn. Zhang et al. [12] constructed graphs for visual objects and sentences respectively, and then integrated them to perform cross-modal reasoning. Yu et al. [13] adopted heterogeneous graphs which did not consider intra-modal correlations for reasoning. These graph-learning-based methods cannot deploy deeper architectures and are enslaved to over-smoothing of graph learning.

In addition, a number of researchers jointly employed multiple cross-modal datasets to obtain excellent alignment between visual and linguistic information and perform VCR [18], [19], [20], [21], [22], [23]. Li et al. [18] and Su et al. [19] fed combinations of visual and linguistic data from several cross-modal datasets into a BERT model to learn cross-modal alignment, and then fine-tuned the model on the VCR dataset [8]. Chen et al. [20] designed four tasks to synergistically learn cross-modal alignment. Li et al. [21] proposed a cross-modal contrastive learning method to enhance data and learn more generalizable representations. Cho et al. [22] proposed a unified framework that learned different tasks in a single architecture with the same language modeling objective. Song et al. [23] introduced an external knowledge base into the pre-trained VL-BERT model [19] to supply association priors for VCR. These models benefit from massive data to learn cross-modal alignment. Despite achieving good performances, vast computing resources are also required. Moreover, large-scale pretraining models mainly focus on cross-modal alignment in the semantic space, but do not explore the association mechanism and reasoning factors of multi-modal data specifically and inherently existing in reasoning tasks. Therefore, this work intensively explores the association mechanism in VCR, while large-scale pretraining models are leveraged to obtain better initial visual and linguistic features. Evolutionary heterogeneous graphs with multi-modal structure priors are designed to represent and model intra-modal and inter-modal correlations in VCR simultaneously, which are learnt by stacked layers of the structure-injecting graph transformer.

B. Transformer-Based Model

Transformer [24] is a self-attention-based model, which was firstly proposed for machine translation and rapidly applied to multiple natural language processing tasks [25], [26]. Dai et al. [25] proposed Transformer-XL, which reuses cached hidden states of past segments for long-range sequence learning beyond a fixed-length context. Li et al. [26] developed the flat-lattice transformer for Chinese named entity recognition with excellent parallelization ability, which converted spans from a lattice structure to a flat structure. Later, the Transformer was modified and applied in many computer vision tasks such as image classification [27] and object detection [28]. Chen et al. [27] reshaped an image as sequential pixels, which were fed into a sequence Transformer to auto-regressively predict pixels without considering the 2D structure of the input image. Carion et al. [28] proposed the DETR model with a Transformer encoder-decoder architecture to detect objects by bipartite matching between predictions and ground-truths, and then to reason about the object relations and generate the final prediction set. Recently, several efforts have been devoted to generalizing Transformer-based models to graph-structured data [29], [30]. Zhao et al. [29] sampled ego-graphs as the input of Transformer for node classification, and proposed a proximity-enhanced attention mechanism to calculate the structural bias for Transformer. Ying et al. [30] proposed a Transformer-based architecture incorporating spatial encoding, edge encoding and centrality encoding of the graph as the structural bias, which achieved excellent results on the open graph benchmark large-scale challenge [31]. The graph Transformers mentioned above are applied on well-defined non-Euclidean graph data. In this work, the graph Transformer is investigated on the cross-modal reasoning task VCR with Euclidean data for the first time, and embedded structure priors are injected into the semantic correlation matrix to better update node features and structure representations.

III. PROPOSED MULTI-MODAL STRUCTURE-EMBEDDING GRAPH TRANSFORMER

In this section, the overview of the proposed framework is firstly introduced. Then, the multi-modal structure embedding (MSE) module, the structure-injecting graph transformer (SGT) module and the scored pooling (SP) classification module are described in detail.

A. Overview

The VCR task is to infer the answer and the rationale for the choice based on the given image and question, where all the questions, answers and rationales are in the form of mixtures of visual and linguistic words. Specifically, the VCR task consists of the following three subtasks:
1) Q → A: Given an image and a question, the model should choose the right option from four answers.
2) QA → R: Given an image, a question and the correct answer, the model should choose the right option from four rationales.
3) Q → AR: Given an image and a question, the model should choose both the right answer and the right rationale.
To do Q → AR, the models need to be trained to do Q → A and QA → R, which can be viewed as two four-way classification problems. The results of Q → AR can be obtained by combining the results of Q → A and QA → R (see the sketch at the end of this subsection).

The architecture of the proposed MSGT framework is illustrated in Fig. 1. The image regions, the question and the answer data are firstly fed into pre-trained networks to extract features. Then an answer-vision structure-embedding graph and an answer-question structure-embedding graph are constructed based on the initial features in the MSE module. The SGT module is further adopted to learn the evolution of the graphs at both the contextualization and reasoning stages. Finally, the reasoning graph is represented as a vector via the proposed SP for classification.
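As referenced above, the following minimal sketch illustrates one way to derive the Q → AR result from the two trained models' outputs, under the common assumption that a Q → AR sample counts as correct only when both the chosen answer and the chosen rationale are correct; the function name and tensor shapes are illustrative and are not taken from the released code.

```python
import torch

def q2ar_accuracy(answer_logits: torch.Tensor, rationale_logits: torch.Tensor,
                  answer_gt: torch.Tensor, rationale_gt: torch.Tensor) -> float:
    """Combine Q->A and QA->R outputs into the Q->AR accuracy.

    answer_logits:    (N, 4) scores over the four candidate answers
    rationale_logits: (N, 4) scores over the four candidate rationales
    answer_gt, rationale_gt: (N,) ground-truth option indices

    Assumption: a sample is counted as correct for Q->AR only when both
    the predicted answer and the predicted rationale are correct.
    """
    answer_pred = answer_logits.argmax(dim=-1)
    rationale_pred = rationale_logits.argmax(dim=-1)
    both_correct = (answer_pred == answer_gt) & (rationale_pred == rationale_gt)
    return both_correct.float().mean().item()
```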

B. Multi-Modal Structure Embedding

Graph is an appropriate structure to represent associated data, where the data is usually homogeneous. The multi-modal data processed in VCR has inherent characteristics, and it is more difficult to construct heterogeneous graphs appropriately. With regard to visual regions and linguistic words, image distances and word orders are important structure characteristics. Therefore, Multi-modal Structure Embedding is proposed in this work to assist with graph construction and to dynamically model intra-modal and inter-modal associations.

To obtain the initial graphs, the input image and texts are pre-processed at the object level and the word level for graph construction. Given N_o objects O = {o_i}_{i=1}^{N_o} in an image, the visual features S = {s_i}_{i=1}^{N_o}, s_i ∈ R^d are extracted by a pre-trained CNN, where d is the feature dimension. Attribute information such as color and shape is also embedded into the features to make the visual representation more discriminative. Note that the whole image is viewed as an individual object at the first position in the object sequence and fed into the CNN to obtain the overall visual features. Given a question with N_u^q words U^q = {u_i^q}_{i=1}^{N_u^q} or an answer with N_u^a words U^a = {u_i^a}_{i=1}^{N_u^a}, the linguistic features of the question T^q = {t_i^q}_{i=1}^{N_u^q}, t_i^q ∈ R^d, or those of the answer T^a = {t_i^a}_{i=1}^{N_u^a}, t_i^a ∈ R^d, are obtained by a pre-trained BERT and a subsequent bidirectional LSTM.

The VCR task is answer-oriented, therefore it is necessary to lay emphasis on the correlation between answers and image regions as well as the correlation between answers and questions. Motivated by this, an answer-vision heterogeneous graph G_av = {V_av, E_av} and an answer-question heterogeneous graph G_aq = {V_aq, E_aq} are constructed. For the answer-vision heterogeneous graph G_av, the nodes V_av derive from the image regions and the words of the answer, and the features of the nodes are initialized by S = {s_i}_{i=1}^{N_o} and T^a = {t_i^a}_{i=1}^{N_u^a}. The edges between nodes are divided into three parts: the edges between vision nodes, the edges between answer nodes, and the edges between vision nodes and answer nodes. With regard to the N_o vision nodes, image distances are used to calculate edge weights. Given two visual regions with positions (X_i, Y_i) and (X_j, Y_j) respectively in the image, the edge weight can be calculated as

$D_{ij}^{v} = l_v \cdot \sqrt{(X_i - X_j)^2 + (Y_i - Y_j)^2}$,  (1)

$E_{ij}^{v} = \Psi_v\left(D_{ij}^{v}\right)$.  (2)

Since X, Y are normalized to [0, 1], l_v is a factor to rescale the visual distance to an appropriate value for embedding, and Ψ_v is the embedding function for visual edge weight calculation. For the N_u^a answer nodes, word orders are applied to calculate the edge weights. For instance, the edge weight matrix of a sentence with three words is

$E_{3\times 3}^{a} = \Psi_a\left(\begin{bmatrix} 0 & 1 & 2 \\ 1 & 0 & 1 \\ 2 & 1 & 0 \end{bmatrix}\right)$,  (3)

where Ψ_a is the embedding function for answer edge weight calculation. Note that the edge weight between a visual node and an answer node is obtained by using the value 1 for the embedding function Ψ_a. In this way, cross-modal information can flow with strong correlation. Finally, the embedding results of the three parts mentioned above make up the structure embedding representation for the answer-vision graph E_av ∈ R^{(N_o + N_u^a) × (N_o + N_u^a)}. Similarly, for the N_u^q question nodes and N_u^a answer nodes in the answer-question heterogeneous graph, the edge weights between answer nodes or between question nodes are embedded like E_{3×3}^a above, and the edge weight between an answer node and a question node is also embedded from the value 1. The structure embedding representation for the answer-question graph is denoted as E_aq ∈ R^{(N_u^q + N_u^a) × (N_u^q + N_u^a)}. As a result, these heterogeneous graphs adopt fine-grained regions and words as nodes and preliminarily express intra-modal and inter-modal correlations with structure embedding between visual and linguistic data for further learning.
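To make (1)–(3) concrete, the sketch below assembles the raw structural quantities of a toy answer-vision graph: pairwise region distances rescaled by l_v for the vision-vision block, word-order offsets for the answer-answer block, and the constant 1 for the cross-modal block. The exact forms of the embedding functions Ψ_v and Ψ_a are not spelled out above, so they are stood in for by small learnable linear maps; treat this as an assumption-laden illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class ToyStructurePrior(nn.Module):
    """Sketch of the answer-vision structure prior E_av in Eqs. (1)-(3)."""

    def __init__(self, l_v: float = 32.0):
        super().__init__()
        self.l_v = l_v
        # Stand-ins for the embedding functions Psi_v and Psi_a (assumed scalar maps).
        self.psi_v = nn.Linear(1, 1)
        self.psi_a = nn.Linear(1, 1)

    def forward(self, centers: torch.Tensor, num_answer_words: int) -> torch.Tensor:
        """centers: (N_o, 2) normalized (X, Y) positions of the visual regions."""
        n_o, n_a = centers.size(0), num_answer_words

        # Eq. (1): rescaled Euclidean distances between region centers.
        d_v = self.l_v * torch.cdist(centers, centers)              # (N_o, N_o)

        # Word-order offsets |i - j| for the answer words, as in Eq. (3).
        idx = torch.arange(n_a, dtype=torch.float)
        d_a = (idx[:, None] - idx[None, :]).abs()                   # (N_a, N_a)

        # Cross-modal entries use the constant value 1 before embedding.
        d_cross = torch.ones(n_o, n_a)

        # Eqs. (2)-(3): apply the (assumed) embedding functions element-wise.
        e_vv = self.psi_v(d_v.unsqueeze(-1)).squeeze(-1)
        e_aa = self.psi_a(d_a.unsqueeze(-1)).squeeze(-1)
        e_va = self.psi_a(d_cross.unsqueeze(-1)).squeeze(-1)

        # Assemble the (N_o + N_a) x (N_o + N_a) structure prior.
        top = torch.cat([e_vv, e_va], dim=1)
        bottom = torch.cat([e_va.t(), e_aa], dim=1)
        return torch.cat([top, bottom], dim=0)
```

The answer-question prior E_aq follows the same pattern, with word-order offsets for both text blocks and the constant 1 for the answer-question block.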

Fig. 2. Illustration of the proposed SGT, where embedding weights are not shared across different SGT layers.

C. Structure-Injecting Graph Transformer for Contextualization and Reasoning

Instead of the traditional GCN, the structure-injecting graph transformer (SGT, illustrated in Fig. 2) is proposed in this work to learn from these graphs for the evolution of node representations and graph structures when contextualizing and reasoning.

1) Contextualization: With the answer-vision heterogeneous graph and the answer-question heterogeneous graph, SGT firstly learns to contextualize answer words with image regions and question words respectively. For the answer-vision heterogeneous graph G_av, the node representations V_av and the structure embedding priors E_av are input into the SGT layer. To model the intra-modal and inter-modal correlations, the query matrix M_query^av ∈ R^{(N_o + N_u^a) × d}, the key matrix M_key^av ∈ R^{(N_o + N_u^a) × d} and the value matrix M_value^av ∈ R^{(N_o + N_u^a) × d} are generated from the original node features by three fully-connected layers separately. The traditional Transformer just calculates the semantic correlation matrix according to the semantic similarity between M_query and M_key; although the input features contain positional encodings indicating linguistic word orders, such a manner is not suitable for cross-modal data. Therefore, the multi-modal structure embedding representation E_av, as the structure prior E_st, is added to the semantic correlation matrix to jointly generate the node aggregation matrix in the first SGT layer, which can be formulated as

$E_{st} = E_{av}$,  (4)

$E_{av}^{evo} = \mathrm{softmax}\left(E_{st} + \frac{M_{query}^{av} \cdot (M_{key}^{av})^{T}}{\sqrt{d}}\right)$.  (5)

As for the following SGT layers, the structure prior E_st is rescaled by a factor l_sgt to a normalization level and embedded based on the node aggregation matrix from the former SGT layer via the function Ψ_sgt as

$E_{st} = \Psi_{sgt}\left(l_{sgt} \cdot E_{av}^{evo}\right)$.  (6)

The node features are updated via the node aggregation matrix, which can be formulated as

$V_{av}^{update} = E_{av}^{evo} \cdot M_{value}^{av}$.  (7)

Here, the multi-head attention strategy in the original Transformer is employed for diverse embedding learning. The updated node features are further refined by a feed-forward network (FFN) block to generate the final evolutionary node representations V_av^evo as

$FFN\left(V_{av}^{update}\right) = W_2 \cdot \mathrm{relu}\left(W_1 \cdot V_{av}^{update} + b_1\right) + b_2$,  (8)

where W_1, b_1, W_2 and b_2 are learnable parameters.

The evolutionary answer-vision heterogeneous graph representation G_av^evo output by the SGT layer can be formulated as

$G_{av}^{evo} = \{V_{av}^{evo}, E_{av}^{evo}\}$.  (9)

G_av^evo is fed into the next layer of SGT, and the final graph representation G_av^final is obtained after several layers of SGT, denoted as

$G_{av}^{final} = \{V_{av}^{final}, E_{av}^{final}\}$.  (10)

Similarly, the original answer-question heterogeneous graph representation G_aq can be learnt to generate the final graph representation G_aq^final = {V_aq^final, E_aq^final}.

2) Reasoning: After contextualizing answer words with image regions and question words, the vision-guided answer word representation V_av_a^final ∈ R^{N_u^a × d} and the question-guided answer word representation V_aq_a^final ∈ R^{N_u^a × d} are selected for further reasoning, which are concatenated with the original answer word representation T^a to obtain the initial sequential reasoning feature V_a^reason ∈ R^{N_u^a × 3d} as

$V_{a}^{reason} = \mathrm{concat}\left(\left[T^{a}, V_{av\_a}^{final}, V_{aq\_a}^{final}\right]\right)$,  (11)

where concat([·]) is the concatenation operator.

The evolutionary reasoning graph G^reason = {V_a^reason, E_a^reason} is constructed based on the sequential reasoning feature V_a^reason, which models the correlation between reasoning answer words in a manner similar to MSE. Then, the sequential SGT layers are applied to the reasoning graph for evolutionary learning. As a result, the final node representation V^cls ∈ R^{N_u^a × 3d} and the final graph structure representation E^cls ∈ R^{N_u^a × N_u^a} for classification are obtained as

$\{V^{cls}, E^{cls}\} = \Phi^{(n)}\left(\{V_{a}^{reason}, E_{a}^{reason}\}\right)$,  (12)

where Φ^(n)(·) denotes applying the SGT layer n times. Since the size of the reasoning graph is relatively smaller than that of the answer-vision or answer-question heterogeneous graph, the number of SGT layers for reasoning is set as 2 to avoid over-fitting. This indicates that Eq. (12) can be re-written as {V^cls, E^cls} = Φ(Φ({V_a^reason, E_a^reason})).
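A single-head sketch of how (4)–(8) fit together is given below: the structure prior is added to the scaled dot-product similarities before the softmax (5), the resulting aggregation matrix updates the node features (7), an FFN refines them (8), and the evolved matrix is rescaled and re-embedded as the prior for the next layer (6). Multi-head attention, residual connections and normalization are omitted, and Ψ_sgt is again approximated by a learnable linear map, so this is a sketch of the mechanism rather than the exact layer used in the paper.

```python
import math
import torch
import torch.nn as nn

class SGTLayerSketch(nn.Module):
    """Single-head sketch of a structure-injecting graph transformer layer."""

    def __init__(self, d: int, l_sgt: float = 64.0):
        super().__init__()
        self.d, self.l_sgt = d, l_sgt
        self.to_q, self.to_k, self.to_v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.psi_sgt = nn.Linear(1, 1)   # assumed stand-in for the embedding function Psi_sgt
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, nodes: torch.Tensor, e_st: torch.Tensor):
        """nodes: (N, d) node features; e_st: (N, N) structure prior."""
        q, k, v = self.to_q(nodes), self.to_k(nodes), self.to_v(nodes)

        # Eq. (5): inject the structure prior into the semantic correlation matrix.
        e_evo = torch.softmax(e_st + q @ k.t() / math.sqrt(self.d), dim=-1)

        # Eq. (7): aggregate node features with the evolved aggregation matrix.
        updated = e_evo @ v

        # Eq. (8): feed-forward refinement of the updated node features.
        nodes_out = self.ffn(updated)

        # Eq. (6): rescale and re-embed the aggregation matrix as the next layer's prior.
        next_prior = self.psi_sgt((self.l_sgt * e_evo).unsqueeze(-1)).squeeze(-1)
        return nodes_out, next_prior
```

Stacking such layers and feeding (nodes_out, next_prior) back in mirrors the evolution from G_av^evo to G_av^final described above.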

D. Classification

The final node representation of the evolutionary reasoning graph is required to be pooled into a vector for the classification network, which is a two-layer MLP with the ReLU activation function, as adopted in [8], [10], [12], [13] as well. Different from the average or maximum pooling strategies in previous works, a scored pooling mechanism is proposed, which can operate adaptive pooling according to the input sequential features. Given the sequential features V^cls = {V_i^cls}_{i=1}^{N_u^a} consisting of N_u^a 3d-dimensional vectors, each vector is scored by a fully-connected layer at each position to yield the output 3d-dimensional score vector as

$Score_i = \mathrm{scored\_fc}\left(V_{i}^{cls}\right)$.  (13)

Then, the scores in each score vector at position j are normalized by the softmax function to obtain the pooling weights W^pool ∈ R^{N_u^a × 3d} as

$W_{j}^{pool} = \mathrm{softmax}\left(\left[Score_{1j}, Score_{2j}, \ldots, Score_{N_{u}^{a} j}\right]\right)$.  (14)

The classification vector f^cls ∈ R^{3d} is pooled as

$f_{j}^{cls} = \sum_{i=1}^{N_{u}^{a}} W_{ij}^{pool} V_{ij}^{cls}$.  (15)

Finally, the classification vector f^cls is fed into the classification network to get the scalar prediction score p. In this way, the framework fuses graph features adaptively and chooses valuable clues for reasoning at both token-level and feature-level.
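The scored pooling of (13)–(15) can be read as the following small module: a shared fully-connected layer scores every token for every feature channel, the scores are softmax-normalized over the token dimension per channel, and the weighted sum gives the classification vector. The code below is an illustrative reading of the equations with assumed shapes (3d = 1536, matching the implementation details given later), not the released code.

```python
import torch
import torch.nn as nn

class ScoredPooling(nn.Module):
    """Sketch of scored pooling (Eqs. (13)-(15)): adaptive token- and feature-level pooling."""

    def __init__(self, dim: int):
        super().__init__()
        self.scored_fc = nn.Linear(dim, dim)     # Eq. (13): one score per token and channel

    def forward(self, v_cls: torch.Tensor) -> torch.Tensor:
        """v_cls: (N_a, dim) sequential node features of the reasoning graph."""
        scores = self.scored_fc(v_cls)           # (N_a, dim)
        w_pool = torch.softmax(scores, dim=0)    # Eq. (14): normalize over tokens per channel
        return (w_pool * v_cls).sum(dim=0)       # Eq. (15): weighted sum -> (dim,)

# Hypothetical usage: pool 20 answer-token features (3d = 1536) into one classification vector.
f_cls = ScoredPooling(dim=1536)(torch.randn(20, 1536))
```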

With regard to the VCR task containing four answers, the four triple inputs {(I_i, Q_i, A_i)}_{i=1}^{4} are judged separately, and the predictions can be denoted as {p_i}_{i=1}^{4}. Finally, the classification probabilities are calculated by the softmax function based on the four predictions as

$\{\hat{y}_i\}_{i=1}^{4} = \mathrm{softmax}\left(\{p_i\}_{i=1}^{4}\right)$.  (16)

Typically, such a four-way classification problem is optimized by the cross-entropy loss function [8], [10], [12], [13], which can be formulated as

$Loss_{CE} = -\sum_{k=1}^{n}\sum_{i=1}^{4} y_{i}^{k} \log \hat{y}_{i}^{k}$,  (17)

where n is the number of samples, k is the sample index, and y is the ground-truth. Adopting such a loss function, the framework tends to compare the four options to select the most suitable one. However, such an optimization strategy doesn't focus on the inherent correctness of the logic to some extent. Therefore, the binary cross-entropy loss function is additionally introduced into the VCR task, which can be formulated as

$Loss_{BCE} = -\sum_{k=1}^{n}\sum_{i=1}^{4}\left[ y_{i}^{k} \log \left(\hat{y}_{i}^{k}\right)^{sig} + \left(1 - y_{i}^{k}\right) \log\left(1 - \left(\hat{y}_{i}^{k}\right)^{sig}\right)\right]$,  (18)

where (ŷ_i^k)^{sig} = sigmoid(p_i^k). Finally, the loss function used in this work is the combination of the cross-entropy loss and the binary cross-entropy loss as

$Loss = Loss_{CE} + \alpha \cdot Loss_{BCE}$,  (19)

which balances the relative and absolute correctness of the logic, and α is a balance factor.

E. Rationale Prediction

The models for the subtasks Q → A and QA → R possess identical architectures and are trained separately. As for Q → A, the original question text and answer text are input into the question encoder and the answer encoder. With regard to QA → R, the framework combines the original question text with the predicted answer text as the conditional question text for the question encoder, and the candidate rationale text serves as the candidate answer text for the answer encoder. As a result, the data format of QA → R is identical to that of Q → A, and the model for Q → A can also do QA → R. In the framework, the image encoder, question encoder and answer encoder firstly extract the visual and linguistic features. Then, the framework is trained to solve a four-way classification problem choosing the most suitable rationale based on the given visual features and the linguistic features of the conditional question.

F. Implementation

The proposed framework is implemented on PyTorch and trained with 4 Tesla V100 GPUs. At the stage of feature extraction for the original visual and linguistic data, each object in the given image is extracted as a 512-dimensional vector by the ResNet101-based backbone [32], which was proposed in TAB-VCR [10] and also used in ECMR [12]. Each word is embedded as a 768-dimensional feature by a pre-trained BERT [33], and the sequential words as a whole sentence are further input into a single-layer bidirectional LSTM model to obtain 512-dimensional vectors for each word. The SGTs for cross-modal contextualization and reasoning have 512 and 1536 hidden cells with 8 heads, respectively. The factors for the embedding functions, l_v and l_sgt, are set to 32 and 64. The dropout rate is set as 0.3 in the Bi-LSTM and 0.1 in SGT. The parameters of SGT and those of the remaining modules are trained by Adam [34] with two separate strategies. For the parameters not involved in SGT, the learning rate is initialized as 0.0002 and factored by 0.25 when the validation accuracy doesn't increase for 3 epochs; the weight decay rate is set as 0.0001. For the parameters in SGT, the learning rate is calculated by the Noam [24] method as

$\mathrm{learning\ rate} = \beta \cdot \lambda^{-0.5} \cdot \min\left(\mathrm{step\_num}^{-0.5},\ \mathrm{step\_num} \cdot \mathrm{warmup\_steps}^{-1.5}\right)$,  (20)

where β is an adjustment factor, λ is the basic learning rate, step_num is the current training step, and warmup_steps is the number of steps for the model to warm up. The learning rate curve of the Noam method is shown in Fig. 3, which contains two stages: a warm-up stage and a descend stage. Compared with the traditional learning rate decay strategy, the model parameters obtained by the Noam [24] method can be trained more softly at the warm-up stage to obtain more suitable optimization results and converge more quickly at the descend stage. The original Noam method doesn't have the adjustment factor β, which is added to keep the learning rate at a suitable value in the follow-up training period. The balance factor α for the loss function in Eq. (19) is set to 1, the batch size is 96, and the model is trained for 30 epochs with early stopping.

Fig. 3. Learning rate curve of the Noam method with β = 0.25, λ = 1024, and warmup_steps = 2000.
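For reference, the adjusted schedule of (20) with the values shown in Fig. 3 (β = 0.25, λ = 1024, warmup_steps = 2000) can be sketched as below; hooking it into training, e.g. through torch.optim.lr_scheduler.LambdaLR with a base learning rate of 1.0, is one possible wiring and is an assumption rather than the authors' exact training script.

```python
def noam_lr(step: int, beta: float = 0.25, lam: float = 1024.0, warmup_steps: int = 2000) -> float:
    """Adjusted Noam schedule of Eq. (20): linear warm-up followed by inverse-sqrt decay."""
    step = max(step, 1)                      # guard against 0 ** -0.5 at the very first step
    return beta * lam ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak value is reached at step == warmup_steps:
print(noam_lr(2000))                         # 0.25 * 1024**-0.5 * 2000**-0.5 ≈ 1.75e-4
```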

IV. EXPERIMENT

In this section, the experimental settings and results are presented. The proposed framework is evaluated on the VCR benchmark dataset [8] and compared with state-of-the-art methods. A case study and parameter analysis are also provided.

A. Dataset

Extensive experiments are carried out on the VCR benchmark dataset [8], which is composed of 290 k four-way multiple-choice QA problems derived from 110 k movie scenes. Different from VQA datasets where the answer is usually a single word, the answers and rationales in the VCR dataset are more complicated and in the form of mixtures of visual and linguistic words. The average lengths of the answers and rationales are over 7.5 words and 16 words, respectively. Following the data partition practice [8], the training set consists of 80,418 images with 212,923 questions, and the validation set is composed of 9,929 images with 26,534 questions.

B. Evaluation Metric and Baseline

The evaluation metric is classification accuracy, which is the ratio of correctly classified samples to all test samples. The competing methods are divided into three categories: (1) text-only baselines, including BERT [33], BERT (response only) [33], ESIM+ELMO [35] and LSTM+ELMO [35], which don't utilize visual information and can be used to verify the importance of visual context in VCR; (2) VQA baselines, including RevisitedVQA [36], BottomUpTopDown [37], MLB [38] and MUTAN [39], which are originally designed for VQA and modified to perform VCR (compared with these methods, the capability of the proposed framework to model the correlation between the complex response and the question or image can be evaluated); (3) VCR methods, including R2C [8], CKRM [9], TAB-VCR [10], CCN [11], HGL [13], ECMR [12], VC R-CNN [40], CL-VCR [41] and JAE [42].

The brief descriptions of the competing VCR methods are presented below. R2C adopts a bilinear attention mechanism and LSTM to associate the image with the text for reasoning. CKRM is an attention-based model that transfers external knowledge into the VCR task. TAB-VCR integrates attribute information into visual features and assigns extra tags to image groundings for the VCR task. CCN employs a connective cognition network and reorganizes visual neuron connectivity to do VCR. HGL operates heterogeneous graph learning based on the cross-modal correlation between image and text. ECMR integrates a visual graph and a linguistic graph for cross-modal reasoning. JAE presents a plug-and-play knowledge distillation enhanced framework to do VCR. VC R-CNN employs a region-based CNN to perform causal intervention for visual feature enhancement. CL-VCR adopts a curriculum-based masking approach to train the model more robustly for VCR.

TABLE I. Comparison of accuracy for the three subtasks in VCR achieved by the competing methods on the validation set of the VCR dataset.

C. Quantitative Result

The classification accuracy results achieved by the competing methods and the proposed MSGT for the three subtasks in VCR are shown in Table I. As can be observed, the text-only baselines obtain poor performances since the sentences in VCR are closely related to the visual information. By merely adopting linguistic information, the models are incapable of making correct judgements. With regard to the VQA baselines, these methods are not specifically designed to deal with the complex expressions in VCR, and there are still performance gaps between these methods and the VCR methods. For the VCR methods, the graph-based models outperform the plain-attention-based models in general, since graphs can represent and model the correlation between complex cross-modal data more effectively. Compared with previous graph-based models, the proposed MSGT framework achieves the best performance with 72.2% for the Q → A task, 73.6% for the QA → R task, and 53.3% for the Q → AR task on the validation set, respectively. The improvements obtained by the proposed MSGT indicate the importance of roundly modeling inter-modal and intra-modal correlations for VCR. In comparison to ECMR [12], where the graph structures are fixed and only one GCN layer is used, the proposed MSGT adopts evolutionary heterogeneous graphs with multi-modal structure embedding and stacks SGT layers to build a more powerful contextualization network. As a result, MSGT gains improvements of 1.5% for the Q → A task, 1.6% for the QA → R task, and 2.2% for the Q → AR task over ECMR, respectively. As for the VC R-CNN [40] method using causal intervention, the CL-VCR [41] method with robust training and the JAE [42] method adopting knowledge distillation, the proposed MSGT has superiority as well.

Fig. 4. Instance of multi-modal structure embedding for the VCR task obtained by the proposed MSGT. To be distinguishable, <person1> denotes a visual object and [person1] denotes a linguistic entity. The heatmaps on the right indicate the multi-modal structure embedding for MSE. The heatmaps in answer are for the Q → A task, and those in reason are for the QA → R task.

Fig. 5. Instances of (a) a successful case and (b) a failure case for the VCR task obtained by the proposed MSGT. The percentages in brackets are the probabilities predicted by MSGT, and the choices filled in brown are ground-truths. To be distinguishable, <person1> denotes a visual object and [person1] denotes a linguistic entity. The heatmaps for the ground-truth choices on the right indicate the graph structures output by the last SGT layer for contextualization. The heatmaps in answer are for the Q → A task, and those in reason are for the QA → R task.

D. Visualization and Analysis

An instance of multi-modal structure embedding for the VCR task obtained by the proposed MSGT is shown in Fig. 4. As can be observed, with regard to the three types of edges, the model can roughly focus on the more relevant tokens such as neighbouring words and visual regions. All the cross-modal edges in the four sub-figures on the right gain positive emphasis weights. Here, the model just needs relatively correct embedding results, indicating the direction for the model to learn. Some inappropriate values in the embedding results can be viewed as noise, which will be overcome by the SGT layers.

The instances of a successful case and a failure case for the VCR task obtained by the proposed MSGT are shown in Fig. 5. The heatmap in Fig. 5 for the SGT layer is the average of the h heatmaps generated by the multi-head mechanism, where h is the number of attention heads, which can reflect the overall correlation between tokens. In Fig. 5(a), as for the Q → A task, the heatmaps for these heterogeneous graphs succeed in catching the key clues such as <person2>&"eat", <person2>&"buy" and "eat"&"buy". With regard to the QA → R task, most tokens are correctly associated with "store", "food" and "goods". As a consequence, MSGT completes the VCR task successfully. In Fig. 5(b), MSGT mistakenly pays attention to <truck> when doing the Q → A task, since the men in Fig. 5(b) look as if they are seeing the truck. With regard to the QA → R task, MSGT is not capable of understanding the semantics, and turns to select the rationale most similar to the answer.

E. Ablation Study

To evaluate the effectiveness of the proposed MSE, SGT and SP modules in MSGT, related models are designed for the purpose of the ablation study as described below.

Base: A variant of the R2C model [8] that replaces the backbone with ResNet101, which uses bilinear attention to contextualize and a Bi-LSTM to reason. The architecture is similar to MSGT, consisting of encoding, contextualization, reasoning and classification.

Base+Transformer+SP: A framework that uses Transformer to perform contextualization and reasoning, after which scored pooling is applied to fuse the sequential features for classification. This framework is used to evaluate the effect of MSE and evolutionary learning.

Base+MSE+GCN+SP: A variant model of MSGT that replaces the SGT layers with one GCN layer. This model is utilized to evaluate the effect of SGT.

Base+MSE+Transformer+SP: A variant model of MSGT that replaces the SGT layers with Transformer layers. Specifically, the variant model firstly enhances the visual data, and then processes the multi-modal data in a sequential manner with the plain positional encodings in Transformer. This model is also utilized to evaluate the effect of SGT.

Base+MSE+SGT+Mean pooling: A variant model of MSGT that employs mean pooling, which evaluates the effect of SP.

Base+MSE+SGT+Max pooling: A variant model of MSGT that adopts max pooling, which is also used to evaluate the effect of SP.

Base+MSE+SGT+SP: The proposed MSGT model incorporating MSE, SGT and SP.

TABLE II. Ablation study on the validation set for the three subtasks in VCR.

The ablation study results for the three subtasks in VCR are shown in Table II. As observed from the results, the performance of the base model is barely satisfactory, indicating the effectiveness of the overall architecture. Compared with Base+Transformer+SP, Base+MSE+Transformer+SP obtains improvements of 0.7% for the Q → A task, 0.7% for the QA → R task, and 1.0% for the Q → AR task on the validation set respectively, which demonstrates the effectiveness of the MSE module. Compared with Base+MSE+GCN+SP and Base+MSE+Transformer+SP, the proposed MSGT adopting SGT instead of GCN and the traditional Transformer achieves better performances for the three subtasks, indicating that stacked SGT layers are more powerful to model the correlation with structure priors. In comparison with Base+MSE+SGT+Mean pooling and Base+MSE+SGT+Max pooling, the proposed MSGT with scored pooling is capable of fusing the sequential features more adaptively, and gains improvements in the three subtasks.

F. Parameter Analysis

TABLE III. Performance comparison gained by MSGT with different numbers of SGT layers for contextualization on the validation set for the three subtasks.

Layer number: The SGT for contextualization is the key component to extract effective features from heterogeneous graphs for reasoning. To study the effect of the number of SGT layers used for contextualization, the performance gained by the proposed MSGT with different numbers of SGT layers is given in Table III. As can be observed, the model with three layers achieves the best performance for the Q → A task, while the performance for the QA → R task becomes better as the layer number increases. Compared with the Q → A task, the QA → R task provides more clues for the model to learn, therefore the model for the QA → R task is less likely to be overfit. However, the Q → AR task is the most essential task in VCR, and the proposed MSGT ultimately adopts three layers of SGT for contextualization, which obtains the highest accuracy for the Q → AR task.

TABLE IV. Performance comparison achieved by MSGT with different values of the normalization factors l_v and l_sgt on the validation set for the three subtasks.

Normalization factors l_v, l_sgt: To investigate the effect of the normalization factors l_v and l_sgt for structure embedding, the performances achieved by MSGT with different values of the normalization factors are given in Table IV. As the values of the normalization factors increase, the classification results become better since the framework can more exactly distinguish different values of the original data for rescaling. When l_v > 32 and l_sgt > 64, the learning of the embedding functions becomes more difficult, and the performance slightly declines. Therefore, we set l_v = 32 and l_sgt = 64 in this work.

TABLE V. Performance comparison gained by MSGT with different α values on the validation set for the three subtasks.

Loss balance factor α: To investigate the effect of the loss balance factor α, the accuracy results achieved by the proposed MSGT with different α values are reported in Table V. When α is set to 0, the combined loss is identical to the cross-entropy loss. The accuracy becomes higher as α increases, and reaches the highest when α = 1, indicating that the binary cross-entropy loss can enhance the capability of model judgement. Compared with the QA → R task, the introduction of the binary cross-entropy loss is more effective in the Q → A task, since the QA → R task is relatively easier.

V. CONCLUSION

Visual commonsense reasoning is a challenging task since it is difficult to effectively represent and model the complex correlation between visual and linguistic data. To tackle this issue, a framework named MSGT is proposed to represent the correlation between cross-modal data via evolutionary structure-embedding graphs. MSGT is able to learn graph evolution to capture valuable information for reasoning. Specifically, the MSE module constructs the answer-vision heterogeneous graph and the answer-question heterogeneous graph to simultaneously model the intra-modal and inter-modal associations in VCR with the inherent characteristics of multi-modal data considered. The SGT module is designed to learn the evolution of node features and graph structures in the contextualization and reasoning stages. As a result, valuable clues can be roundly and adaptively selected and integrated from the graphs at both token-level and feature-level. The proposed scored pooling mechanism and the loss function combining the cross-entropy and binary cross-entropy functions for classification can further enhance the capability of distinguishing samples. The experiments conducted on the benchmark VCR dataset demonstrate the effectiveness of the proposed method.

REFERENCES

[1] J. Zhu and H. Wang, "Multi-scale conditional relationship graph network for referring relationships in images," IEEE Trans. Cogn. Develop. Syst., vol. 14, no. 2, pp. 752–760, Jun. 2022.
[2] W. Guo, Y. Zhang, J. Yang, and X. Yuan, "Re-attention for visual question answering," IEEE Trans. Image Process., vol. 30, pp. 6730–6743, 2021.
[3] H. Zhong et al., "Self-adaptive neural module transformer for visual question answering," IEEE Trans. Multimedia, vol. 23, pp. 1264–1273, 2021.
[4] H. Shi, H. Li, Q. Wu, and K. N. Ngan, "Query reconstruction network for referring expression image segmentation," IEEE Trans. Multimedia, vol. 23, pp. 995–1007, 2021.
[5] J. Liu, W. Wang, L. Wang, and M.-H. Yang, "Attribute-guided attention for referring expression generation and comprehension," IEEE Trans. Image Process., vol. 29, pp. 5244–5258, 2020.
[6] L. Yang, H. Wang, P. Tang, and Q. Li, "CaptionNet: A tailor-made recurrent neural network for generating image descriptions," IEEE Trans. Multimedia, vol. 23, pp. 835–845, 2021.
[7] H. Wang, P. Tang, Q. Li, and M. Cheng, "Emotion expression with fact transfer for video description," IEEE Trans. Multimedia, vol. 24, pp. 715–727, 2022.
[8] R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi, "From recognition to cognition: Visual commonsense reasoning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2019, pp. 6713–6724.
[9] Z. Wen and Y. Peng, "Multi-level knowledge injecting for visual commonsense reasoning," IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 3, pp. 1042–1054, Mar. 2021.
[10] J. Lin, U. Jain, and A. G. Schwing, "TAB-VCR: Tags and attributes based VCR baselines," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 15615–15628.
[11] A. Wu, L. Zhu, Y. Han, and Y. Yang, "Connective cognition network for directional visual commonsense reasoning," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 5669–5679.
[12] X. Zhang, F. Zhang, and C. Xu, "Explicit cross-modal representation learning for visual commonsense reasoning," IEEE Trans. Multimedia, vol. 24, pp. 2986–2997, 2022.
[13] W. Yu, J. Zhou, W. Yu, X. Liang, and N. Xiao, "Heterogeneous graph learning for visual commonsense reasoning," in Proc. Adv. Neural Inf. Process. Syst., 2019, pp. 2769–2779.
[14] W. Guan et al., "Bi-directional heterogeneous graph hashing towards efficient outfit recommendation," in Proc. 30th ACM Int. Conf. Multimedia, 2022, pp. 268–276.
[15] W. Guan et al., "Personalized fashion compatibility modeling via metapath-guided heterogeneous graph learning," in Proc. 45th Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2022, pp. 482–491.
[16] I. Spinelli, S. Scardapane, and A. Uncini, "Adaptive propagation graph convolutional network," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 10, pp. 4755–4760, Oct. 2021.
[17] L. Shi, Y. Zhang, J. Cheng, and H. Lu, "Skeleton-based action recognition with multi-stream adaptive graph convolutional networks," IEEE Trans. Image Process., vol. 29, pp. 9532–9545, 2020.
[18] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, "VisualBERT: A simple and performant baseline for vision and language," Aug. 2019, arXiv:1908.03557.
[19] W. Su et al., "VL-BERT: Pre-training of generic visual-linguistic representations," in Proc. Int. Conf. Learn. Representations, 2020.
[20] Y.-C. Chen et al., "UNITER: Universal image-text representation learning," in Proc. 16th Eur. Conf. Comput. Vis., 2020, pp. 104–120.
[21] W. Li et al., "UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning," in Proc. Assoc. Comput. Linguistics, 2021, pp. 2592–2607.
[22] J. Cho, J. Lei, H. Tan, and M. Bansal, "Unifying vision-and-language tasks via text generation," in Proc. Int. Conf. Mach. Learn., 2021, pp. 1931–1942.
[23] D. Song, S. Ma, Z. Sun, S. Yang, and L. Liao, "KVL-BERT: Knowledge enhanced visual-and-linguistic BERT for visual commonsense reasoning," Knowl.-Based Syst., vol. 230, Oct. 2021, Art. no. 107408.
[24] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.

[25] Z. Dai et al., "Transformer-XL: Attentive language models beyond a fixed-length context," in Proc. 57th Assoc. Comput. Linguistics, 2019, pp. 2978–2988.
[26] X. Li, H. Yan, and X. Qiu, "FLAT: Chinese NER using flat-lattice transformer," in Proc. 58th Assoc. Comput. Linguistics, 2020, pp. 6836–6842.
[27] M. Chen et al., "Generative pretraining from pixels," in Proc. Int. Conf. Mach. Learn., 2020, pp. 1691–1703.
[28] N. Carion et al., "End-to-end object detection with transformers," in Proc. 16th Eur. Conf. Comput. Vis., 2020, pp. 213–229.
[29] J. Zhao et al., "Gophormer: Ego-graph transformer for node classification," Oct. 2021, arXiv:2110.13094.
[30] C. Ying et al., "Do transformers really perform bad for graph representation?," in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 28877–28888.
[31] W. Hu et al., "OGB-LSC: A large-scale challenge for machine learning on graphs," in Proc. NeurIPS D&B, Oct. 2021.
[32] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[33] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. North Amer. Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol., 2019, pp. 4171–4186.
[34] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int. Conf. Learn. Representations, 2014.
[35] Q. Chen et al., "Enhanced LSTM for natural language inference," in Proc. Assoc. Comput. Linguistics, 2017, pp. 1657–1668.
[36] A. Jabri, A. Joulin, and L. Van Der Maaten, "Revisiting visual question answering baselines," in Proc. 14th Eur. Conf. Comput. Vis., 2016, pp. 727–739.
[37] P. Anderson et al., "Bottom-up and top-down attention for image captioning and visual question answering," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6077–6086.
[38] J.-H. Kim et al., "Hadamard product for low-rank bilinear pooling," in Proc. Int. Conf. Learn. Representations, 2017.
[39] H. Ben-Younes, R. Cadene, M. Cord, and N. Thome, "MUTAN: Multimodal tucker fusion for visual question answering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2612–2620.
[40] T. Wang, J. Huang, H. Zhang, and Q. Sun, "Visual commonsense R-CNN," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 10757–10767.
[41] K. Ye and A. Kovashka, "A case study of the shortcut effects in visual commonsense reasoning," in Proc. AAAI Conf. Artif. Intell., 2021, pp. 3181–3189.
[42] Z. Li et al., "Joint answering and explanation for visual commonsense reasoning," Feb. 2022, arXiv:2202.12626.

Jian Zhu received the B.S. degree in automation from Hunan University, Changsha, China, in 2018. He is currently working toward the Ph.D. degree with the Department of Computer Science and Technology, Tongji University, Shanghai, China. His research interests include computer vision, cross-modal retrieval, and reasoning.

Hanli Wang (Senior Member, IEEE) received the B.E. and M.E. degrees in electrical engineering from Zhejiang University, Hangzhou, China, in 2001 and 2004, respectively, and the Ph.D. degree in computer science from City University of Hong Kong, Kowloon, Hong Kong, in 2007. From 2007 to 2008, he was a Research Fellow at the Department of Computer Science, City University of Hong Kong. From 2007 to 2008, he was also a Visiting Scholar at Stanford University, Palo Alto, CA, USA. From 2008 to 2009, he was a Research Engineer at Precoad, Inc., Menlo Park, CA, USA. From 2009 to 2010, he was an Alexander von Humboldt Research Fellow at the University of Hagen, Hagen, Germany. Since 2010, he has been a full Professor with the Department of Computer Science and Technology, Tongji University, Shanghai, China. His research interests include computer vision, multimedia signal processing, and machine learning.

Bin He (Member, IEEE) received the B.S. degree in engineering machinery from Jilin University, Changchun, China, in 1996, and the Ph.D. degree in mechanical and electronic control engineering from Zhejiang University, Hangzhou, China, in 2001, where he held a postdoctoral research appointment with the State Key Lab of Fluid Power Transmission and Control from 2001 to 2003. He joined Tongji University, Shanghai, China, in 2003. He is currently a Professor with the Department of Control Science and Engineering, Tongji University, and with the Frontiers Science Center for Intelligent Autonomous Systems, Shanghai. His research interests include intelligent robots, autonomous systems, intelligent perception, and wireless networks.
