
This article has been accepted for publication in IEEE Transactions on Multimedia. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/TMM.2023.3271019.

A Transformer-based Model with Self-distillation for Multimodal Emotion Recognition in Conversations

Hui Ma, Jian Wang, Hongfei Lin, Bo Zhang, Yijia Zhang, and Bo Xu

Hui Ma, Jian Wang (Corresponding author), Hongfei Lin, Bo Zhang and Bo Xu are with the School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]). Yijia Zhang is with the School of Information Science and Technology, Dalian Maritime University, Dalian 116024, China (e-mail: [email protected]).

Abstract—Emotion recognition in conversations (ERC), the task of recognizing the emotion of each utterance in a conversation, is crucial for building empathetic machines. Existing studies focus mainly on capturing context- and speaker-sensitive dependencies on the textual modality but ignore the significance of multimodal information. Different from emotion recognition in textual conversations, capturing intra- and inter-modal interactions between utterances, learning weights between different modalities, and enhancing modal representations play important roles in multimodal ERC. In this paper, we propose a transformer-based model with self-distillation (SDT)¹ for the task. The transformer-based model captures intra- and inter-modal interactions by utilizing intra- and inter-modal transformers, and learns weights between modalities dynamically by designing a hierarchical gated fusion strategy. Furthermore, to learn more expressive modal representations, we treat soft labels of the proposed model as extra training supervision. Specifically, we introduce self-distillation to transfer knowledge of hard and soft labels from the proposed model to each modality. Experiments on the IEMOCAP and MELD datasets demonstrate that SDT outperforms previous state-of-the-art baselines.

¹The code is available at https://github.com/butterfliesss/SDT.

Index Terms—Multimodal emotion recognition in conversations, intra- and inter-modal interactions, multimodal fusion, modal representation.

[Fig. 1. A multimodal conversation example from the Friends TV series. Seven turns between Monica and Rachel, each with visual, audio, and text modalities: (1) "Wendy, we had a deal! Yeah, you promised! Wendy! Wendy! Wendy!" [anger]; (2) "Who was that?" [neutral]; (3) "Wendy bailed. I have no waitress." [sadness]; (4) "Oh... that's too bad. Bye bye." [sadness]; (5) "Twelve dollars an hour." [neutral]; (6) "Mon. I wish I could, but I've made plans to walk around." [neutral]; (7) "You know, Rachel, when you ran out of your wedding, I was there for you." [anger]]

I. INTRODUCTION

EMOTION recognition in conversations (ERC) aims to automatically recognize the emotion of each utterance in a conversation. The task has recently become an important research topic due to its wide applications in opinion mining [1], health care [2], and building empathic dialogue systems [3], etc. Unlike traditional emotion recognition (ER) on context-free sentences, modeling context- and speaker-sensitive dependencies lies at the heart of ERC.

Existing mainstream works on ERC can generally be categorized into sequence- and graph-based methods. Sequence-based methods [4]–[11] use recurrent neural networks or transformers to model long-distance contextual information in a conversation. In contrast, graph-based methods [12]–[15] design graph structures for conversations and then use graph neural networks to capture multiple dependencies. Although these methods show promising performance, most of them focus primarily on textual conversations without leveraging other modalities (i.e., the acoustic and visual modalities). According to Mehrabian [16], people express emotions in a variety of ways, including verbal, vocal, and facial expressions. Therefore, multimodal information is more useful for understanding emotions than unimodal information.

Unlike emotion recognition in textual conversations, we argue that three key characteristics are essential for multimodal ERC: intra- and inter-modal interactions between conversation utterances, different contributions of modalities, and efficient modal representations. An example is shown in Fig. 1. (1) To understand the importance of intra- and inter-modal interactions, let us focus on the single and multiple modalities, respectively. Recognizing the "anger" emotion of the 7th utterance, spoken by Monica, is difficult using only "You know, Rachel, when you ran out of your wedding, I was there for you.", but it becomes easy when looking back at the textual expression of the 6th utterance, because Rachel has made plans to walk around. Additionally, we believe there are two types of inter-modal interactions: interactions between the same and between different utterances. First, as stated above, it is hard to identify the emotion of the 7th utterance using its textual expression; however, it also becomes easy when that expression is fused with its visual and acoustic expressions, since they burst out instantaneously. Second, we know that the textual expression of the 5th utterance shows "neutral" emotion, and hence it could be possible to identify the "neutral" emotion of the 6th utterance by interacting this utterance's visual expression with the 5th utterance's textual expression.


(2) To understand the importance of the contributions of different modalities, let us focus on the 3rd utterance. The textual and acoustic expressions play more critical roles in recognizing the "sadness" emotion than the visual expression, because a smiling face usually means "joy". (3) To understand the importance of efficient modal representations, let us focus on the textual expression of the 1st utterance, which contains multiple exclamation points. If the learned representation does not include the meaning of "!", it is challenging to identify the "anger" emotion.

Therefore, it is valuable to capture intra- and inter-modal interactions between utterances, dynamically learn weights between modalities, and enhance modal representations for multimodal ERC. However, existing studies of the task have some limitations in achieving these characteristics. On the one hand, most methods have drawbacks in modeling intra- and inter-modal interactions. For example, CMN [4], ICON [5], and DialogueRNN [6] concatenate unimodal features at the input level, and thus cannot capture intra-modal interactions explicitly. While DialogueTRM [10] designs hierarchical transformer and multi-grained interaction fusion modules to explore intra- and inter-modal emotional behaviors, it ignores inter-modal interactions between different utterances. MMGCN [13] and MM-DFN [17] are graph-based fusion methods that require manually constructed graph structures to represent conversations. On the other hand, existing methods rely on the designed model to learn modal representations, but no work focuses on further improving modal representations using model-agnostic techniques for ERC.

In this work, a transformer-based model with self-distillation (SDT) is proposed to take into account the three aforementioned characteristics. First, we introduce intra- and inter-modal transformers in a modality encoder to capture intra- and inter-modal interactions, and take positional and speaker embeddings as additional inputs of these transformers to capture contextual and speaker information. Next, a hierarchical gated fusion strategy is proposed to dynamically fuse information from multiple modalities. Then, we predict emotion labels of conversation utterances based on the fused multimodal representations in an emotion classifier. We call these three components together the transformer-based model. Finally, to learn more effective modal representations, we introduce self-distillation into the proposed transformer-based model, which transfers knowledge of hard and soft labels from the model to each modality. We treat the proposed model as the teacher and design three students according to the three existing modalities. These students are trained by distilling knowledge from the teacher to learn better modal representations.

In summary, our contributions are as follows:
• We propose a transformer-based model for multimodal ERC that contains a modality encoder for capturing intra- and inter-modal interactions between conversation utterances and a hierarchical gated fusion strategy for adaptively learning weights between modalities.
• To learn more effective modal representations, we devise self-distillation that transfers knowledge of hard and soft labels from the proposed model to each modality.
• Experiments on two benchmark datasets show the superiority of our proposed model. In addition, several studies are conducted to investigate the impact of positional and speaker embeddings, intra- and inter-modal transformers, self-distillation loss functions, and the hierarchical gated fusion strategy.

The rest of this paper is organized as follows: Section II discusses the related work; Section III formalizes the task definition and describes the proposed model; Section IV gives the experimental settings; Section V presents the experimental results and discussion; finally, Section VI concludes the paper and provides directions for further work.

II. RELATED WORK

A. Emotion Recognition in Conversations

ERC has attracted widespread interest among researchers with the increase in available conversation datasets, such as IEMOCAP [18], AVEC [19], and MELD [20]. Early studies primarily used lexicon-based methods [21], [22]. Recent works have generally resorted to deep neural networks and focused on modeling context- and speaker-sensitive dependencies. We divide the existing methods into two categories, speaker-ignorant and speaker-dependent methods, according to whether they utilize speaker information.

Speaker-ignorant methods do not distinguish speakers and focus only on capturing contextual information in a conversation. HiGRU [7] contains two gated recurrent units (GRUs) to model contextual relationships between words and utterances, respectively. AGHMN [23] uses a hierarchical memory network to enhance utterance representations and introduces an attention GRU to model contextual information. MVN [11] utilizes a multi-view network to model word- and utterance-level dependencies in a conversation. In contrast, speaker-dependent methods model both context- and speaker-sensitive dependencies. DialogueRNN [6] leverages three distinct GRUs to update speaker, context, and emotional states in a conversation, respectively. DialogueGCN [12] uses a graph convolutional network to model speaker and conversation sequential information. HiTrans [8] consists of two hierarchical transformers to capture global contextual information and exploits an auxiliary task to model speaker-sensitive dependencies.

However, most of these methods are proposed for the textual modality, ignoring the effectiveness of other modalities. Due to the promising performance in the multimodal community, some approaches tend to address multimodal ERC. DialogueTRM [10] explores intra- and inter-modal emotional behaviors using hierarchical transformer and multi-grained interaction fusion modules, respectively. MMGCN [13] constructs a fully connected graph to model multimodal and long-distance contextual information, and speaker embeddings are added for encoding speaker information. MM-DFN [17] designs a graph-based dynamic fusion module to reduce redundancy and enhance complementarity between modalities. MMTr [24] preserves the integrity of main modal representations and enhances weak modal representations by using multi-head attention. UniMSE [25] performs modality fusion at syntactic and semantic levels and introduces inter-modality contrastive learning to differentiate fusion representations among samples.


This paper focuses on exploring intra- and inter-modal interactions between utterances, learning weights between modalities, and enhancing modal representations for multimodal ERC.

B. Multimodal Language Analysis

Multimodal language analysis is a rapidly growing field that includes various tasks [26], such as multimodal emotion recognition, sentiment analysis, and personality trait recognition. The key in this area is to fuse multimodal information. Early studies on multimodal fusion mainly included early fusion and late fusion. Early fusion [27], [28] integrates features of different modalities at the input level. Late fusion [29], [30] constructs distinct models for each modality and then ensembles their outputs by majority voting or weighted averaging, etc. Unfortunately, as stated in [31], these two kinds of fusion methods cannot effectively capture intra- and inter-modal interactions.

Subsequently, model fusion has become popular and various models have been proposed. TFN [32] models unimodal, bimodal, and trimodal interactions explicitly by computing the Cartesian product. LMF [31] utilizes low-rank weight tensors for multimodal fusion, which reduces the complexity of TFN. MFN [33] learns cross-modal interactions with an attention mechanism and stores information over time in a multi-view gated memory. MulT [34] utilizes cross-modal transformers to model long-range dependencies across modalities. Rahman et al. [35] fine-tuned large pre-trained transformer models for multimodal language by designing a multimodal adaptation gate (MAG). Self-MM [36] uses a unimodal label generation strategy to acquire independent unimodal supervision and then learns multimodal and unimodal tasks jointly. Yuan et al. [37] adopted transformer encoders to model intra- and inter-modal interactions between modality sequences. In order to capture intra- and inter-modal interactions between conversation utterances and meanwhile learn weights between modalities, we present a transformer-based model.

C. Knowledge Distillation

Knowledge distillation (KD) aims at transferring knowledge from a large teacher network to a small student network. The knowledge mainly includes soft labels of the last output layer (i.e., output-based knowledge) [38], features of intermediate layers (i.e., feature-based knowledge) [39], and relationships between different layers (i.e., relation-based knowledge) [40]. Depending on the learning scheme, existing KD methods are categorized into three classes: offline distillation [41], [42], online distillation [43], [44], and self-distillation [45], [46]. In offline distillation, the teacher network is first trained, and then the pre-trained teacher distills its knowledge to guide the student training. In online distillation, the teacher and student networks are updated simultaneously, and hence the training process has only one phase. Self-distillation is a special case of online distillation that teaches a single network using its own knowledge.

Recently, KD has been used for multimodal emotion recognition. For example, Albanie et al. [47] transferred visual knowledge into a speech emotion recognition model using unlabelled video data. Wang et al. [48] proposed K-injection subnetworks to distill linguistic and acoustic knowledge representing group emotions and transfer implicit knowledge into the audiovisual model for group emotion recognition. Schoneveld et al. [49] applied KD to further improve performance for facial expression recognition. Most existing models belong to offline distillation, which requires training a teacher network. In contrast, self-distillation needs no extra network except for the network itself. While self-distillation has been successfully applied in computer vision and natural language processing [50]–[52], it focuses on unimodal tasks.

In this work, we adopt the idea of self-distillation to enhance modal representations for multimodal ERC. Moreover, only output-based knowledge is used, for the following reasons: (1) Soft labels can be used as training supervision; they contain dark knowledge [38] and can provide effective regularization for the model [53]. (2) Intuitively, the features of different modalities vary widely, and hence matching fused multimodal features with unimodal features is inappropriate². (3) Our teacher and student networks lie in the same model but have different architectures, which makes it impossible to inject relationships between different layers of the teacher network into the student network [41]. Therefore, we adopt output-based knowledge rather than feature- and relation-based knowledge.

²We tried to add feature-based knowledge, but the performance drops significantly.

III. METHODOLOGY

A. Task Definition

A conversation is composed of N consecutive utterances {u_1, u_2, ..., u_N} and M speakers {s_1, s_2, ..., s_M}. Each utterance u_i is spoken by a speaker s_ϕ(u_i), where ϕ is the mapping between an utterance and its corresponding speaker's index. Moreover, u_i involves textual (t), acoustic (a), and visual (v) modalities, and their feature representations are denoted as u_i^t ∈ R^{d_t}, u_i^a ∈ R^{d_a}, and u_i^v ∈ R^{d_v}, respectively. We represent the textual, acoustic, and visual modality sequences of all utterances in the conversation as U_t = [u_1^t; u_2^t; ...; u_N^t] ∈ R^{N×d_t}, U_a = [u_1^a; u_2^a; ...; u_N^a] ∈ R^{N×d_a}, and U_v = [u_1^v; u_2^v; ...; u_N^v] ∈ R^{N×d_v}, respectively. The ERC task aims to predict the emotion label of each utterance u_i from pre-defined emotion categories.

B. Overview

Fig. 2 gives an overview of our proposed SDT. After extracting utterance-level unimodal features, the transformer-based model consists of three modules: a modality encoder module for capturing intra- and inter-modal interactions between different utterances, a hierarchical gated fusion module for adaptively learning weights between modalities, and an emotion classifier module for predicting emotion labels. Furthermore, we introduce self-distillation and devise two kinds of losses to transfer knowledge from our proposed model within each modality to learn better modal representations.
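Before walking through the components shown in Fig. 2, the inputs defined in Section III-A can be laid out concretely as tensors. The following minimal Python sketch uses random placeholders only; the feature dimensions follow the extracted features described later in Section IV-B, and all variable names are illustrative rather than taken from the released code.

    import torch

    # One conversation with N utterances, M speakers, and C emotion classes.
    # Dimensions follow the IEMOCAP features used in the paper; values are random placeholders.
    N, M, C = 12, 2, 6
    d_t, d_a, d_v = 1024, 1582, 342

    U_t = torch.randn(N, d_t)             # textual sequence  U_t in R^{N x d_t}
    U_a = torch.randn(N, d_a)             # acoustic sequence U_a in R^{N x d_a}
    U_v = torch.randn(N, d_v)             # visual sequence   U_v in R^{N x d_v}
    speakers = torch.randint(0, M, (N,))  # phi(u_i): speaker index of each utterance
    labels = torch.randint(0, C, (N,))    # emotion label of each utterance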


[Fig. 2. The overall architecture of SDT. After extracting utterance-level unimodal features (RoBERTa for text, openSMILE for audio, DenseNet for visual), it consists of four key components: Modality Encoder (a Conv1D layer followed by intra- and inter-modal transformers, augmented with positional and speaker embeddings), Hierarchical Gated Fusion, Emotion Classifier, and Self-distillation (per-modality classifiers t/a/v supervised by hard labels via L_CE and by the model's soft labels via L_KL).]

C. Modality Encoder

The modality encoder obtains modality-enhanced modality sequence representations that can learn intra- and inter-modal interactions between conversation utterances.

Temporal Convolution: To ensure that the three unimodal sequence representations lie in the same space, we feed them into a 1D convolutional layer:

$\mathbf{U}'_m = \mathrm{Conv1D}(\mathbf{U}_m, k_m) \in \mathbb{R}^{N \times d}, \quad m \in \{t, a, v\}$,  (1)

where k_m is the size of the convolutional kernel for modality m, N is the number of utterances in the conversation, and d is the common dimension.

Positional Embeddings: To utilize positional and sequential information of the utterance sequence, we introduce positional embeddings [54] to augment the convolved sequence:

$\mathrm{PE}_{(pos,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad \mathrm{PE}_{(pos,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$,  (2)

where pos is the utterance index and i is the dimension index.

Speaker Embeddings: To capture speaker information of the utterance sequence, we also design speaker embeddings to augment the convolved sequence. Speaker s_j in conversations is mapped into a vector:

$\mathbf{s}_j = \mathbf{V}_s\, o(s_j) \in \mathbb{R}^{d}, \quad j = 1, 2, \cdots, M$,  (3)

where M is the total number of speakers, V_s ∈ R^{d×M} is a trainable speaker embedding matrix, and o(s_j) ∈ R^M is the one-hot vector of speaker s_j, i.e., 1 in the jth position and 0 otherwise. Hence, the speaker embeddings corresponding to the conversation can be represented as SE = [s_ϕ(u_1); s_ϕ(u_2); ...; s_ϕ(u_N)].

Overall, we augment the convolved sequence with positional and speaker embeddings:

$\mathbf{H}_m = \mathbf{U}'_m + \mathbf{PE} + \mathbf{SE}$.  (4)

Here, H_m is the low-level positional- and speaker-aware utterance sequence representation for modality m.

Intra- and Inter-modal Transformers: We introduce intra- and inter-modal transformers to model intra- and inter-modal interactions for the utterance sequence, respectively. These transformers adopt the transformer encoder [54], which takes three inputs: queries Q ∈ R^{T_q×d_k}, keys K ∈ R^{T_k×d_k}, and values V ∈ R^{T_k×d_v}. We denote the transformer encoder as Transformer(Q, K, V).

For the intra-modal transformer, we take H_m as queries, keys, and values:

$\mathbf{H}_{m \to m} = \mathrm{Transformer}(\mathbf{H}_m, \mathbf{H}_m, \mathbf{H}_m) \in \mathbb{R}^{N \times d}$,  (5)

where m ∈ {t, a, v}. The intra-modal transformer enhances the m-modality sequence representation by itself and thus can capture intra-modal interactions within the utterance sequence.

For the inter-modal transformer, we take H_m as queries, and H_n as keys and values:

$\mathbf{H}_{n \to m} = \mathrm{Transformer}(\mathbf{H}_m, \mathbf{H}_n, \mathbf{H}_n) \in \mathbb{R}^{N \times d}$,  (6)

where m ∈ {t, a, v} and n ∈ {t, a, v} − {m}. The inter-modal transformer enables modality m to obtain information from modality n and hence can capture inter-modal interactions within the utterance sequence.

In summary, the n-enhanced m-modality sequence representation, H_{n→m}, is obtained from the modality encoder module, where n, m ∈ {t, a, v}.
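The modality encoder described above can be sketched in PyTorch as follows. This is a simplified illustration under stated assumptions, not the released implementation: the sizes (d = 1024, 8 heads, 1 layer, kernel size 1) follow Section IV-D, the inter-modal transformer is approximated by a single cross-attention layer, and all class and variable names are our own.

    import torch
    import torch.nn as nn

    class ModalityEncoder(nn.Module):
        """One modality branch: Conv1D projection (Eq. 1), positional and speaker
        embeddings (Eqs. 2-4), then intra- and inter-modal transformers (Eqs. 5-6)."""

        def __init__(self, d_in, d=1024, n_heads=8, n_layers=1, n_speakers=2, kernel=1):
            super().__init__()
            self.conv = nn.Conv1d(d_in, d, kernel_size=kernel)                 # Eq. (1)
            self.speaker_emb = nn.Embedding(n_speakers, d)                     # Eq. (3)
            layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads,
                                               dim_feedforward=1024, batch_first=True)
            self.intra = nn.TransformerEncoder(layer, num_layers=n_layers)     # Eq. (5)
            # A full inter-modal transformer stacks attention and feed-forward sublayers;
            # a single cross-attention layer stands in for it here.              # Eq. (6)
            self.inter = nn.MultiheadAttention(d, n_heads, batch_first=True)

        @staticmethod
        def positional_encoding(n, d):                                         # Eq. (2)
            pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)
            i = torch.arange(0, d, 2, dtype=torch.float32)
            angles = pos / torch.pow(10000.0, i / d)
            pe = torch.zeros(n, d)
            pe[:, 0::2] = torch.sin(angles)
            pe[:, 1::2] = torch.cos(angles)
            return pe

        def embed(self, U_m, speakers):
            # U_m: (N, d_in) utterance features of one conversation for modality m
            x = self.conv(U_m.t().unsqueeze(0)).squeeze(0).t()                 # (N, d)
            return x + self.positional_encoding(*x.shape) + self.speaker_emb(speakers)  # Eq. (4)

        def forward(self, H_m, H_n):
            H_mm = self.intra(H_m.unsqueeze(0)).squeeze(0)                     # intra-modal, Eq. (5)
            H_nm, _ = self.inter(H_m.unsqueeze(0), H_n.unsqueeze(0), H_n.unsqueeze(0))
            return H_mm, H_nm.squeeze(0)                                       # inter-modal, Eq. (6)

In SDT, one such branch is instantiated per modality, and each modality queries the other two, so that H_{t→m}, H_{a→m}, and H_{v→m} are produced for every m ∈ {t, a, v}.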


D. Hierarchical Gated Fusion

We design a hierarchical gated fusion module containing unimodal- and multimodal-level gated fusions to adaptively obtain enhanced single-modality sequence representations and to dynamically learn weights between these enhanced modality representations, respectively.

Unimodal-level Gated Fusion: We first use a gated mechanism to filter out irrelevant information in H_{n→m}:

$\mathbf{g}_{n \to m} = \sigma(\mathbf{W}_{n \to m} \cdot \mathbf{H}_{n \to m})$,  (7)
$\mathbf{H}'_{n \to m} = \mathbf{H}_{n \to m} \otimes \mathbf{g}_{n \to m}$,  (8)

where W_{n→m} ∈ R^{d×d} is a weight matrix, σ is the sigmoid function, ⊗ is the element-wise product, and g_{n→m} denotes the gate.

Then, we concatenate H'_{m→m}, H'_{n1→m}, and H'_{n2→m}, followed by a fully connected (FC) layer, to obtain the enhanced m-modality sequence representation:

$\mathbf{H}'_m = \mathbf{W}_m \cdot [\mathbf{H}'_{m \to m}; \mathbf{H}'_{n_1 \to m}; \mathbf{H}'_{n_2 \to m}] + \mathbf{b}_m \in \mathbb{R}^{N \times d}$,  (9)

where m ∈ {t, a, v}, n_1 and n_2 represent the other two modalities, and W_m ∈ R^{3d×d} and b_m ∈ R^d are trainable parameters. We set H'_m = [h'_{m1}; h'_{m2}; ...; h'_{mN}], where h'_{mi} is the enhanced m-modality representation for the utterance u_i.

Multimodal-level Gated Fusion: We also design a gated mechanism using the softmax function to dynamically learn weights between the enhanced modalities for each utterance. Specifically, the final multimodal representation of the utterance u_i is calculated by:

$[\mathbf{g}_{ti}; \mathbf{g}_{ai}; \mathbf{g}_{vi}] = \mathrm{softmax}([\mathbf{W} \cdot \mathbf{h}'_{ti}; \mathbf{W} \cdot \mathbf{h}'_{ai}; \mathbf{W} \cdot \mathbf{h}'_{vi}])$,  (10)
$\mathbf{h}'_i = \sum_{m \in \{t,a,v\}} \mathbf{h}'_{mi} \otimes \mathbf{g}_{mi}$,  (11)

where W ∈ R^{d×d} is a weight matrix, and g_{ti}, g_{ai}, and g_{vi} are the learned weights of the t, a, and v modalities for the utterance u_i, respectively.

Thus, the multimodal sequence representation of the conversation utterances is obtained and denoted as H' = [h'_1; h'_2; ...; h'_N].

E. Emotion Classifier

To calculate probabilities over C emotion categories, H' is fed into a classifier with an FC and softmax layer:

$\mathbf{E} = \mathbf{W}_e \cdot \mathbf{H}' + \mathbf{b}_e \in \mathbb{R}^{N \times C}$,  (12)
$\hat{\mathbf{Y}} = \mathrm{softmax}(\mathbf{E})$,  (13)

where W_e ∈ R^{d×C} and b_e ∈ R^C are trainable parameters. We set Ŷ = [ŷ_1; ŷ_2; ...; ŷ_N], where ŷ_i is the emotion probability vector for the utterance u_i. Finally, we choose argmax(ŷ_i) as the predicted emotion label for u_i.

Task Loss: We utilize the cross-entropy loss to estimate the quality of the emotion predictions during training:

$L_{Task} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{i,j} \log(\hat{y}_{i,j})$,  (14)

where N represents the number of utterances in the conversation and C represents the number of emotion classes. y_i and ŷ_i denote the ground-truth one-hot vector and the probability vector for the emotion of u_i, respectively.

F. Self-distillation

Soft labels containing informative dark knowledge can be used as training supervision; hence, we devise self-distillation to transfer knowledge of hard and soft labels to each modality and guide the model in learning more expressive modal representations.

We treat our proposed transformer-based model as the teacher and design three students according to the existing modalities. Specifically, a classifier consisting of an FC and softmax layer, used only during training, is set after each unimodal-level gated fusion. During training, the textual, acoustic, and visual modality encoders with their corresponding unimodal-level gated fusions and classifiers are trained as three students (i.e., student t, student a, and student v) via distilling from the teacher.

The output of student m is its predicted emotion probabilities:

$\mathbf{E}_m = \mathbf{W}'_m \cdot \mathrm{ReLU}(\mathbf{H}'_m) + \mathbf{b}'_m \in \mathbb{R}^{N \times C}$,  (15)
$\hat{\mathbf{Y}}_m = \mathrm{softmax}(\mathbf{E}_m), \quad \hat{\mathbf{Y}}^{\tau}_m = \mathrm{softmax}(\mathbf{E}_m / \tau)$,  (16)

where m ∈ {t, a, v}, and W'_m ∈ R^{d×C} and b'_m ∈ R^C are trainable parameters. τ is the temperature used to soften Ŷ_m (written as Ŷ^τ_m after softening); a higher τ produces a softer distribution over classes [38]. We set Ŷ_m = [ŷ_{m1}; ŷ_{m2}; ...; ŷ_{mN}] and Ŷ^τ_m = [ŷ^τ_{m1}; ŷ^τ_{m2}; ...; ŷ^τ_{mN}].

During training, we introduce two kinds of losses to train student m to learn a better enhanced m-modality sequence representation, where m ∈ {t, a, v}.

Cross Entropy Loss: We minimize the cross-entropy loss between the predicted probability of student m and the ground truth:

$L^m_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} y_{i,j} \log(\hat{y}_{mi,j})$,  (17)

where ŷ_{mi} is the emotion probability vector of student m for u_i. In this way, knowledge from hard labels is directly introduced to the student to learn better modal representations.

KL Divergence Loss: To make the output probability of student m approximate the output of the teacher (i.e., the soft labels), the Kullback-Leibler (KL) divergence loss between them is minimized:

$L^m_{KL} = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} \hat{y}^{\tau}_{mi,j} \log\!\left(\frac{\hat{y}^{\tau}_{mi,j}}{\hat{y}^{\tau}_{i,j}}\right)$,  (18)

where ŷ^τ_{mi} and ŷ^τ_i are the softened probability distributions of student m and the teacher, respectively. In this way, knowledge from soft labels is transferred to the student to learn better modal representations.

With both hard and soft labels, the overall loss can be expressed as:

$L = \gamma_1 L_{Task} + \gamma_2 L_{CE} + \gamma_3 L_{KL}$,  (19)
$L_{CE} = \sum_{m \in \{t,a,v\}} L^m_{CE}$,  (20)
$L_{KL} = \sum_{m \in \{t,a,v\}} L^m_{KL}$,  (21)

where γ_1, γ_2, and γ_3 are hyper-parameters that control the weights of the three kinds of losses. In our experiments, we set γ_1 = γ_2 = γ_3 = 1.
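To make the fusion step of Section III-D concrete, the following sketch implements the unimodal- and multimodal-level gates of Eqs. (7)-(11). It is an illustrative reading of the equations with our own names, not the authors' code; in particular, the concatenation order of Eq. (9) is assumed to be supplied by the caller with H_{m→m} first.

    import torch
    import torch.nn as nn

    class HierarchicalGatedFusion(nn.Module):
        def __init__(self, d=1024, modalities=("t", "a", "v")):
            super().__init__()
            # One gate per (n -> m) pair, Eq. (7), and one FC per target modality, Eq. (9).
            self.gates = nn.ModuleDict({f"{n}2{m}": nn.Linear(d, d, bias=False)
                                        for m in modalities for n in modalities})
            self.fc = nn.ModuleDict({m: nn.Linear(3 * d, d) for m in modalities})
            self.W = nn.Linear(d, d, bias=False)                                 # Eq. (10)

        def unimodal(self, H, m):
            # H: dict source modality n -> H_{n->m} of shape (N, d), with m itself listed first
            gated = [H[n] * torch.sigmoid(self.gates[f"{n}2{m}"](H[n]))          # Eqs. (7)-(8)
                     for n in H]
            return self.fc[m](torch.cat(gated, dim=-1))                          # Eq. (9): H'_m

        def forward(self, H_t, H_a, H_v):
            # H'_m: (N, d) enhanced representation of each modality from unimodal()
            scores = torch.stack([self.W(H_t), self.W(H_a), self.W(H_v)])        # (3, N, d)
            g = torch.softmax(scores, dim=0)                                     # Eq. (10)
            return g[0] * H_t + g[1] * H_a + g[2] * H_v                          # Eq. (11)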

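The training objective of Eqs. (14) and (17)-(21) can then be written as a single function. This is a minimal sketch under our own naming; in particular, detaching the teacher's soft labels so that they act as fixed supervision within a step is an assumption the paper does not spell out.

    import torch.nn.functional as F

    def sdt_loss(teacher_logits, student_logits, labels, tau=1.0, gammas=(1.0, 1.0, 1.0)):
        """teacher_logits: (N, C) logits over the fused representation, Eq. (12)
        student_logits:    dict {'t','a','v'} -> (N, C) logits of each unimodal student, Eq. (15)
        labels:            (N,) ground-truth emotion indices"""
        l_task = F.cross_entropy(teacher_logits, labels)                       # Eq. (14)

        # Softened teacher predictions serve as soft labels (assumption: detached).
        log_q = F.log_softmax(teacher_logits.detach() / tau, dim=-1)

        l_ce = l_kl = 0.0
        for logits_m in student_logits.values():
            l_ce = l_ce + F.cross_entropy(logits_m, labels)                    # Eqs. (17), (20)
            p_m = F.softmax(logits_m / tau, dim=-1)
            # KL(student^tau || teacher^tau), averaged over the N utterances, Eqs. (18), (21)
            l_kl = l_kl + F.kl_div(log_q, p_m, reduction="batchmean")
        g1, g2, g3 = gammas
        return g1 * l_task + g2 * l_ce + g3 * l_kl                             # Eq. (19)

During training, the three unimodal classifier heads and this combined loss are used together with the main classifier; at inference time only the fused branch and argmax over Ŷ are kept, matching the statement above that the student classifiers are used only during training.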

IV. EXPERIMENTAL SETTINGS

A. Datasets and Evaluations

We use the IEMOCAP [18] and MELD [20] datasets to evaluate the proposed model. The statistics of the two datasets are listed in Table I.

TABLE I
STATISTICS OF THE TWO DATASETS.

Dataset  | #Conversations (Train+Val / Test) | #Utterances (Train+Val / Test) | #Classes
IEMOCAP  | 120 / 31                          | 5810 / 1623                    | 6
MELD     | 1153 / 280                        | 11098 / 2610                   | 7

IEMOCAP: The dataset consists of two-way conversations of ten speakers, containing 153 conversations and 7,433 utterances. The dataset is divided into five sessions, where the first four sessions are used for training and the last one is used for testing. Each utterance is labeled with one of six emotions: happy, sad, neutral, angry, excited, and frustrated.

MELD: This is a multi-speaker conversation dataset collected from the Friends TV series, containing 1,433 conversations and 13,708 utterances. Each utterance is labeled with one of seven emotions: neutral, surprise, fear, sadness, joy, disgust, and anger.

Evaluation Metrics: Following previous works [6], [12], we report the overall accuracy and weighted average F1-score to measure overall performance, and also present the accuracy and F1-score on each emotion class.

B. Feature Extraction

We extract utterance-level unimodal features as follows.

Textual Modality: Following [55], we employ the RoBERTa Large model [56] to extract textual features. RoBERTa, a pre-trained model using a multi-layer transformer encoder architecture, builds on BERT and can efficiently learn textual representations. We fine-tune RoBERTa for emotion recognition from conversation transcripts and then take the [CLS] tokens' embeddings at the last layer as textual features. The dimensionality of the textual feature representation is 1024.

Acoustic Modality: Following [13], we use openSMILE [57] for acoustic feature extraction. openSMILE, a flexible feature extraction toolkit for signal processing, provides a scriptable console application to configure modular feature extraction components. After using the openSMILE toolkit, an FC layer reduces the dimensionality of the acoustic feature representation to 1582 for IEMOCAP and 300 for MELD.

Visual Modality: Following [13], we use DenseNet [58] pre-trained on the Facial Expression Recognition Plus dataset for visual feature extraction. DenseNet, an effective CNN architecture, consists of multiple dense blocks, each of which contains multiple layers. The output size of DenseNet is set to 342; that is, the dimensionality of the visual feature representation is 342.

C. Baselines

We compare SDT with the following baseline models.

CMN [4]: It uses two GRUs and memory networks to model contextual information for both speakers, but it is only available for dyadic conversations.

ICON [5]: It is an extension of CMN that captures inter-speaker emotional influences using another GRU. Similar to CMN, the model is applied to dyadic conversations.

DialogueRNN [6]: It adopts three distinct GRUs to track the speaker, context, and emotional states in conversations, respectively.

The above models concatenate textual, acoustic, and visual features to obtain multimodal utterance representations.

MMGCN [13]: It constructs a conversation graph based on all three modalities and designs a multimodal fused graph convolutional network to model contextual dependencies across multiple modalities.

DialogueTRM [10]: It uses a hierarchical transformer to manage the differentiated context preference within each modality and designs a multi-grained interactive fusion for learning the different contributions across modalities for an utterance.

MM-DFN [17]: It designs a graph-based dynamic fusion module to fuse multimodal context features; this module could reduce redundancy and enhance complementarity between modalities.

MMTr [24]: It uses distinct bidirectional long short-term memory networks (Bi-LSTMs) to learn contextual representations at the speaker's self-context level and the contextual context level, and designs a cross-modal fusion module to enhance weak modal representations.

UniMSE [25]: It uses T5 to fuse acoustic and visual modal features with multi-level textual features, and performs inter-modality contrastive learning to obtain discriminative multimodal representations.

For a fair comparison, we re-run all baselines, except MMTr and UniMSE, whose source codes are not released³.

³We carefully implemented DialogueTRM to explore its performance using our extracted features, since its source code is not available; MMTr uses basically the same feature extractors as ours, and therefore we did not implement it; UniMSE uses T5 to learn contextual information on textual sequences and embeds multimodal fusion layers into T5, and hence our extracted features cannot be used for UniMSE and we also did not implement it.

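For reference, the two reported metrics (overall accuracy and weighted F1) can be computed with scikit-learn; a minimal sketch in which y_true and y_pred are hypothetical lists of emotion indices over the test utterances:

    from sklearn.metrics import accuracy_score, f1_score

    # Hypothetical predictions over a handful of test utterances (class indices).
    y_true = [0, 2, 2, 5, 1, 3]
    y_pred = [0, 2, 1, 5, 1, 3]

    acc = accuracy_score(y_true, y_pred)                  # overall accuracy (ACC)
    w_f1 = f1_score(y_true, y_pred, average="weighted")   # weighted average F1 (w-F1)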

TABLE II
RESULTS ON THE IEMOCAP DATASET; "*": BASELINES ARE RE-IMPLEMENTED USING OUR EXTRACTED FEATURES; BOLD FONT DENOTES THE BEST PERFORMANCE.
(Each emotion column reports ACC / F1; the last two columns are overall accuracy and weighted F1.)

Models                | happy         | sad           | neutral       | angry         | excited       | frustrated    | ACC   | w-F1
CMN                   | 24.31 / 30.30 | 56.33 / 62.02 | 52.34 / 52.41 | 61.76 / 60.17 | 56.19 / 60.76 | 72.44 / 61.27 | 56.87 | 56.33
ICON                  | 25.00 / 31.30 | 67.35 / 73.17 | 55.99 / 58.50 | 69.41 / 66.29 | 70.90 / 67.09 | 71.92 / 65.08 | 62.85 | 62.25
DialogueRNN           | 25.00 / 34.95 | 82.86 / 84.58 | 54.43 / 57.66 | 61.76 / 64.42 | 90.97 / 76.30 | 62.20 / 59.55 | 65.43 | 64.29
MMGCN                 | 32.64 / 39.66 | 72.65 / 76.89 | 65.10 / 62.81 | 73.53 / 71.43 | 77.93 / 75.40 | 65.09 / 63.43 | 66.61 | 66.25
DialogueTRM           | 61.11 / 57.89 | 84.90 / 81.25 | 69.27 / 68.56 | 76.47 / 65.99 | 76.25 / 76.13 | 50.39 / 58.09 | 68.52 | 68.20
MM-DFN                | 44.44 / 44.44 | 77.55 / 80.00 | 71.35 / 66.99 | 75.88 / 70.88 | 74.25 / 76.42 | 58.27 / 61.67 | 67.84 | 67.85
MMTr                  | -             | -             | -             | -             | -             | -             | 72.27 | 71.91
UniMSE                | -             | -             | -             | -             | -             | -             | 70.56 | 70.66
DialogueRNN*          | 57.64 / 57.64 | 77.96 / 80.25 | 75.52 / 70.56 | 68.24 / 64.99 | 73.91 / 75.95 | 59.06 / 62.41 | 69.38 | 69.37
MMGCN*                | 50.00 / 56.25 | 78.78 / 81.43 | 71.35 / 67.57 | 68.24 / 66.29 | 75.92 / 76.82 | 65.09 / 64.92 | 69.62 | 69.61
DialogueTRM*          | 72.22 / 62.84 | 85.71 / 83.33 | 69.27 / 68.12 | 79.41 / 66.67 | 67.22 / 75.00 | 57.22 / 63.28 | 69.87 | 69.93
MM-DFN*               | 57.64 / 52.87 | 84.49 / 86.07 | 76.04 / 71.66 | 70.59 / 65.04 | 73.24 / 75.26 | 55.91 / 62.19 | 69.87 | 69.91
SDT (Ours)            | 72.71 / 66.19 | 79.51 / 81.84 | 76.33 / 74.62 | 71.88 / 69.73 | 76.79 / 80.17 | 67.14 / 68.68 | 73.95 | 74.08
 w/o self-distillation | 71.53 / 58.52 | 79.59 / 79.43 | 69.27 / 70.65 | 69.41 / 67.05 | 69.23 / 77.09 | 67.98 / 68.07 | 70.73 | 71.10

In addition, we re-implement DialogueRNN, MMGCN, DialogueTRM, and MM-DFN with our extracted features, namely DialogueRNN*, MMGCN*, DialogueTRM*, and MM-DFN*. We use the same data splits to implement all models.

D. Implementation Details

We implement the proposed model using PyTorch⁴ and use Adam [59] as the optimizer with an initial learning rate of 1.0e−4 for IEMOCAP and 5.0e−6 for MELD. The batch size is 16 for IEMOCAP and 8 for MELD, and the temperature τ is set to 1 and 8 for the two datasets, respectively. For the 1D convolutional layers, the numbers of input channels are set to 1024, 1582, and 342 for the textual, acoustic, and visual modalities, respectively (i.e., their corresponding feature dimensions) on IEMOCAP. On MELD, these parameters are set to 1024, 300, and 342, respectively. In addition, the number of output channels and the kernel size are set to 1024 and 1, respectively, for all three modalities on both datasets. For the transformer encoders, the hidden size, number of attention heads, feed-forward size, and number of layers are set to 1024, 8, 1024, and 1, respectively. To prevent overfitting, we set the L2 weight decay to 1.0e−5 and employ dropout with a rate of 0.5. All results are averages of 10 runs.

⁴https://pytorch.org/

V. RESULTS AND DISCUSSION

A. Overall Results

Table II and Table III present the performance of the baselines and SDT on the IEMOCAP and MELD datasets, respectively. On the IEMOCAP dataset, SDT performs better than all baselines and outperforms MMTr by 1.68% and 2.17% in terms of overall accuracy and weighted F1-score, respectively. In addition, SDT achieves a significant improvement on most emotion classes in terms of F1-score. On the MELD dataset, SDT achieves the best performance compared to all baselines in terms of overall accuracy and weighted F1-score, outperforming UniMSE by 2.46% and 1.09%, respectively. Similar to IEMOCAP, SDT performs better on most emotion classes in terms of F1-score.

Overall, the above results indicate the effectiveness of SDT. Furthermore, we have several similar findings on the two datasets: (1) DialogueTRM has a superior performance compared to DialogueRNN, MMGCN, and MM-DFN, which use TextCNN [60] to extract textual features. This is because the textual modality plays a more important role for ERC [13], and DialogueTRM extracts textual features using BERT [61], which is more powerful than TextCNN. (2) The baselines gain further improvement and achieve comparable results when using our extracted utterance features. The results show that our feature extractor is more effective and that sequence- and graph-based baselines can achieve similar performance using our extracted features. (3) Even without self-distillation, our proposed model is still comparable to strong baselines, demonstrating the power of the proposed transformer-based model.

B. Ablation Study

We carry out ablation experiments on IEMOCAP and MELD. Table IV reports the results under different ablation settings.

Ablation on the Transformer-based Model: Positional embeddings, speaker embeddings, intra-modal transformers, and inter-modal transformers are four crucial components of our proposed transformer-based model. We remove one component at a time to evaluate its effectiveness. From Table IV, we conclude that: (1) All components are useful, because removing any one of them leads to performance degradation. (2) Positional and speaker embeddings have considerable effects on the two datasets, which means capturing sequential and speaker information is valuable. (3) Inter-modal transformers are more important than intra-modal transformers on the two datasets. This indicates that inter-modal interactions between conversation utterances could provide more helpful information.


TABLE III
RESULTS ON THE MELD DATASET.
(Each emotion column reports ACC / F1; the last two columns are overall accuracy and weighted F1.)

Models                | neutral       | surprise      | fear          | sadness       | joy           | disgust       | anger         | ACC   | w-F1
DialogueRNN           | 82.17 / 76.56 | 46.62 / 47.64 | 0.00 / 0.00   | 21.15 / 24.65 | 49.50 / 51.49 | 0.00 / 0.00   | 48.41 / 46.01 | 60.27 | 57.95
MMGCN                 | 84.32 / 76.96 | 47.33 / 49.63 | 2.00 / 3.64   | 14.90 / 20.39 | 56.97 / 53.76 | 1.47 / 2.82   | 42.61 / 45.23 | 61.34 | 58.41
DialogueTRM           | 83.20 / 79.41 | 56.94 / 55.27 | 12.00 / 17.39 | 27.88 / 36.48 | 60.45 / 60.30 | 16.18 / 20.18 | 51.01 / 49.79 | 65.10 | 63.80
MM-DFN                | 79.06 / 75.80 | 53.02 / 50.42 | 0.00 / 0.00   | 17.79 / 23.72 | 59.20 / 55.48 | 0.00 / 0.00   | 50.43 / 48.27 | 60.96 | 58.72
MMTr                  | -             | -             | -             | -             | -             | -             | -             | 64.64 | 64.41
UniMSE                | -             | -             | -             | -             | -             | -             | -             | 65.09 | 65.51
DialogueRNN*          | 85.11 / 79.60 | 54.09 / 56.72 | 10.00 / 12.66 | 29.81 / 38.63 | 62.94 / 63.81 | 22.06 / 27.27 | 53.62 / 53.24 | 66.70 | 65.31
MMGCN*                | 81.53 / 79.20 | 58.36 / 57.75 | 8.00 / 13.79  | 31.73 / 39.40 | 69.90 / 63.43 | 20.59 / 24.56 | 52.17 / 53.49 | 66.40 | 65.21
DialogueTRM*          | 83.44 / 79.54 | 54.45 / 57.09 | 24.00 / 27.91 | 33.17 / 40.95 | 60.45 / 62.79 | 22.06 / 28.04 | 58.26 / 53.96 | 66.70 | 65.76
MM-DFN*               | 83.52 / 79.65 | 63.35 / 58.17 | 32.00 / 26.67 | 26.44 / 35.71 | 63.68 / 64.89 | 19.12 / 24.76 | 49.28 / 52.15 | 66.55 | 65.48
SDT (Ours)            | 83.22 / 80.19 | 61.28 / 59.07 | 13.80 / 17.88 | 34.90 / 43.69 | 63.24 / 64.29 | 22.65 / 28.78 | 56.93 / 54.33 | 67.55 | 66.60
 w/o self-distillation | 82.01 / 80.00 | 57.65 / 57.96 | 20.00 / 23.81 | 32.21 / 41.61 | 65.17 / 64.22 | 25.00 / 27.42 | 57.97 / 54.05 | 66.97 | 66.26

TABLE IV
RESULTS OF ABLATION STUDIES ON THE TWO DATASETS.

                                   IEMOCAP            MELD
                                   ACC     w-F1       ACC     w-F1
SDT                                73.95   74.08      67.55   66.60
Transformer-based model
  w/o positional embeddings        72.27   72.39      66.86   66.20
  w/o speaker embeddings           71.84   72.03      67.13   66.18
  w/o intra-modal transformers     73.38   73.36      67.13   66.21
  w/o inter-modal transformers     72.09   72.26      66.97   65.55
Self-distillation
  w/o L_CE                         73.07   73.32      67.39   66.37
  w/o L_KL                         72.95   73.03      67.09   66.33
Modality
  Text                             66.42   66.58      66.82   65.52
  Audio                            59.77   59.34      48.12   40.81
  Visual                           41.47   42.71      48.05   32.01
  Text + Audio                     72.52   72.75      67.05   66.24
  Text + Visual                    69.01   69.07      67.20   66.18
  Audio + Visual                   62.05   62.26      47.24   40.21

Ablation on Self-distillation Loss Functions: There are two kinds of losses (i.e., L_CE and L_KL) for self-distillation. To verify the importance of these losses, we remove one loss at a time. Table IV shows that L_CE and L_KL are complementary and that our model performs best when all losses are included. The result demonstrates that transferring knowledge of both hard and soft labels from the proposed transformer-based model to each modality can further boost the model performance.

Effect of Different Modalities: To show the effect of different modalities, we remove one or two modalities at a time. From Table IV, we observe that: (1) For the unimodal results, the textual modality has far better performance than the other two modalities, indicating that the textual features play a leading role in ERC. This finding is consistent with previous works [10], [13], [17]. (2) Any bimodal result is better than its corresponding unimodal results. Moreover, fusing the textual modality with the acoustic or visual modality performs better than fusing the acoustic and visual modalities, due to the importance of textual features. (3) Using all three modalities gives the best performance. This result validates that emotion is affected by verbal, vocal, and visual expressions, and that integrating multimodal information is essential for ERC.

[Fig. 3. Performance of different fusion methods (Uni-Cat-Transformer, Add, Concatenation, Ours) on the two datasets: (a) overall accuracy; (b) weighted F1-score. Bold font means that the improvement over all baselines is statistically significant (t-test with p < 0.05).]

Effect of Different Fusion Strategies: To investigate the effect of our proposed hierarchical gated fusion module, we compare it with two typical information fusion strategies: (1) Add: representations are fused via element-wise addition. (2) Concatenation: representations are directly concatenated and followed by an FC layer. Add treats all representations equally, while Concatenation could implicitly choose the important information thanks to the FC layer. For a fair comparison, we replace the hierarchical gated fusion module of our model with hierarchical add and concatenation operations to implement the Add and Concatenation fusion strategies, respectively. In addition, we also compare SDT with a general transformer-based fusion method (i.e., unimodal features are concatenated and then fed into a transformer encoder) that we call Uni-Cat-Transformer.

As shown in Fig. 3, our proposed hierarchical gated fusion strategy significantly outperforms the other fusion strategies. The result indicates that directly fusing representations with Add and Concatenation is sub-optimal. Our proposed hierarchical gated fusion module first filters out irrelevant information at the unimodal level and then dynamically learns weights between different modalities at the multimodal level, which fuses multimodal representations more effectively.

In addition, our model achieves a significant performance improvement over Uni-Cat-Transformer, which demonstrates the effectiveness of the proposed SDT in multimodal fusion.


Interestingly, Uni-Cat-Transformer has poorer performance than Add and Concatenation on IEMOCAP; however, it shows an acceptable performance on MELD. This may be because interactions between modalities are not as complex on MELD as on IEMOCAP, and hence modeling modal interactions with multiple transformer encoders could generate some noise on MELD, which lets Uni-Cat-Transformer gain comparable performance with Add and Concatenation. In contrast, SDT has superior performance to all baselines, as it contains a hierarchical gated fusion module to filter out noisy information, which further illustrates the usefulness of our proposed hierarchical gated fusion strategy.

C. Trends of Losses

During training, we illustrate the trends of all types of losses on the IEMOCAP dataset to better understand how these losses work; Fig. 4 displays the results.

[Fig. 4. Trends of all losses during training on the IEMOCAP dataset: (a) L_CE & L_Task; (b) L_KL.]

From Fig. 4(a), we find that L_Task, L^t_CE, L^a_CE, and L^v_CE keep descending throughout the training process. From Fig. 4(b), we can see that L^t_KL and L^a_KL also have decreasing trends except for fluctuations at the beginning, and L^v_KL goes down during early training except for a fluctuation, then goes up and reaches stability. Therefore, all of the losses converge. This shows that all students can learn knowledge from hard and soft labels to improve the model performance. Besides, we find that the losses of student v (i.e., L^v_CE and L^v_KL) are larger than those of the other two students. This may be due to an unsuitable learning rate for student v. Hence, we would like to adaptively modify learning rates between different modalities to effectively optimize the proposed model in the future.

D. Multimodal Representation Visualization

We extract multimodal representations for each utterance on IEMOCAP from our proposed transformer-based model without and with self-distillation. Besides, the pre-extracted unimodal representations are concatenated to produce the original multimodal representations. Then, these multimodal representations are projected into two dimensions via the t-SNE algorithm [62].

Fig. 5 illustrates the visualization results with different emotion categories. Compared with the original multimodal representations, the representations learned by the proposed transformer-based model become more clustered even without self-distillation. However, without self-distillation, multimodal representations of similar emotions (i.e., "happy" and "excited", "angry" and "frustrated") are difficult to separate; furthermore, representations of the "neutral" emotion are intermingled with other emotions. By comparing Fig. 5(b) and Fig. 5(c), we observe that our model with self-distillation yields a better separation and that representations of different emotions are less mixed together. Therefore, introducing self-distillation training could learn more effective multimodal representations.

On the other hand, we also show the visualization results with different genders of speakers in Fig. 6. Fig. 6(b) and Fig. 6(c) form two large clusters respectively corresponding to the gender of the speaker. This interesting finding indicates that, with or without self-distillation, our model can distinguish the gender of the speaker, which may also be helpful for ERC.

E. Case Study

To demonstrate the efficacy of SDT, we present a case study. Fig. 7 shows a conversation that comes from MELD. SDT identifies the emotions of all utterances successfully, while DialogueRNN* and MMGCN* predict the 3rd utterance as "surprise" incorrectly, probably because a question mark "?" generally expresses "surprise". This could indicate the more powerful multimodal fusion capability of our proposed model. On the other hand, using only the textual modality, our model recognizes the 4th utterance as "neutral" wrongly. To explore the reason behind this, we visualize the multi-head attention weights of SDT (only text) and SDT for the 4th utterance, respectively. For SDT, by outputting the learned weights we find that the weights of the textual features are obviously larger than those of the acoustic and visual features for the 4th utterance. Therefore, we visualize only the attention weights of the transformers that form the enhanced textual modality representation in Fig. 8; other visualization results can be found in the appendix.

As can be seen from Fig. 8(a), the 4th utterance depends heavily on the 3rd and 5th utterances when using only the textual modality. The 3rd utterance, which expresses "neutral" emotion, may be more important due to a larger number of darkest attention heads; hence the 4th utterance is identified as having the same emotion as the 3rd utterance, i.e., "neutral". The utterance can be correctly recognized as "disgust" by SDT for the following reasons: (1) According to Fig. 8(b), the text of the 4th utterance is influenced the most by the text of the 5th utterance, whose emotion is "disgust". (2) From Fig. 8(c) and Fig. 8(d), we observe that the acoustic and visual expressions of the 2nd, 4th and 5th utterances are more valuable for the 4th utterance's textual expression, and the 4th and 5th utterances express the "disgust" emotion. Overall, the results show that interactions between modalities are helpful in identifying emotion from different perspectives, and therefore it is necessary to use multimodal information. In addition, comparing Fig. 8(a) and Fig. 8(b), SDT learns better to correlate the 4th utterance with the 5th utterance. This finding illustrates that introducing self-distillation can learn more appropriate attention weights.


[Fig. 5. t-SNE visualization of the multimodal representations with different emotion categories on the IEMOCAP dataset: (a) origin representations; (b) without self-distillation; (c) with self-distillation.]

[Fig. 6. t-SNE visualization of the multimodal representations with different genders of speakers on the IEMOCAP dataset: (a) origin representations; (b) without self-distillation; (c) with self-distillation.]

Fig. 7. An example of emotion recognition results in a conversation from the MELD dataset. (Each turn also has visual and audio expressions, which are not reproduced here.)

Turn | Speaker | Text                                                      | DialogueRNN* | MMGCN*   | SDT (only text) | SDT      | Ground Truth
1    | Joey    | "Oh my god, you're back!"                                 | surprise     | surprise | surprise        | surprise | surprise
2    | Phoebe  | "Ohh, let me see it! Let me see your hand!"               | surprise     | surprise | surprise        | surprise | surprise
3    | Monica  | "Why do you want to see my hand?"                         | surprise     | surprise | neutral         | neutral  | neutral
4    | Phoebe  | "I wanna see what's in your hand. I wanna see the trash." | disgust      | disgust  | neutral         | disgust  | disgust
5    | Phoebe  | "Eww! Oh, it's all dirty. You should throw this out."     | disgust      | disgust  | disgust         | disgust  | disgust

[Fig. 8. Multi-head attention visualization for the 4th utterance in Fig. 7. There are 8 attention heads and different colors represent different heads. The darker the color, the more important for the 4th utterance. (a) Intra-modal transformer (t → t) of SDT (only text); (b) intra-modal transformer (t → t) of SDT; (c) inter-modal transformer (a → t) of SDT; (d) inter-modal transformer (v → t) of SDT.]

F. Error Analysis

Although the proposed SDT achieves strong performance, it still fails to detect some emotions. We analyze the confusion matrices of the test set on the two datasets. From Fig. 9, we see that: (1) SDT misclassifies similar emotions, like "happy" and "excited", "angry" and "frustrated" on IEMOCAP, and "surprise" and "anger" on MELD. (2) SDT also tends to misclassify other emotions as "neutral" on MELD because "neutral" is the majority class. (3) It is difficult to correctly detect the "fear" and "disgust" emotions on MELD because these two emotions are minority classes. Thus, it is challenging to recognize similar emotions and emotions with unbalanced data.

Besides, we also investigate SDT performance under emotional shift (i.e., two consecutive utterances spoken by the same speaker have different emotions), as shown in Table V.

TABLE V
TEST ACCURACY OF SDT ON UTTERANCES WITH AND WITHOUT EMOTIONAL SHIFT.

Dataset  | Emotional Shift (#Utterances / ACC) | w/o Emotional Shift (#Utterances / ACC)
IEMOCAP  | 410 / 54.88                         | 1151 / 80.71
MELD     | 1003 / 61.62                        | 861 / 73.05
Authorized licensed use limited to: PSG COLLEGE OF TECHNOLOGY. Downloaded on July 18,2024 at 09:38:26 UTC from IEEE Xplore. Restrictions apply.
© 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
This article has been accepted for publication in IEEE Transactions on Multimedia. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TMM.2023.3271019

11
3H`LY!  3H`LY! 

   

   

   

   

   

(a) Intra-modal transformer (b) Inter-modal transformer


3H`LY!  (a → a) of SDT3H`LY!  (t → a) of SDT

   

   

(a) IEMOCAP (b) MELD    


Fig. 9. The confusion matrices of the test set on the two datasets. The rows    
and columns represent true and predicted labels, respectively.
   

emotional shift than that without it5 , which is consistent (c) Inter-modal transformer (d) Intra-modal transformer
3H`LY!  with
previous works. The emotional shift in conversations is a (v → a) of SDT3H`LY!  (v → v) of SDT
complex phenomenon caused by multiple latent variables, e.g.,
   
the speaker’s personality and intent; however, SDT and most
existing models do not consider these factors, which may result    
in poor performance. Further improvement on the case needs   

to be explored.
   
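The with/without emotional-shift breakdown in Table V can be computed by splitting the test utterances according to each speaker's previous gold emotion; the following is a rough sketch under the assumption that utterances are grouped per conversation and carry speaker, gold, and predicted fields (all names here are hypothetical, and the first utterance of each speaker is not counted).

# Rough sketch of the emotional-shift breakdown (hypothetical data layout).
from collections import defaultdict

def shift_accuracy(conversations):
    """conversations: list of dialogues; each dialogue is a list of dicts with
    'speaker', 'gold', and 'pred' keys, in chronological order."""
    stats = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for dialogue in conversations:
        last_gold = {}  # most recent gold emotion per speaker
        for utt in dialogue:
            spk = utt["speaker"]
            if spk in last_gold:  # defined only from a speaker's second utterance on
                key = "shift" if utt["gold"] != last_gold[spk] else "no_shift"
                stats[key][0] += int(utt["pred"] == utt["gold"])
                stats[key][1] += 1
            last_gold[spk] = utt["gold"]
    return {k: 100.0 * correct / total for k, (correct, total) in stats.items() if total > 0}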

VI. CONCLUSION

In this paper, we propose SDT, a transformer-based model with self-distillation for multimodal ERC. We use intra- and inter-modal transformers to model intra- and inter-modal interactions between conversation utterances. To dynamically learn weights between different modalities, we design a hierarchical gated fusion strategy. Positional and speaker embeddings are also leveraged as additional inputs to capture contextual and speaker information. In addition, we devise self-distillation during training to transfer knowledge of hard and soft labels within the model to learn better modal representations, which could further improve performance. We conduct experiments on two benchmark datasets and the results demonstrate the effectiveness and superiority of SDT.
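To make the hard/soft-label knowledge transfer concrete, the snippet below gives a minimal sketch of a generic self-distillation objective of this kind; it is not SDT's exact loss, and the temperature and function names are assumptions.

# Generic self-distillation losses (a sketch, not SDT's exact formulation).
import torch.nn.functional as F

def distillation_losses(branch_logits, fused_logits, labels, temperature=2.0):
    # Hard-label supervision: standard cross-entropy against the gold emotions.
    hard_loss = F.cross_entropy(branch_logits, labels)
    # Soft-label supervision: match the branch to the fused (teacher) distribution.
    soft_targets = F.softmax(fused_logits.detach() / temperature, dim=-1)
    log_probs = F.log_softmax(branch_logits / temperature, dim=-1)
    soft_loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
    return hard_loss, soft_loss

In such a setup, each single-modality branch would typically minimize a weighted sum of the two terms together with the main multimodal classification loss.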
Through error analysis, we find that distinguishing similar emotions, detecting emotions with unbalanced data, and handling emotional shift are key challenges for ERC that are worth further exploration in future work. Furthermore, transformer-based fusion methods incur high computational costs because the self-attention mechanism of the transformer has a complexity of O(N²) with respect to the sequence length N. To alleviate this issue, Ding et al. [63] proposed sparse fusion for multimodal transformers. Similarly, we plan to design a novel multimodal fusion method for transformers to reduce computational costs in the future.
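As a rough illustration of this quadratic growth (illustrative numbers only, not measurements from this work), the attention score matrix alone already scales with N²:

# Back-of-the-envelope illustration of O(N^2) self-attention cost (illustrative only).
hidden = 256  # assumed model dimension
for n in (50, 100, 200, 400):          # conversation lengths (number of utterances)
    entries = n * n                    # size of the N x N attention score matrix
    mults = 2 * n * n * hidden         # rough multiplies for Q K^T plus softmax(.) V
    print(f"N={n:4d}  score-matrix entries={entries:8d}  approx multiplies={mults:,}")

Doubling the conversation length roughly quadruples both quantities, which is what motivates sparse or low-rank alternatives.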
ACKNOWLEDGMENTS

This work is partially supported by the Natural Science Foundation of China (No. 62006034), the Natural Science Foundation of Liaoning Province (No. 2021-BS-067), and the Fundamental Research Funds for the Central Universities (No. DUT21RC(3)015).

APPENDIX
ATTENTION VISUALIZATION

Multi-head attention weights of the transformers in our SDT that form enhanced acoustic and visual modality representations are visualized in Fig. 10.

Fig. 10. Multi-head attention visualization for the 4th utterance in Fig. 7. Panels: (a) intra-modal transformer (a → a) of SDT; (b) inter-modal transformer (t → a) of SDT; (c) inter-modal transformer (v → a) of SDT; (d) intra-modal transformer (v → v) of SDT; (e) inter-modal transformer (t → v) of SDT; (f) inter-modal transformer (a → v) of SDT.
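The attention maps in Figs. 8 and 10 can be reproduced from the transformers' attention tensors; the snippet below is a minimal plotting sketch, assuming the per-head attention weights for one query utterance have already been extracted into an array (the tensor layout and names are assumptions, not SDT's actual interface).

# Minimal multi-head attention heatmap sketch (assumed tensor layout).
import matplotlib.pyplot as plt

def plot_heads(attn, query_index, title):
    # attn: (num_heads, seq_len, seq_len) attention weights from one transformer layer.
    num_heads = attn.shape[0]
    weights = attn[:, query_index, :]  # positions attended to by the chosen utterance
    plt.imshow(weights, aspect="auto", cmap="Purples")
    plt.yticks(range(num_heads), [f"head {h}" for h in range(num_heads)])
    plt.xlabel("Utterance position attended to")
    plt.title(title)
    plt.colorbar()
    plt.tight_layout()
    plt.show()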

REFERENCES

[1] A. Kumar, P. Dogra, and V. Dabas, “Emotion analysis of twitter using opinion mining,” in 2015 Eighth International Conference on Contemporary Computing, 2015, pp. 285–290.
[2] F. A. Pujol, H. Mora, and A. Martínez, “Emotion recognition to improve e-healthcare systems in smart cities,” in Research & Innovation Forum 2019, A. Visvizi and M. D. Lytras, Eds., 2019, pp. 245–254.
[3] L. Zhou, J. Gao, D. Li, and H.-Y. Shum, “The design and implementation of xiaoice, an empathetic social chatbot,” Computational Linguistics, vol. 46, no. 1, pp. 53–93, 2020.
[4] D. Hazarika, S. Poria, A. Zadeh, E. Cambria, L.-P. Morency, and R. Zimmermann, “Conversational memory network for emotion recognition in dyadic dialogue videos,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 2122–2132.
[5] D. Hazarika, S. Poria, R. Mihalcea, E. Cambria, and R. Zimmermann, “ICON: Interactive conversational memory network for multimodal emotion detection,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 2594–2604.
[6] N. Majumder, S. Poria, D. Hazarika, R. Mihalcea, A. Gelbukh, and E. Cambria, “Dialoguernn: An attentive rnn for emotion detection in conversations,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 6818–6825.
[7] W. Jiao, H. Yang, I. King, and M. R. Lyu, “HiGRU: Hierarchical gated recurrent units for utterance-level emotion recognition,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 397–406.
[8] J. Li, D. Ji, F. Li, M. Zhang, and Y. Liu, “HiTrans: A transformer-based context- and speaker-sensitive model for emotion detection in conversations,” in Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 4190–4200.
[9] H. Ma, J. Wang, L. Qian, and H. Lin, “Han-regru: hierarchical attention network with residual gated recurrent unit for emotion recognition in conversation,” Neural Computing and Applications, vol. 33, no. 7, pp. 2685–2703, 2021.
[10] Y. Mao, G. Liu, X. Wang, W. Gao, and X. Li, “DialogueTRM: Exploring multi-modal emotional dynamics in a conversation,” in Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 2694–2704.
[11] H. Ma, J. Wang, H. Lin, X. Pan, Y. Zhang, and Z. Yang, “A multi-view network for real-time emotion recognition in conversations,” Knowledge-Based Systems, vol. 236, p. 107751, 2022.
[12] D. Ghosal, N. Majumder, S. Poria, N. Chhaya, and A. Gelbukh, “DialogueGCN: A graph convolutional neural network for emotion recognition in conversation,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019, pp. 154–164.
[13] J. Hu, Y. Liu, J. Zhao, and Q. Jin, “MMGCN: Multimodal fusion via deep graph convolution network for emotion recognition in conversation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 5666–5675.
[14] W. Nie, R. Chang, M. Ren, Y. Su, and A. Liu, “I-gcn: Incremental graph convolution network for conversation emotion detection,” IEEE Transactions on Multimedia, pp. 1–1, 2021.
[15] M. Ren, X. Huang, W. Li, D. Song, and W. Nie, “Lr-gcn: Latent relation-aware graph convolutional network for conversational emotion recognition,” IEEE Transactions on Multimedia, pp. 1–1, 2021.
[16] A. Mehrabian et al., Silent messages. Wadsworth Belmont, CA, 1971.
[17] D. Hu, X. Hou, L. Wei, L. Jiang, and Y. Mo, “Mm-dfn: Multimodal dynamic fusion network for emotion recognition in conversations,” in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, 2022.
[18] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “Iemocap: Interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008.
[19] B. Schuller, M. Valstar, R. Cowie, and M. Pantic, “Avec 2012: The continuous audio/visual emotion challenge - an introduction,” in Proceedings of the 14th ACM International Conference on Multimodal Interaction, 2012, pp. 361–362.
[20] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea, “MELD: A multimodal multi-party dataset for emotion recognition in conversations,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 527–536.
[21] C. M. Lee and S. Narayanan, “Toward detecting emotions in spoken dialogs,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 2, pp. 293–303, 2005.
[22] L. Devillers and L. Vidrascu, “Real-life emotions detection with lexical and paralinguistic cues on human-human call center dialogs,” in Ninth International Conference on Spoken Language Processing, 2006.
[23] W. Jiao, M. Lyu, and I. King, “Real-time emotion recognition via attention gated hierarchical memory network,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, 2020, pp. 8002–8009.
[24] S. Zou, X. Huang, X. Shen, and H. Liu, “Improving multimodal fusion with main modal transformer for emotion recognition in conversation,” Knowledge-Based Systems, vol. 258, p. 109978, 2022.
[25] G. Hu, T.-E. Lin, Y. Zhao, G. Lu, Y. Wu, and Y. Li, “Unimse: Towards unified multimodal sentiment analysis and emotion recognition,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022.
[26] P. P. Liang, Z. Liu, A. Bagher Zadeh, and L.-P. Morency, “Multimodal language analysis with recurrent multistage fusion,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 150–161.
[27] M. Wöllmer, F. Weninger, T. Knaup, B. Schuller, C. Sun, K. Sagae, and L.-P. Morency, “Youtube movie reviews: Sentiment analysis in an audio-visual context,” IEEE Intelligent Systems, vol. 28, no. 3, pp. 46–53, 2013.
[28] S. Poria, I. Chaturvedi, E. Cambria, and A. Hussain, “Convolutional mkl based multimodal emotion recognition and sentiment analysis,” in 2016 IEEE 16th International Conference on Data Mining, 2016, pp. 439–448.
[29] B. Nojavanasghari, D. Gopinath, J. Koushik, T. Baltrušaitis, and L.-P. Morency, “Deep multimodal fusion for persuasiveness prediction,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction, 2016, pp. 284–288.
[30] O. Kampman, E. J. Barezi, D. Bertero, and P. Fung, “Investigating audio, video, and text fusion methods for end-to-end automatic personality prediction,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2018, pp. 606–611.
[31] Z. Liu, Y. Shen, V. B. Lakshminarasimhan, P. P. Liang, A. Bagher Zadeh, and L.-P. Morency, “Efficient low-rank multimodal fusion with modality-specific factors,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2247–2256.
[32] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L.-P. Morency, “Tensor fusion network for multimodal sentiment analysis,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1103–1114.
[33] A. Zadeh, P. P. Liang, N. Mazumder, S. Poria, E. Cambria, and L.-P. Morency, “Memory fusion network for multi-view sequential learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[34] Y.-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, “Multimodal transformer for unaligned multimodal language sequences,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 6558–6569.
[35] W. Rahman, M. K. Hasan, S. Lee, A. Bagher Zadeh, C. Mao, L.-P. Morency, and E. Hoque, “Integrating multimodal information in large pretrained transformers,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 2359–2369.
[36] W. Yu, H. Xu, Z. Yuan, and J. Wu, “Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 12, 2021, pp. 10790–10797.
[37] Z. Yuan, W. Li, H. Xu, and W. Yu, “Transformer-based feature reconstruction network for robust multimodal sentiment analysis,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 4400–4407.
[38] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
[39] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” in International Conference on Learning Representations, 2015.
[40] J. Yim, D. Joo, J. Bae, and J. Kim, “A gift from knowledge distillation: Fast optimization, network minimization and transfer learning,” in Proceedings of CVPR, 2017.
[41] N. Passalis and A. Tefas, “Learning deep representations with probabilistic knowledge transfer,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 268–284.
[42] T. Li, J. Li, Z. Liu, and C. Zhang, “Few sample knowledge distillation for efficient network compression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14639–14647.
[43] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep mutual learning,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4320–4328.
[44] I. Chung, S. Park, J. Kim, and N. Kwak, “Feature-map-level online adversarial knowledge distillation,” in International Conference on Machine Learning, 2020, pp. 2006–2015.
[45] L. Zhang, J. Song, A. Gao, J. Chen, C. Bao, and K. Ma, “Be your own teacher: Improve the performance of convolutional neural networks via self distillation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
[46] Y. Hou, Z. Ma, C. Liu, and C. C. Loy, “Learning lightweight lane detection cnns by self attention distillation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1013–1021.

[47] S. Albanie, A. Nagrani, A. Vedaldi, and A. Zisserman, “Emotion recognition in speech using cross-modal transfer in the wild,” in Proceedings of the 26th ACM international conference on Multimedia, 2018, pp. 292–301.
[48] Y. Wang, J. Wu, P. Heracleous, S. Wada, R. Kimura, and S. Kurihara, “Implicit knowledge injectable cross attention audiovisual model for group emotion recognition,” in Proceedings of the 2020 International Conference on Multimodal Interaction, 2020, pp. 827–834.
[49] L. Schoneveld, A. Othmani, and H. Abdelkawy, “Leveraging recent advances in deep learning for audio-visual emotion recognition,” Pattern Recognition Letters, vol. 146, pp. 1–7, 2021.
[50] T. Moriya, T. Ochiai, S. Karita, H. Sato, T. Tanaka, T. Ashihara, R. Masumura, Y. Shinohara, and M. Delcroix, “Self-distillation for improving ctc-transformer-based asr systems,” in INTERSPEECH, 2020, pp. 546–550.
[51] T. Zhou, P. Cao, Y. Chen, K. Liu, J. Zhao, K. Niu, W. Chong, and S. Liu, “Automatic icd coding via interactive shared representation networks with self-distillation mechanism,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 5948–5957.
[52] X. Luo, Q. Liang, D. Liu, and Y. Qu, “Boosting lightweight single image super-resolution via joint-distillation,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 1535–1543.
[53] L. Yuan, F. E. Tay, G. Li, T. Wang, and J. Feng, “Revisiting knowledge distillation via label smoothing regularization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3903–3911.
[54] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of NIPS, vol. 30, 2017.
[55] D. Ghosal, N. Majumder, A. Gelbukh, R. Mihalcea, and S. Poria, “COSMIC: COmmonSense knowledge for eMotion identification in conversations,” in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 2470–2481.
[56] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[57] F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent developments in opensmile, the munich open-source multimedia feature extractor,” in Proceedings of the 21st ACM international conference on Multimedia, 2013, pp. 835–838.
[58] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
[59] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of ICLR, 2015.
[60] Y. Kim, “Convolutional neural networks for sentence classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1746–1751.
[61] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[62] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, no. 11, 2008.
[63] Y. Ding, A. Rich, M. Wang, N. Stier, P. Sen, M. Turk, and T. Höllerer, “Sparse fusion for multimodal transformers,” arXiv preprint arXiv:2111.11992, 2021.

Hui Ma received the M.S. degree from Dalian University of Technology, China, in 2019. She is currently working toward the Ph.D. degree at School of Computer Science and Technology, Dalian University of Technology. Her research interests include natural language processing, dialogue system, and sentiment analysis.

Jian Wang received the Ph.D. degree from Dalian University of Technology, China, in 2014. She is currently a professor at School of Computer Science and Technology, Dalian University of Technology. Her research interests include natural language processing, text mining, and information retrieval.

Hongfei Lin received the Ph.D. degree from Northeastern University, China, in 2000. He is currently a professor at School of Computer Science and Technology, Dalian University of Technology. His research interests include natural language processing, text mining, and sentiment analysis.

Bo Zhang received the B.S. degree from Tiangong University, China, in 2019. He is currently working toward the Ph.D. degree at School of Computer Science and Technology, Dalian University of Technology. His research interests include natural language processing, dialogue system, and text generation.

Yijia Zhang received the Ph.D. degree from the Dalian University of Technology, China, in 2014. He is currently a professor at School of Information Science and Technology, Dalian Maritime University. His research interests include natural language processing, bioinformatics, and text mining.

Bo Xu received the Ph.D. degree from the Dalian University of Technology, China, in 2018. He is currently an associate professor at School of Computer Science and Technology, Dalian University of Technology. His research interests include information retrieval, dialogue system, and natural language processing.
