A Transformer-Based Model With Self-Distillation For Multimodal Emotion Recognition in Conversations
This article has been accepted for publication in IEEE Transactions on Multimedia. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/TMM.2023.3271019
task of recognizing the emotion of each utterance in a conversation, is crucial for building empathetic machines. Existing studies focus mainly on capturing context- and speaker-sensitive dependencies on the textual modality but ignore the

learns weights between modalities dynamically by designing a

[Fig. 1: a multimodal conversation example with visual, audio, and text streams for two speakers; recovered utterances include “Wendy, we had a deal! Yeah, you promised!”, “Wendy! Wendy! Wendy!” [anger], “Who was that?” [neutral], and “... there for you.” [anger].]
importance of contributions of different modalities, let us focus on the 3rd utterance. The textual and acoustic expressions play more critical roles in recognizing the “sadness” emotion than the visual expression because a smiling face usually means the “joy” emotion. (3) To understand the importance of efficient modal representations, let us focus on the textual expression of the 1st utterance, which contains multiple exclamation points. If the learned representation does not include the meaning of “!”, it is challenging to identify the “anger” emotion.

Therefore, it is valuable to capture intra- and inter-modal interactions between utterances, dynamically learn weights between modalities, and enhance modal representations for multimodal ERC. However, existing studies of the task have some limitations in achieving these characteristics. On the one hand, most methods have drawbacks in modeling intra- and inter-modal interactions. For example, CMN [4], ICON [5], and DialogueRNN [6] concatenate unimodal features at the input level, and thus cannot capture intra-modal interactions explicitly. While DialogueTRM [10] designs hierarchical transformer and multi-grained interaction fusion modules to explore intra- and inter-modal emotional behaviors, it ignores inter-modal interactions between different utterances. MMGCN [13] and MM-DFN [17] are graph-based fusion methods that require manually constructed graph structures to represent conversations. On the other hand, existing methods rely on the designed model to learn modal representations, but no work focuses on further improving modal representations using model-agnostic techniques for ERC.

In this work, a transformer-based model with self-distillation (SDT) is proposed to take into account the three aforementioned characteristics. First, we introduce intra- and inter-modal transformers in a modality encoder to capture intra- and inter-modal interactions, and take positional and speaker embeddings as additional inputs of these transformers to capture contextual and speaker information. Next, a hierarchical gated fusion strategy is proposed to dynamically fuse information from multiple modalities. Then, we predict emotion labels of conversation utterances based on fused multimodal representations in an emotion classifier. We call the above three components a transformer-based model. Finally, to learn more effective modal representations, we introduce self-distillation into the proposed transformer-based model, which transfers knowledge of hard and soft labels from the model to each modality. We treat the proposed model as the teacher and design three students according to the three modalities. These students are trained by distilling knowledge from the teacher to learn better modal representations.

In summary, our contributions are as follows:
• We propose a transformer-based model for multimodal ERC that contains a modality encoder for capturing intra- and inter-modal interactions between conversation utterances and a hierarchical gated fusion strategy for adaptively learning weights between modalities.
• To learn more effective modal representations, we devise self-distillation that transfers knowledge of hard and soft labels from the proposed model to each modality.
• Experiments on two benchmark datasets show the superiority of our proposed model. In addition, several studies are conducted to investigate the impact of positional and speaker embeddings, intra- and inter-modal transformers, self-distillation loss functions, and the hierarchical gated fusion strategy.

The rest of this paper is organized as follows: Section II discusses the related work; Section III formalizes the task definition and describes the proposed model; Section IV gives the experimental settings; Section V presents the experimental results and discussion; finally, Section VI concludes the paper and provides directions for further work.

II. RELATED WORK

A. Emotion Recognition in Conversations

ERC has attracted widespread interest among researchers with the increase in available conversation datasets, such as IEMOCAP [18], AVEC [19], and MELD [20]. Early studies primarily used lexicon-based methods [21], [22]. Recent works have generally resorted to deep neural networks and focused on modeling context- and speaker-sensitive dependencies. We divide the existing methods into two categories, speaker-ignorant and speaker-dependent methods, according to whether they utilize speaker information.

Speaker-ignorant methods do not distinguish speakers and focus only on capturing contextual information in a conversation. HiGRU [7] contains two gated recurrent units (GRUs) to model contextual relationships between words and utterances, respectively. AGHMN [23] uses a hierarchical memory network to enhance utterance representations and introduces an attention GRU to model contextual information. MVN [11] utilizes a multi-view network to model word- and utterance-level dependencies in a conversation. In contrast, speaker-dependent methods model both context- and speaker-sensitive dependencies. DialogueRNN [6] leverages three distinct GRUs to update speaker, context, and emotional states in a conversation, respectively. DialogueGCN [12] uses a graph convolutional network to model speaker and conversation sequential information. HiTrans [8] consists of two hierarchical transformers to capture global contextual information and exploits an auxiliary task to model speaker-sensitive dependencies.

However, most of them are proposed for the textual modality, ignoring the effectiveness of other modalities. Due to the promising performance in the multimodal community, some approaches tend to address multimodal ERC. DialogueTRM [10] explores intra- and inter-modal emotional behaviors using hierarchical transformer and multi-grained interaction fusion modules, respectively. MMGCN [13] constructs a fully connected graph to model multimodal and long-distance contextual information, and speaker embeddings are added for encoding speaker information. MM-DFN [17] designs a graph-based dynamic fusion module to reduce redundancy and enhance complementarity between modalities. MMTr [24] preserves the integrity of main modal representations and enhances weak modal representations by using multi-head attention. UniMSE [25] performs modality fusion at syntactic and semantic levels and introduces inter-modality contrastive learning to differentiate fusion representations among samples.
This paper focuses on exploring intra- and inter-modal interactions between utterances, learning weights between modalities, and enhancing modal representations for multimodal ERC.

B. Multimodal Language Analysis

Multimodal language analysis is a rapidly growing field and includes various tasks [26], such as multimodal emotion recognition, sentiment analysis, and personality traits recognition. The key in this area is to fuse multimodal information. Early studies on multimodal fusion mainly included early fusion and late fusion. Early fusion [27], [28] integrates features of different modalities at the input level. Late fusion [29], [30] constructs distinct models for each modality and then ensembles their outputs by majority voting or weighted averaging, etc. Unfortunately, as stated in [31], these two kinds of fusion methods cannot effectively capture intra- and inter-modal interactions.

Subsequently, model fusion has become popular and various models have been proposed. TFN [32] models unimodal, bimodal, and trimodal interactions explicitly by computing the Cartesian product. LMF [31] utilizes low-rank weight tensors for multimodal fusion, which reduces the complexity of TFN. MFN [33] learns cross-modal interactions with an attention mechanism and stores information over time by a multi-view gated memory. MulT [34] utilizes cross-modal transformers to model long-range dependencies across modalities. Rahman et al. [35] fine-tuned large pre-trained transformer models for multimodal language by designing a multimodal adaptation gate (MAG). Self-MM [36] uses a unimodal label generation strategy to acquire independent unimodal supervision and then learns multimodal and unimodal tasks jointly. Yuan et al. [37] adopted transformer encoders to model intra- and inter-modal interactions between modality sequences. In order to capture intra- and inter-modal interactions between conversation utterances and meanwhile learn weights between modalities, we present a transformer-based model.

C. Knowledge Distillation

Knowledge distillation (KD) aims at transferring knowledge from a large teacher network to a small student network. The knowledge mainly includes soft labels of the last output layer (i.e., output-based knowledge) [38], features of intermediate layers (i.e., feature-based knowledge) [39], and relationships between different layers (i.e., relation-based knowledge) [40]. Depending on the learning schemes, existing methods on KD are categorized into three classes: offline distillation [41], [42], online distillation [43], [44], and self-distillation [45], [46]. In offline distillation, the teacher network is first trained and then the pre-trained teacher distills its knowledge to guide the student training. In online distillation, the teacher and student networks are updated simultaneously, and hence its training process is only one-phase. Self-distillation is a special case of online distillation that teaches a single network using its own knowledge.

Recently, KD has been used for multimodal emotion recognition. For example, Albanie et al. [47] transferred visual knowledge into a speech emotion recognition model using unlabelled video data. Wang et al. [48] proposed K-injection subnetworks to distill linguistic and acoustic knowledge representing group emotions and transfer implicit knowledge into the audiovisual model for group emotion recognition. Schoneveld et al. [49] applied KD to further improve performance for facial expression recognition. Most existing models belong to offline distillation, which requires training a teacher network. In contrast, self-distillation needs no extra network except for the network itself. While self-distillation has been successfully applied in computer vision and natural language processing [50]–[52], it focuses on unimodal tasks.

In this work, we adopt the idea of self-distillation to enhance modal representations for multimodal ERC. Moreover, output-based knowledge is used only, due to the following reasons: (1) Soft labels can be used as training supervision, which contain dark knowledge [38] and can provide effective regularization for the model [53]. (2) Intuitively, the features of different modalities vary widely, and hence matching fused multimodal features with unimodal features is inappropriate². (3) Our teacher and student networks, lying in the same model, have different architectures, which results in an inability to inject relationships between different layers of the teacher network into the student network [41]. Therefore, we adopt output-based knowledge rather than feature- and relation-based knowledge.

² We tried to add feature-based knowledge, but the performance drops significantly.

III. METHODOLOGY

A. Task Definition

A conversation is composed of N consecutive utterances {u_1, u_2, · · · , u_N} and M speakers {s_1, s_2, · · · , s_M}. Each utterance u_i is spoken by a speaker s_ϕ(u_i), where ϕ is the mapping between an utterance and its corresponding speaker's index. Moreover, u_i involves textual (t), acoustic (a), and visual (v) modalities, and their feature representations are denoted as u_i^t ∈ R^{d_t}, u_i^a ∈ R^{d_a}, and u_i^v ∈ R^{d_v}, respectively. We represent the textual, acoustic, and visual modality sequences of all utterances in the conversation as U_t = [u_1^t; u_2^t; · · · ; u_N^t] ∈ R^{N×d_t}, U_a = [u_1^a; u_2^a; · · · ; u_N^a] ∈ R^{N×d_a}, and U_v = [u_1^v; u_2^v; · · · ; u_N^v] ∈ R^{N×d_v}, respectively. The ERC task aims to predict the emotion label of each utterance u_i from pre-defined emotion categories.

B. Overview

Fig. 2 gives an overview of our proposed SDT. After extracting utterance-level unimodal features, the transformer-based model consists of three modules: a modality encoder module for capturing intra- and inter-modal interactions between different utterances, a hierarchical gated fusion module for adaptively learning weights between modalities, and an emotion classifier module for predicting emotion labels. Furthermore, we introduce self-distillation and devise two kinds of losses to transfer knowledge from our proposed model within each modality to learn better modal representations.
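To make the two kinds of self-distillation losses concrete, the sketch below shows one assumed formulation of output-based distillation for a single modality branch: a hard-label cross-entropy term plus a soft-label KL term computed on temperature-softened teacher predictions, in the spirit of Hinton et al. [38]. The paper's exact loss definitions are on pages not reproduced in this excerpt, so the function and variable names here are illustrative only, not the authors' code.

```python
import torch
import torch.nn.functional as F

def self_distillation_losses(student_logits, teacher_logits, labels, tau=1.0):
    """Hedged sketch of output-based self-distillation for one modality.

    student_logits: (N, C) logits from a single-modality student branch.
    teacher_logits: (N, C) logits from the fused multimodal teacher branch.
    labels:         (N,)   ground-truth emotion indices (hard labels).
    tau:            temperature used to soften the teacher distribution.
    """
    # Hard-label loss: the modality branch is also trained on the ground truth.
    ce = F.cross_entropy(student_logits, labels)

    # Soft-label loss: match the modality branch to the softened teacher
    # distribution (KL divergence, scaled by tau**2 as in Hinton et al. [38];
    # detaching the teacher is an assumption of this sketch).
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits.detach() / tau, dim=-1),
        reduction="batchmean",
    ) * (tau ** 2)
    return ce, kl
```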
[Fig. 2: architecture diagram. Utterance features are extracted with RoBERTa (text), openSMILE (audio), and DenseNet (visual), projected by Conv1D layers, passed through intra-modal transformers (e.g., a→a, v→v) and inter-modal transformers (a→t, v→t, t→a, v→a, t→v, a→v), fused, and fed to a softmax classifier trained with L_Task.]

Fig. 2. The overall architecture of SDT. After extracting utterance-level unimodal features, it consists of four key components: Modality Encoder, Hierarchical Gated Fusion, Emotion Classifier, and Self-distillation.
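As a reading aid for Fig. 2, the skeleton below sketches how the named components could compose in code; the submodule internals are elided, and all class and argument names are assumptions rather than the authors' implementation.

```python
import torch.nn as nn

class SDTSketch(nn.Module):
    """Illustrative skeleton of the SDT forward pass (names assumed)."""

    def __init__(self, modality_encoder, gated_fusion, classifier):
        super().__init__()
        self.modality_encoder = modality_encoder  # intra-/inter-modal transformers
        self.gated_fusion = gated_fusion          # hierarchical gated fusion
        self.classifier = classifier              # emotion classifier

    def forward(self, U_t, U_a, U_v, speaker_ids):
        # 1) Modality encoder: modality-enhanced sequence representations.
        H_t, H_a, H_v = self.modality_encoder(U_t, U_a, U_v, speaker_ids)
        # 2) Hierarchical gated fusion: adaptively weight and fuse modalities.
        H_fused = self.gated_fusion(H_t, H_a, H_v)
        # 3) Emotion classifier: per-utterance emotion logits.
        logits = self.classifier(H_fused)
        # Self-distillation (training only) would additionally classify each
        # unimodal branch and distil the fused predictions into it.
        return logits, (H_t, H_a, H_v)
```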
C. Modality Encoder

The modality encoder obtains modality-enhanced modality sequence representations that can learn intra- and inter-modal interactions between conversation utterances.

Temporal Convolution: To ensure that three unimodal sequence representations lie in the same space, we feed them into a 1D convolutional layer:

one-hot vector of speaker s_j, i.e., 1 in the jth position and 0 otherwise. Hence, speaker embeddings corresponding to the conversation can be represented as SE = [s_ϕ(u_1); s_ϕ(u_2); · · · ; s_ϕ(u_N)]. Overall, we augment positional and speaker embeddings to the convolved sequence:

H_m = U′_m + PE + SE.    (4)
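A minimal sketch of this step follows: a kernel-size-1 Conv1d maps one modality to a shared dimension, and positional and speaker embeddings are added as in Eq. (4). The paper's exact positional-embedding type and speaker-embedding projection are defined on pages not reproduced here, so the nn.Embedding choices below are assumptions (an embedding lookup is equivalent to projecting the one-hot speaker vectors described above).

```python
import torch
import torch.nn as nn

class UnimodalEmbedding(nn.Module):
    """Project one modality sequence to a shared size and add position/speaker
    embeddings, roughly as in Eq. (4): H_m = U'_m + PE + SE (details assumed)."""

    def __init__(self, feat_dim, d_model, n_speakers, max_len=512):
        super().__init__()
        # 1D convolution over the utterance sequence (kernel size 1 in the paper).
        self.conv = nn.Conv1d(feat_dim, d_model, kernel_size=1)
        self.pos_emb = nn.Embedding(max_len, d_model)     # positional embedding
        self.spk_emb = nn.Embedding(n_speakers, d_model)  # speaker embedding

    def forward(self, U_m, speaker_ids):
        # U_m: (N, feat_dim) utterance features; speaker_ids: (N,) speaker indices.
        U_conv = self.conv(U_m.t().unsqueeze(0)).squeeze(0).t()   # (N, d_model)
        positions = torch.arange(U_m.size(0), device=U_m.device)
        return U_conv + self.pos_emb(positions) + self.spk_emb(speaker_ids)
```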
n modality and hence can capture inter-modal interactions between the utterance sequence.

In summary, the n-enhanced m-modality sequence representation, H_{n→m}, is obtained from the modality encoder module, where n, m ∈ {t, a, v}.

Task Loss: We utilize the cross-entropy loss for estimating the quality of emotion predictions during training:

L_Task = −(1/N) Σ_{i=1}^{N} Σ_{j=1}^{C} y_{i,j} log(ŷ_{i,j}),    (14)
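The sketch below illustrates one standard way such an n→m block can be built with multi-head attention, where queries come from modality m and keys/values from modality n (setting n = m recovers the intra-modal case), followed by the per-utterance cross-entropy of Eq. (14). The residual/normalization layout is the usual transformer block [54] and is an assumption, not the authors' exact design; the hyperparameter defaults mirror Section IV-D.

```python
import torch.nn as nn
import torch.nn.functional as F

class CrossModalLayer(nn.Module):
    """n→m transformer layer sketch: queries from modality m, keys/values from
    modality n (m = n gives the intra-modal case). Internals assumed."""

    def __init__(self, d_model=1024, n_heads=8, d_ff=1024, dropout=0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, H_m, H_n):
        # H_m, H_n: (batch, N, d_model) sequences of utterance representations.
        attn_out, _ = self.attn(query=H_m, key=H_n, value=H_n)
        H = self.norm1(H_m + attn_out)        # residual + layer norm
        return self.norm2(H + self.ff(H))     # H_{n→m}

def task_loss(logits, labels):
    """Eq. (14): average cross-entropy over the N utterances of a conversation."""
    return F.cross_entropy(logits, labels)
```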
TABLE II
RESULTS ON THE IEMOCAP DATASET; “*”: BASELINES ARE RE-IMPLEMENTED USING OUR EXTRACTED FEATURES; BOLD FONT DENOTES THE BEST PERFORMANCE.

Models | happy ACC, F1 | sad ACC, F1 | neutral ACC, F1 | angry ACC, F1 | excited ACC, F1 | frustrated ACC, F1 | ACC | w-F1
CMN 24.31 30.30 56.33 62.02 52.34 52.41 61.76 60.17 56.19 60.76 72.44 61.27 56.87 56.33
ICON 25.00 31.30 67.35 73.17 55.99 58.50 69.41 66.29 70.90 67.09 71.92 65.08 62.85 62.25
DialogueRNN 25.00 34.95 82.86 84.58 54.43 57.66 61.76 64.42 90.97 76.30 62.20 59.55 65.43 64.29
MMGCN 32.64 39.66 72.65 76.89 65.10 62.81 73.53 71.43 77.93 75.40 65.09 63.43 66.61 66.25
DialogueTRM 61.11 57.89 84.90 81.25 69.27 68.56 76.47 65.99 76.25 76.13 50.39 58.09 68.52 68.20
MM-DFN 44.44 44.44 77.55 80.00 71.35 66.99 75.88 70.88 74.25 76.42 58.27 61.67 67.84 67.85
MMTr - - - - - - - - - - - - 72.27 71.91
UniMSE - - - - - - - - - - - - 70.56 70.66
DialogueRNN* 57.64 57.64 77.96 80.25 75.52 70.56 68.24 64.99 73.91 75.95 59.06 62.41 69.38 69.37
MMGCN* 50.00 56.25 78.78 81.43 71.35 67.57 68.24 66.29 75.92 76.82 65.09 64.92 69.62 69.61
DialogueTRM* 72.22 62.84 85.71 83.33 69.27 68.12 79.41 66.67 67.22 75.00 57.22 63.28 69.87 69.93
MM-DFN* 57.64 52.87 84.49 86.07 76.04 71.66 70.59 65.04 73.24 75.26 55.91 62.19 69.87 69.91
SDT (Ours) 72.71 66.19 79.51 81.84 76.33 74.62 71.88 69.73 76.79 80.17 67.14 68.68 73.95 74.08
w/o self-distillation 71.53 58.52 79.59 79.43 69.27 70.65 69.41 67.05 69.23 77.09 67.98 68.07 70.73 71.10
In addition, we re-implement DialogueRNN, MMGCN, DialogueTRM, and MM-DFN with our extracted features, namely DialogueRNN*, MMGCN*, DialogueTRM*, and MM-DFN*. We use the same data splits to implement all models.

D. Implementation Details

We implement the proposed model using PyTorch⁴ and use Adam [59] as the optimizer with an initial learning rate of 1.0e−4 for IEMOCAP and 5.0e−6 for MELD. The batch size is 16 for IEMOCAP and 8 for MELD, and the temperature τ for the two datasets is set to 1 and 8, respectively. For the 1D convolutional layers, the numbers of input channels are set to 1024, 1582, and 342 for the textual, acoustic, and visual modalities, respectively (i.e., their corresponding feature dimensions) on IEMOCAP. On MELD, these parameters are set to 1024, 300, and 342, respectively. In addition, the number of output channels and the kernel size are set to 1024 and 1, respectively, for all three modalities on the two datasets. For the transformer encoder, the hidden size, number of attention heads, feed-forward size, and number of layers are set to 1024, 8, 1024, and 1, respectively. To prevent overfitting, we set the L2 weight decay to 1.0e−5 and employ dropout with a rate of 0.5. All results are averages of 10 runs.

⁴ https://pytorch.org/

V. RESULTS AND DISCUSSION

A. Overall Results

Table II and Table III present the performance of baselines and SDT on the IEMOCAP and MELD datasets, respectively. On the IEMOCAP dataset, SDT performs better than all baselines and outperforms MMTr by 1.68% and 2.17% in terms of overall accuracy and weighted F1-score, respectively. In addition, SDT achieves a significant improvement on most emotion classes in terms of F1-score. On the MELD dataset, SDT achieves the best performance compared to all baselines in terms of overall accuracy and weighted F1-score, and outperforms UniMSE by 2.46% and 1.09%, respectively. Similar to IEMOCAP, SDT performs better on most emotion classes in terms of F1-score.

Overall, the above results indicate the effectiveness of SDT. Furthermore, we have several similar findings on the two datasets: (1) DialogueTRM has superior performance compared to DialogueRNN, MMGCN, and MM-DFN, which use TextCNN [60] to extract textual features. This is because the textual modality plays a more important role for ERC [13], and DialogueTRM extracts textual features using BERT [61], which is more powerful than TextCNN. (2) The baselines gain further improvement and achieve comparable results when using our extracted utterance features. The results show that our feature extractor is more effective and that sequence- and graph-based baselines can achieve similar performance using our extracted features. (3) Even without self-distillation, our proposed model is still comparable to strong baselines, demonstrating the power of the proposed transformer-based model.

B. Ablation Study

We carry out ablation experiments on IEMOCAP and MELD. Table IV reports the results under different ablation settings.

Ablation on Transformer-based Model: Positional embeddings, speaker embeddings, intra-modal transformers, and inter-modal transformers are four crucial components of our proposed transformer-based model. We remove only one component at a time to evaluate the effectiveness of that component. From Table IV, we conclude that: (1) All components are useful because removing one of them leads to performance degradation. (2) Positional and speaker embeddings have considerable effects on the two datasets, which means capturing sequential and speaker information is valuable. (3) Inter-modal transformers are more important than intra-modal transformers on the two datasets. This indicates that inter-modal interactions between conversation utterances could provide more helpful information.
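For reference, the training settings listed in Section IV-D can be collected into a single configuration; the dictionary layout below is mine, and only the values come from the text.

```python
# Hyperparameters as reported in Section IV-D (dictionary layout is illustrative).
SDT_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": {"IEMOCAP": 1e-4, "MELD": 5e-6},
    "batch_size": {"IEMOCAP": 16, "MELD": 8},
    "temperature_tau": {"IEMOCAP": 1, "MELD": 8},
    "conv1d_in_channels": {                      # per-modality feature dims
        "IEMOCAP": {"text": 1024, "audio": 1582, "visual": 342},
        "MELD": {"text": 1024, "audio": 300, "visual": 342},
    },
    "conv1d_out_channels": 1024,
    "conv1d_kernel_size": 1,
    "transformer": {"hidden": 1024, "heads": 8, "ffn": 1024, "layers": 1},
    "l2_weight_decay": 1e-5,
    "dropout": 0.5,
    "runs_averaged": 10,
}
```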
TABLE III
RESULTS ON THE MELD DATASET.

Models | neutral ACC, F1 | surprise ACC, F1 | fear ACC, F1 | sadness ACC, F1 | joy ACC, F1 | disgust ACC, F1 | anger ACC, F1 | ACC | w-F1
DialogueRNN 82.17 76.56 46.62 47.64 0.00 0.00 21.15 24.65 49.50 51.49 0.00 0.00 48.41 46.01 60.27 57.95
MMGCN 84.32 76.96 47.33 49.63 2.00 3.64 14.90 20.39 56.97 53.76 1.47 2.82 42.61 45.23 61.34 58.41
DialogueTRM 83.20 79.41 56.94 55.27 12.00 17.39 27.88 36.48 60.45 60.30 16.18 20.18 51.01 49.79 65.10 63.80
MM-DFN 79.06 75.80 53.02 50.42 0.00 0.00 17.79 23.72 59.20 55.48 0.00 0.00 50.43 48.27 60.96 58.72
MMTr - - - - - - - - - - - - - - 64.64 64.41
UniMSE - - - - - - - - - - - - - - 65.09 65.51
DialogueRNN* 85.11 79.60 54.09 56.72 10.00 12.66 29.81 38.63 62.94 63.81 22.06 27.27 53.62 53.24 66.70 65.31
MMGCN* 81.53 79.20 58.36 57.75 8.00 13.79 31.73 39.40 69.90 63.43 20.59 24.56 52.17 53.49 66.40 65.21
DialogueTRM* 83.44 79.54 54.45 57.09 24.00 27.91 33.17 40.95 60.45 62.79 22.06 28.04 58.26 53.96 66.70 65.76
MM-DFN* 83.52 79.65 63.35 58.17 32.00 26.67 26.44 35.71 63.68 64.89 19.12 24.76 49.28 52.15 66.55 65.48
SDT (Ours) 83.22 80.19 61.28 59.07 13.80 17.88 34.90 43.69 63.24 64.29 22.65 28.78 56.93 54.33 67.55 66.60
w/o self-distillation 82.01 80.00 57.65 57.96 20.00 23.81 32.21 41.61 65.17 64.22 25.00 27.42 57.97 54.05 66.97 66.26
TABLE IV
RESULTS OF ABLATION STUDIES ON THE TWO DATASETS.

Models | IEMOCAP ACC | IEMOCAP w-F1 | MELD ACC | MELD w-F1
w/o intra-modal transformers 73.38 73.36 67.13 66.21
w/o inter-modal transformers 72.09 72.26 66.97 65.55
Self-distillation
w/o L_CE 73.07 73.32 67.39 66.37
w/o L_KL 72.95 73.03 67.09 66.33
Modality
Text 66.42 66.58 66.82 65.52
Audio 59.77 59.34 48.12 40.81
Visual 41.47 42.71 48.05 32.01
Text + Audio 72.52 72.75 67.05 66.24
Text + Visual 69.01 69.07 67.20 66.18
Audio + Visual 62.05 62.26 47.24 40.21

[Fig. 3 panels: (a) Overall accuracy (ACC, %); (b) Weighted F1-score (w-F1, %); methods compared: Uni-Cat-Transformer, Add, Concatenation, Ours; datasets: IEMOCAP and MELD.]

Fig. 3. Performance of different fusion methods on the two datasets. Bold font means that the improvement over all baselines is statistically significant (t-test with p < 0.05).

Ablation on Self-distillation Loss Functions: There are two kinds of losses (i.e., L_CE and L_KL) for self-distillation. To verify the importance of these losses, we remove one loss at a time. Table IV shows that L_CE and L_KL are complementary and our model performs best when all losses are included. The result demonstrates that transferring knowledge of both hard and soft labels from the proposed transformer-based model to each modality can further boost the model performance.

Effect of Different Modalities: To show the effect of different modalities, we remove one or two modalities at a time. From Table IV, we observe that: (1) For unimodal results, the textual modality has far better performance than the other two modalities, indicating that the textual feature plays a leading role in ERC. This finding is consistent with previous works [10], [13], [17]. (2) Any bimodal result is better than its own unimodal results. Moreover, fusing the textual modality with the acoustic or visual modality performs better than the fusion of the acoustic and visual modalities, due to the importance of textual features. (3) Using all three modalities gives the best performance. This result validates that emotion is affected by verbal, vocal, and visual expressions, and that integrating multimodal information is essential for ERC.

Effect of Different Fusion Strategies: To investigate the effect of our proposed hierarchical gated fusion module, we compare it with two typical information fusion strategies: (1) Add: representations are fused via element-wise addition. (2) Concatenation: representations are directly concatenated and followed by an FC layer. Add treats all representations equally, while Concatenation could implicitly choose the important information due to the FC layer. For a fair comparison, we replace the hierarchical gated fusion module of our model with hierarchical add and concatenation operations to implement the Add and Concatenation fusion strategies, respectively. In addition, we also compare SDT with a general transformer-based fusion method (i.e., unimodal features are concatenated and then fed into a transformer encoder) that we call Uni-Cat-Transformer.

As shown in Fig. 3, our proposed hierarchical gated fusion strategy significantly outperforms the other fusion strategies. The result indicates that directly fusing representations with Add and Concatenation is sub-optimal. Our proposed hierarchical gated fusion module first filters out irrelevant information at the unimodal level and then dynamically learns weights between different modalities at the multimodal level, which can more effectively fuse multimodal representations.

In addition, our model achieves a significant performance improvement over Uni-Cat-Transformer, which demonstrates the effectiveness of the proposed SDT in multimodal fusion.
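A minimal sketch of such a two-level scheme is shown below: a sigmoid gate filters each unimodal representation, and a softmax gate then weights the filtered modalities before a final projection. Because the module's equations are not included in this excerpt, the concrete gate forms are assumptions meant only to contrast with plain Add and Concatenation.

```python
import torch
import torch.nn as nn

class HierarchicalGatedFusionSketch(nn.Module):
    """Two-level gated fusion sketch: a unimodal gate filters each modality,
    then a multimodal gate weights the filtered modalities before fusing.
    The concrete gate equations are assumptions, not the paper's exact ones."""

    def __init__(self, d_model):
        super().__init__()
        # Unimodal-level gates: filter irrelevant information per modality.
        self.uni_gates = nn.ModuleDict({
            m: nn.Linear(d_model, d_model) for m in ("t", "a", "v")
        })
        # Multimodal-level gate: dynamic weights over the three modalities.
        self.multi_gate = nn.Linear(3 * d_model, 3)
        self.out = nn.Linear(3 * d_model, d_model)

    def forward(self, H):
        # H: dict of (N, d_model) modality representations, keys "t", "a", "v".
        filtered = {m: torch.sigmoid(self.uni_gates[m](H[m])) * H[m] for m in H}
        concat = torch.cat([filtered["t"], filtered["a"], filtered["v"]], dim=-1)
        weights = torch.softmax(self.multi_gate(concat), dim=-1)   # (N, 3)
        weighted = [weights[:, i:i + 1] * x for i, x in
                    enumerate((filtered["t"], filtered["a"], filtered["v"]))]
        return self.out(torch.cat(weighted, dim=-1))               # (N, d_model)
```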
other emotions. By comparing Fig. 5(b) and Fig. 5(c), we observe that our model with self-distillation yields a better
[Fig. 7 (case study, partially recovered): a conversation between Phoebe and Monica (e.g., “Ohh, let me see it! Let me see your hand!”, “Why do you want to see my hand?”, “I wanna see what's in your hand. I wanna see the trash.”, “Eww! Oh, it's all dirty. You should throw this out.”) with per-utterance emotion predictions such as surprise, neutral, and disgust.]

[Fig. 8 panels: (a) Intra-modal transformer (t → t) of SDT (only text); (b) Intra-modal transformer (t → t) of SDT; (c) Inter-modal transformer (a → t) of SDT; (d) Inter-modal transformer (v → t) of SDT.]

Fig. 8. Multi-head attention visualization for the 4th utterance in Fig. 7. There are 8 attention heads and different colors represent different heads. The darker the color, the more important for the 4th utterance.

TABLE V
TEST ACCURACY OF SDT ON UTTERANCES WITH AND WITHOUT EMOTIONAL SHIFT.

Dataset | Emotional Shift (#Utterances, ACC) | w/o Emotional Shift (#Utterances, ACC)
IEMOCAP | 410, 54.88 | 1151, 80.71
MELD | 1003, 61.62 | 861, 73.05

F. Error Analysis

Although the proposed SDT achieves strong performance, it still fails to detect some emotions. We analyze confusion matrices of the test set on the two datasets. From Fig. 9, we see that: (1) SDT misclassifies similar emotions, like “happy” and “excited”, “angry” and “frustrated” on IEMOCAP, and “surprise” and “anger” on MELD. (2) SDT also tends to misclassify other emotions as “neutral” on MELD because “neutral” is the majority class. (3) It is difficult to correctly detect the “fear” and “disgust” emotions on MELD because the two emotions are minority classes. Thus, it is challenging to recognize similar emotions and emotions with unbalanced data.

Besides, we also investigate SDT performance on emotional shift (i.e., two consecutive utterances spoken by the same speaker have different emotions). As shown in Table V, we observe that SDT performs poorer on utterances with
emotional shift than that without it⁵, which is consistent with previous works. The emotional shift in conversations is a complex phenomenon caused by multiple latent variables, e.g., the speaker's personality and intent; however, SDT and most existing models do not consider these factors, which may result in poor performance. Further improvement on this case needs to be explored.

⁵ In this paper, “without emotional shift” means that two consecutive utterances spoken by the same speaker have the same emotion.

VI. CONCLUSION

In this paper, we propose SDT, a transformer-based model with self-distillation for multimodal ERC. We use intra- and inter-modal transformers to model intra- and inter-modal interactions between conversation utterances. To dynamically learn weights between different modalities, we design a hierarchical gated fusion strategy. Positional and speaker embeddings are also leveraged as additional inputs to capture contextual and speaker information. In addition, we devise self-distillation during training to transfer knowledge of hard and soft labels within the model to learn better modal representations, which could further improve performance. We conduct experiments on two benchmark datasets and the results demonstrate the effectiveness and superiority of SDT.

Through error analysis, we find that distinguishing similar emotions, detecting emotions with unbalanced data, and emotional shift are key challenges for ERC that are worth further exploration in future work. Furthermore, transformer-based fusion methods cause high computational costs as the self-attention mechanism of the transformer has a complexity of O(N²) with respect to the sequence length N. To alleviate this issue, Ding et al. [63] proposed sparse fusion for multimodal transformers. Similarly, we plan to design a novel multimodal fusion method for transformers to reduce computational costs in the future.

ACKNOWLEDGMENTS

This work is partially supported by the Natural Science Foundation of China (No. 62006034), the Natural Science Foundation of Liaoning Province (No. 2021-BS-067), and the Fundamental Research Funds for the Central Universities (No. DUT21RC(3)015).

APPENDIX
ATTENTION VISUALIZATION

Multi-head attention weights of the transformers in our SDT that form the enhanced acoustic and visual modality representations are visualized in Fig. 10.

[Fig. 10 panels: (c) Inter-modal transformer (v → a) of SDT; (d) Intra-modal transformer (v → v) of SDT; (e) Inter-modal transformer (t → v) of SDT; (f) Inter-modal transformer (a → v) of SDT.]

Fig. 10. Multi-head attention visualization for the 4th utterance in Fig. 7.

REFERENCES

[1] A. Kumar, P. Dogra, and V. Dabas, “Emotion analysis of twitter using opinion mining,” in 2015 Eighth International Conference on Contemporary Computing, 2015, pp. 285–290.
[2] F. A. Pujol, H. Mora, and A. Martínez, “Emotion recognition to improve e-healthcare systems in smart cities,” in Research & Innovation Forum 2019, A. Visvizi and M. D. Lytras, Eds., 2019, pp. 245–254.
[3] L. Zhou, J. Gao, D. Li, and H.-Y. Shum, “The design and implementation of XiaoIce, an empathetic social chatbot,” Computational Linguistics, vol. 46, no. 1, pp. 53–93, 2020.
[4] D. Hazarika, S. Poria, A. Zadeh, E. Cambria, L.-P. Morency, and R. Zimmermann, “Conversational memory network for emotion recognition in dyadic dialogue videos,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 2122–2132.
[5] D. Hazarika, S. Poria, R. Mihalcea, E. Cambria, and R. Zimmermann, “ICON: Interactive conversational memory network for multimodal emotion detection,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 2594–2604.
[6] N. Majumder, S. Poria, D. Hazarika, R. Mihalcea, A. Gelbukh, and E. Cambria, “DialogueRNN: An attentive RNN for emotion detection in conversations,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 6818–6825.
[7] W. Jiao, H. Yang, I. King, and M. R. Lyu, “HiGRU: Hierarchical gated recurrent units for utterance-level emotion recognition,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 397–406.
[8] J. Li, D. Ji, F. Li, M. Zhang, and Y. Liu, “HiTrans: A transformer-based context- and speaker-sensitive model for emotion detection in conversations,” in Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 4190–4200.
[9] H. Ma, J. Wang, L. Qian, and H. Lin, “HAN-ReGRU: Hierarchical attention network with residual gated recurrent unit for emotion recognition in conversation,” Neural Computing and Applications, vol. 33, no. 7, pp. 2685–2703, 2021.
[10] Y. Mao, G. Liu, X. Wang, W. Gao, and X. Li, “DialogueTRM: Exploring multi-modal emotional dynamics in a conversation,” in Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 2694–2704.
[11] H. Ma, J. Wang, H. Lin, X. Pan, Y. Zhang, and Z. Yang, “A multi-view network for real-time emotion recognition in conversations,” Knowledge-Based Systems, vol. 236, p. 107751, 2022.
[12] D. Ghosal, N. Majumder, S. Poria, N. Chhaya, and A. Gelbukh, “DialogueGCN: A graph convolutional neural network for emotion recognition in conversation,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019, pp. 154–164.
[13] J. Hu, Y. Liu, J. Zhao, and Q. Jin, “MMGCN: Multimodal fusion via deep graph convolution network for emotion recognition in conversation,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 5666–5675.
[14] W. Nie, R. Chang, M. Ren, Y. Su, and A. Liu, “I-GCN: Incremental graph convolution network for conversation emotion detection,” IEEE Transactions on Multimedia, pp. 1–1, 2021.
[15] M. Ren, X. Huang, W. Li, D. Song, and W. Nie, “LR-GCN: Latent relation-aware graph convolutional network for conversational emotion recognition,” IEEE Transactions on Multimedia, pp. 1–1, 2021.
[16] A. Mehrabian et al., Silent Messages. Belmont, CA: Wadsworth, 1971.
[17] D. Hu, X. Hou, L. Wei, L. Jiang, and Y. Mo, “MM-DFN: Multimodal dynamic fusion network for emotion recognition in conversations,” in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, 2022.
[18] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “IEMOCAP: Interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008.
[19] B. Schuller, M. Valstar, R. Cowie, and M. Pantic, “AVEC 2012: The continuous audio/visual emotion challenge - an introduction,” in Proceedings of the 14th ACM International Conference on Multimodal Interaction, 2012, pp. 361–362.
[20] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea, “MELD: A multimodal multi-party dataset for emotion recognition in conversations,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 527–536.
[21] C. M. Lee and S. Narayanan, “Toward detecting emotions in spoken dialogs,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 2, pp. 293–303, 2005.
[22] L. Devillers and L. Vidrascu, “Real-life emotions detection with lexical and paralinguistic cues on human-human call center dialogs,” in Ninth International Conference on Spoken Language Processing, 2006.
[23] W. Jiao, M. Lyu, and I. King, “Real-time emotion recognition via attention gated hierarchical memory network,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, 2020, pp. 8002–8009.
[24] S. Zou, X. Huang, X. Shen, and H. Liu, “Improving multimodal fusion with main modal transformer for emotion recognition in conversation,” Knowledge-Based Systems, vol. 258, p. 109978, 2022.
[25] G. Hu, T.-E. Lin, Y. Zhao, G. Lu, Y. Wu, and Y. Li, “UniMSE: Towards unified multimodal sentiment analysis and emotion recognition,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022.
[26] P. P. Liang, Z. Liu, A. Bagher Zadeh, and L.-P. Morency, “Multimodal language analysis with recurrent multistage fusion,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 150–161.
[27] M. Wöllmer, F. Weninger, T. Knaup, B. Schuller, C. Sun, K. Sagae, and L.-P. Morency, “YouTube movie reviews: Sentiment analysis in an audio-visual context,” IEEE Intelligent Systems, vol. 28, no. 3, pp. 46–53, 2013.
[28] S. Poria, I. Chaturvedi, E. Cambria, and A. Hussain, “Convolutional MKL based multimodal emotion recognition and sentiment analysis,” in 2016 IEEE 16th International Conference on Data Mining, 2016, pp. 439–448.
[29] B. Nojavanasghari, D. Gopinath, J. Koushik, T. Baltrušaitis, and L.-P. Morency, “Deep multimodal fusion for persuasiveness prediction,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction, 2016, pp. 284–288.
[30] O. Kampman, E. J. Barezi, D. Bertero, and P. Fung, “Investigating audio, video, and text fusion methods for end-to-end automatic personality prediction,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2018, pp. 606–611.
[31] Z. Liu, Y. Shen, V. B. Lakshminarasimhan, P. P. Liang, A. Bagher Zadeh, and L.-P. Morency, “Efficient low-rank multimodal fusion with modality-specific factors,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2247–2256.
[32] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L.-P. Morency, “Tensor fusion network for multimodal sentiment analysis,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1103–1114.
[33] A. Zadeh, P. P. Liang, N. Mazumder, S. Poria, E. Cambria, and L.-P. Morency, “Memory fusion network for multi-view sequential learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[34] Y.-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, “Multimodal transformer for unaligned multimodal language sequences,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 6558–6569.
[35] W. Rahman, M. K. Hasan, S. Lee, A. Bagher Zadeh, C. Mao, L.-P. Morency, and E. Hoque, “Integrating multimodal information in large pretrained transformers,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 2359–2369.
[36] W. Yu, H. Xu, Z. Yuan, and J. Wu, “Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 12, 2021, pp. 10790–10797.
[37] Z. Yuan, W. Li, H. Xu, and W. Yu, “Transformer-based feature reconstruction network for robust multimodal sentiment analysis,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 4400–4407.
[38] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
[39] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “FitNets: Hints for thin deep nets,” in International Conference on Learning Representations, 2015.
[40] J. Yim, D. Joo, J. Bae, and J. Kim, “A gift from knowledge distillation: Fast optimization, network minimization and transfer learning,” in Proceedings of CVPR, 2017.
[41] N. Passalis and A. Tefas, “Learning deep representations with probabilistic knowledge transfer,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 268–284.
[42] T. Li, J. Li, Z. Liu, and C. Zhang, “Few sample knowledge distillation for efficient network compression,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 14639–14647.
[43] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep mutual learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4320–4328.
[44] I. Chung, S. Park, J. Kim, and N. Kwak, “Feature-map-level online adversarial knowledge distillation,” in International Conference on Machine Learning, 2020, pp. 2006–2015.
[45] L. Zhang, J. Song, A. Gao, J. Chen, C. Bao, and K. Ma, “Be your own teacher: Improve the performance of convolutional neural networks via self distillation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
[46] Y. Hou, Z. Ma, C. Liu, and C. C. Loy, “Learning lightweight lane detection CNNs by self attention distillation,” in Proceedings of the
IEEE/CVF International Conference on Computer Vision, 2019, pp. 1013–1021.
[47] S. Albanie, A. Nagrani, A. Vedaldi, and A. Zisserman, “Emotion recognition in speech using cross-modal transfer in the wild,” in Proceedings of the 26th ACM International Conference on Multimedia, 2018, pp. 292–301.
[48] Y. Wang, J. Wu, P. Heracleous, S. Wada, R. Kimura, and S. Kurihara, “Implicit knowledge injectable cross attention audiovisual model for group emotion recognition,” in Proceedings of the 2020 International Conference on Multimodal Interaction, 2020, pp. 827–834.
[49] L. Schoneveld, A. Othmani, and H. Abdelkawy, “Leveraging recent advances in deep learning for audio-visual emotion recognition,” Pattern Recognition Letters, vol. 146, pp. 1–7, 2021.
[50] T. Moriya, T. Ochiai, S. Karita, H. Sato, T. Tanaka, T. Ashihara, R. Masumura, Y. Shinohara, and M. Delcroix, “Self-distillation for improving CTC-transformer-based ASR systems,” in INTERSPEECH, 2020, pp. 546–550.
[51] T. Zhou, P. Cao, Y. Chen, K. Liu, J. Zhao, K. Niu, W. Chong, and S. Liu, “Automatic ICD coding via interactive shared representation networks with self-distillation mechanism,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 5948–5957.
[52] X. Luo, Q. Liang, D. Liu, and Y. Qu, “Boosting lightweight single image super-resolution via joint-distillation,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 1535–1543.
[53] L. Yuan, F. E. Tay, G. Li, T. Wang, and J. Feng, “Revisiting knowledge distillation via label smoothing regularization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3903–3911.
[54] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of NIPS, vol. 30, 2017.
[55] D. Ghosal, N. Majumder, A. Gelbukh, R. Mihalcea, and S. Poria, “COSMIC: COmmonSense knowledge for eMotion identification in conversations,” in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 2470–2481.
[56] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
[57] F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent developments in openSMILE, the Munich open-source multimedia feature extractor,” in Proceedings of the 21st ACM International Conference on Multimedia, 2013, pp. 835–838.
[58] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[59] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of ICLR, 2015.
[60] Y. Kim, “Convolutional neural networks for sentence classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1746–1751.
[61] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
[62] L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, 2008.
[63] Y. Ding, A. Rich, M. Wang, N. Stier, P. Sen, M. Turk, and T. Höllerer, “Sparse fusion for multimodal transformers,” arXiv preprint arXiv:2111.11992, 2021.

Hui Ma received the M.S. degree from Dalian University of Technology, China, in 2019. She is currently working toward the Ph.D. degree at the School of Computer Science and Technology, Dalian University of Technology. Her research interests include natural language processing, dialogue systems, and sentiment analysis.

Jian Wang received the Ph.D. degree from Dalian University of Technology, China, in 2014. She is currently a professor at the School of Computer Science and Technology, Dalian University of Technology. Her research interests include natural language processing, text mining, and information retrieval.

Hongfei Lin received the Ph.D. degree from Northeastern University, China, in 2000. He is currently a professor at the School of Computer Science and Technology, Dalian University of Technology. His research interests include natural language processing, text mining, and sentiment analysis.

Bo Zhang received the B.S. degree from Tiangong University, China, in 2019. He is currently working toward the Ph.D. degree at the School of Computer Science and Technology, Dalian University of Technology. His research interests include natural language processing, dialogue systems, and text generation.

Yijia Zhang received the Ph.D. degree from Dalian University of Technology, China, in 2014. He is currently a professor at the School of Information Science and Technology, Dalian Maritime University. His research interests include natural language processing, bioinformatics, and text mining.