0% found this document useful (0 votes)

37 views

Recent Trends of Multimodal Affective Computing: A Survey From NLP Perspective

great paper, not written by me for sure, I just wanna download book, pardon me

Uploaded by

mahmoudmohamed669401

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

37 views

Recent Trends of Multimodal Affective Computing: A Survey From NLP Perspective

great paper, not written by me for sure, I just wanna download book, pardon me

Uploaded by

mahmoudmohamed669401

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 26

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO.

8, AUGUST 2021 1

Recent Trends of Multimodal Affective Computing:

A Survey from NLP Perspective
Guimin Hu♠ , Yi Xin† , Weimin Lyu♡ , Haojian Huang♣ , Chang Sun⋄ , Zhihong Zhu‡ , Lin Gui△ , Ruichu Cai▽

Abstract—Multimodal affective computing (MAC) has gar- of affective computing predominantly focused on unimodal
nered increasing attention due to its broad applications in tasks, examining text-based, audio-based, and vision-based
analyzing human behaviors and intentions, especially in text- affective computing separately. For instance, D-MILN [9] is a
arXiv:2409.07388v1 [cs.CL] 11 Sep 2024

dominated multimodal affective computing field. This survey

presents the recent trends of multimodal affective computing textual sentiment classification model, while work [10] utilizes
from NLP perspective through four hot tasks: multimodal senti- BiLSTM models trained on raw audio to predict the average
ment analysis, multimodal emotion recognition in conversation, sentiment of crowd responses. Today, sentiment analysis is
multimodal aspect-based sentiment analysis and multimodal widely employed across various modalities for applications
multi-label emotion recognition. The goal of this survey is to such as market research, brand monitoring, customer service
explore the current landscape of multimodal affective research,
identify development trends, and highlight the similarities and analysis, and social media monitoring. Recent advancements in
differences across various tasks, offering a comprehensive report multimedia technology [11]–[14] have diversified the channels
on the recent progress in multimodal affective computing from an for information dissemination, with an influx of news, social
NLP perspective. This survey covers the formalization of tasks, media platforms like Weibo, and video content. These devel-
provides an overview of relevant works, describes benchmark opments have integrated textual (spoken features), acoustic
datasets, and details the evaluation metrics for each task. Ad-
ditionally, it briefly discusses research in multimodal affective (rhythm, pitch), and visual (facial attributes) information to
computing involving facial expressions, acoustic signals, phys- comprehensively analyze human emotions. For example, Xu
iological signals, and emotion causes. Additionally, we discuss et al. [15] introduces image modality data into traditional text-
the technical approaches, challenges, and future directions in based aspect-level sentiment analysis, creating the new task of
multimodal affective computing. To support further research, we multimodal aspect-based sentiment analysis. Similarly, Wang
released a repository that compiles related works in multimodal
affective computing, providing detailed resources and references et al. [16] extends textual emotion-cause pair extraction to a
for the community 1 . multimodal conversation setting, utilizing multimodal signals
(text, audio, and video) to enhance the model’s ability to
understand the emotions and their causes.
I. I NTRODUCTION Multimodal affective computing tasks closely related to sev-
Affective computing combines expertise in computer sci- eral learning paradigms in machine learning, including transfer
ence, psychology, and cognitive science and its goal is to learning [17]–[19], multimodal learning [20], [21], muti-task
equip machines with the ability to recognize, interpret, and learning [22]–[24] and semantic understanding [25], [26].
emulate human emotions [1]–[6]. Today, the world around Regarding transfer learning, it allows affective analysis models
us comprises various modalities—we perceive objects visu- trained in one domain to be adapted for effective performance
ally, hear sounds audibly, feel textures tangibly, smell odors in different domains. By fine-tuning pre-trained models on
olfactorily, and so forth. A modality refers to the way an limited data from the target domain, these models can be
experience is perceived or occurs, and is often associated transferred to new domain, thereby enhancing their perfor-
with sensory modalities such as vision or touch, which are mance in multimodal affective computing tasks. In multimodal
essential for communication and sensation. The significant learning, cross-modal attention dynamically aligns and focuses
advancements in multimodal learning across various fields [7], on relevant information from different modalities, enhancing
[8] garnered increasing attention and accelerated the progress the model’s ability to capture sentiment by highlighting key
of multimodal affective computing. features and their interactions. In multi-task learning, shared
Multimodal affective computing seeks to develop models representations across affective computing tasks and modal-
capable of interpreting and reasoning sentiment or emotional ities improve performance by capturing common sentiment-
state over multiple modalities. In its early stages, researchers related features from text, audio, and video.
♠ University
More recently, the studies of multimodal learning have
of Copenhagen, Denmark. email: [email protected].
† Nanjing University, China.
advanced the field by pre-training multimodal models on
♡ Stony Brook University, United States. extensive multimodal datasets over years, further improving
♣ University of Hong Kong, China. the performance on downstream tasks such as multimodal
⋄ University of Bologna, Italy.
‡ Peking University, China.
sentiment analysis [27]–[30]. With the scaling of the pre-
△ King’s College London, United Kingdom. trained model, parameter-efficient transfer learning emerges
▽ Guangdong University of Technology, China. such as adapter [31], prompt [32], instruction-tuning [33]
1 https://github.com/LeMei/Affective-Computing and in-context learning [34], [35]. More and more works
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 2

of multimodal affective computing leverage these parameter- affective computing tasks, the incorporation of external knowl-
efficient transfer learning methods to transfer knowledge edge, and affective computing with less-studied modalities.
from pre-trained models (e.g., unimodal pre-trained model Lastly, Section XI concludes this survey and its contribution
or multimodal pre-trained model) to downstream affective to multimodal affective computing community.
tasks to improve model performance by further fine-tuning
the pre-trained model. For instance, Zou et al. [36] design III. M ULTIMODAL A FFECTIVE C OMPUTING TASKS
a multimodal prompt Transformer (MPT) to perform cross- In this section, we show the definition of each task and
modal information fusion. UniMSE [37] proposes an adapter- discuss their application scenarios. Table I presents basic
based modal fusion method, which injects acoustic and visual information, including task input, output, type, and parent task,
signals into the T5 model to fuse them with multi-level textual for each of the four tasks.
information.
Multimodal affective computing encompasses tasks like A. Multimodal Sentiment Analysis
sentiment analysis, opinion mining, and emotion recognition
Multimodal sentiment analysis (MSA) [41] origins from
using modalities such as text, audio, images, video, physiolog-
sentiment analysis (SA) task [42] and it extends SA with the
ical signals, and haptic feedback. This survey focuses mainly
multimodal input. As a key research topics for computers to
on three key modalities: natural language, visual signals, and
understand human behaviors, the goal of multimodal sentiment
vocal signals. We highlight four main tasks in this survey:
analysis (MSA) is to predict sentiment polarity and sentiment
Multimodal Sentiment Analysis (MSA), Multimodal Emotion
intensity based on multimodal signals [43]. This task belongs
Recognition in Conversation (MERC), Multimodal Aspect-
to binary classification and regression task.
Based Sentiment Analysis (MABSA), and Multimodal Multi-
1) Task Formalization: Given a multimodal signal Ii =
label Emotion Recognition (MMER). A considerable volume
{Iit , Iia , Iiv }, we use Iim , m ∈ {t, a, v} to represent unimodal
of studies exists in the field of multimodal affective computing,
raw sequence drawn from the video fragment i, where {t, a, v}
and several reviews have been published [14], [38]–[40].
denote the three types of modalities—text, acoustic and visual.
However, these reviews primarily focus on specific affective
Multimodal sentiment analysis aims to predict the real number
computing tasks or specific single modality and overlook an
yir ∈ R, where yir ∈ [−3, 3] reflects the sentiment strength. We
overview of multimodal affective computing across multiple
feed Ii as the model input and train a model to predict yir .
tasks, and the consistencies and differences among these tasks.
2) Application Scenarios: We categorize multimodal senti-
The goal of this survey is twofold. First, this survey aims
ment analysis applications into key areas: social media mon-
to provide a comprehensive overview of multimodal affective
itoring, customer feedback, market research, content creation,
computing for beginners exploring deep learning in emotion
healthcare, and product reviews. For example, analyzing senti-
analysis, detailing tasks, inputs, outputs, and relevant datasets.
ment in text, images, and videos on social media helps gauge
Second, it also offers insights for researchers to reflect on past
public opinion and monitor brand perception, while analyzing
developments, explore future trends, and examine technical
multimedia product reviews can improve personalized recom-
approaches, challenges, and research directions in areas such
mendations and user satisfaction.
as multimodal sentiment analysis and emotion recognition.

B. Multimodal Emotion Recognition in Conversation

II. O RGANIZATION OF THIS S URVEY
Initially, MERC [39], [40], [44] extends from emotion
Section III outlines task formalization and application sce- recognition in a conversation (ERC) task [45], [46] and it
narios for multimodal affective tasks. Section IV introduces takes multimodal signals as the model inputs instead of signal
feature extraction methods and recent multimodal pre-trained modality. The goal of MERC is to automatically detect and
models like (e.g., CLIP, BLIP, BLIP2). Section V analyzes monitor the emotional states of speakers in a dialogue using
multimodal affective works from the two perspectives: multi- multimodal signals like text, audio and vision. In the com-
modal fusion and multimodal alignment, and shortly summa- munity of multimodal emotion recognition in a conversation,
rizes the parameter-efficient transfer methods used for further MERC is a multi-class classification task and then categorizes
tuning the pre-trained model. Section VI reviews literature the given utterance into one basic emotion from a pre-define
on MSA, MERC, MABSA, and MMER, focusing on mul- emotion set.
titask learning, pre-trained models, enhanced knowledge, and 1) Task Formalization: Given a dialog for k number of
contextual information. Furthermore, Section VII summarizes utterances, it can be formulated as U = {u1 , · · · , uk }, where
multimodal datasets, and Section VIII covers evaluation met- ui =< Iti , Iai , Ivi > denotes ith utterance of a conversation
rics for each multimodal affective computing task. After the containing text (text transcript), audio (speech segment) and
reviews of multimodal affective computing works. Section IX visual (video clip) modalities, denoted by {Iti , Iai , Ivi }. We use
briefly reviews multimodal affective computing works based Y as the label set of U and each utterance can be formalized
on facial expressions, acoustic signals, physiological signals, as follows:
and emotion causes. It also highlights the consistency, differ-
< U, Y >= {{ui , y i , i ∈ [1, k]}} (1)
ences, and recent trends of multimodal affective computing
tasks from an NLP perspective. Section X looks ahead to Here, y i indicates ith utterance’s emotion category that is
future work from three aspects of the unification of multimodal predefined before.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 3

TABLE I
T HE DETAILS OF MULTIMODAL AFFECTIVE TASKS . T,A,V DENOTE TEXT, AUDIO AND VISUAL MODALITIES RESPECTIVELY.

Task Input Output Granularity Task type Parent task

MSA T,A,V score,[-3,3] sentence binary classification, regression sentiment analysis [42]
MERC T,A,V one basic emotion utterance multi-class classification emotion recognition [45], [46]
MABSA T,V (aspect, sentiment polarity) aspect classification, tuple extraction, triple extraction aspect-level sentiment analysis [48]–[50]
MMER T,A,V two or more basic emotions utterance multi-label multi-class classification emotion recognition [51], [52]

2) Application Scenarios: Multimodal Emotion Recogni- D. Multimodal Multi-label Emotion Recognition

tion in Conversation (MERC) has broad applications across Multimodal signals may show more than one emotion label,
key areas: human-computer interaction, virtual assistants, which boosts the rise of a new task: multimodal multi-label
healthcare, and customer service. (i) In Human-Computer emotion recognition (MMER). MMER inherits the charac-
Interaction, MERC enhances user experience by enabling teristic of multimodal emotion recognition and multi-label
systems to recognize and respond to emotional states, lead- classification [51], [52]. MMER is developed from multi-label
ing to more personalized interactions. (ii) Virtual Assistants emotion recognition, which predicts two or more basic emo-
and Chatbots benefit from improved emotional understand- tion categories to analysis the given multimodal information
ing, making conversations more natural and engaging. (iii) and it is a multi-label multi-class classification.
In Customer Service, MERC helps agents better respond to 1) Task Formalization: Given a multimodal signal Ii =
customer emotions, enhancing satisfaction. Additionally, bio- {Iit , Iia , Iiv }, Ii contains three types of modalities—text, audio
sensing systems measuring physiological signals like ECG, and visual. Formally, we use Iim ∈ Rdm ×lm , m ∈ {t, a, v} to
PPG, EEG, and GSR expand MERC applications in robotics, represent the raw sequence of text, audio, and visual modalities
healthcare, and virtual reality. from the sample i. dm and lm denote the feature dimension
and sequence length of modality m. The goal of MMER is
to recognize at least one emotion categories from |L| pre-
defined label space Y = {y1 , y2 , · · · , y|L| } according to the
C. Multimodal Aspect-based Sentiment Analysis
multimodal signal Ii .
2) Application Scenarios: Multimodal multi-label emotion
Xu et al. [47] are among the first to put forward the new
recognition seeks to create AI systems that can understand
task, aspect based multimodal sentiment analysis. Multimodal
and categorize emotions expressed through various modalities
aspect-based sentiment analysis (MABSA) is contructed based
simultaneously. This task is challenging due to the complexity
on aspect-based sentiment analysis in texts [48], [49]. In con-
and variability of human emotions, differences in emotional
trast with MSA and ERC, multimodal aspect-based sentiment
expression across individuals and cultures, and the need for
analysis performs on fined granularity multimodal signals.
effective integration of diverse modalities.
MABSA receives texts and vision (image) modalitie as the
inputs and outputs the tuple including aspect and its sentiment
polarity. This task can be viewed as the classification, tuple IV. M ODAL F EATURE E XTRACTOR
extraction and triple extraction tasks. Recently, MABSA has For multimodal affective computing tasks, the model input
attracted increasing attention. Given an image and correspond- typically includes at least two modalities. In this section, we
ing text, MABSA is defined as jointly extracting all aspect introduce the common feature extractors that transform raw
terms from image-text pairs and predicting their sentiment sequences into a feature vectors.
polarities, i.e., positive, negative and neutral. a) Text Feature Extractor: For text modality, researchers
1) Task Formalization: Suppose the multimodal inputs in- adopt static word embedding methods like Word2Vec [53]
clude a textual content T = {w1 , w2 , ..., wL } and an im- and GloVec [54] to initialize word representation. Also, text
age set I = {I1 , I2 , · · · , IK }, the goal of MABSA is to modality can be encoded into feature vector through pre-
predict the sentiment polarities with a given aspect phrase trained language models like BERT [55], BART [56], and
A = {a1 , a2 , · · · , aN }, where ai denotes the ith aspect (e.g., T5 [57] to extract the text representation. More recently, a
food), L is the length of textual context, K is the number of collection of foundation language models like LLaMA [58],
images, and N is the length of aspect phrase. [59], Mamba [60] emerge and are used for encoding text
modality.
2) Application Scenarios: Multimodal aspect-based senti- b) Audio Feature Extractor: For audio modality, raw
ment analysis (MABSA) focuses on improving products and acoustic input needs to be processed into numerical sequential
services by analyzing reviews across text, images, and videos vectors. The common way is to use librosa 2 to extract Mel-
to identify customer opinions on specific aspects. For example, spectrogram as audio features. It is the short-term power spec-
MABSA can assess dining experiences, like food quality or trum of sound and is widely used in modern audio processing.
service, to enhance restaurant operations. It also applies to Transformer structure has achieved tremendous success of in
social media, where analyzing mixed content provides deeper the field of natural language processing and computer vision.
insights into public opinion, aiding better decision-making and
marketing strategies. 2 https://github.com/librosa/librosa.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 4

Gong et al. [61] propose audio spectrogram Transformer in Vision Language Models (VLMs) like GPT-4V [65] and
(AST), which converted waveform into a sequence of 128- Flamingo [67] allows the models to interpret and generate
dimensional log Mel filterbank (fbank) features to encode outputs based on combined visual and textual inputs. In con-
audio modality. trast with prompt, instruction-tuning belongs to the learning
c) Vision Feature Extractor: For image modality, re- paradigm of prompt. Also, models like InstructBLIP [70]
searchers can extract fixed T frames from each segment and FLAN [72] have demonstrated that instruction-tuning not
and use effecientNet [62] pre-trained (supervised) on VG- only improves the model’s adherence to instructions but also
Gface3 and AFEW dataset as vision initial representation. enhances its ability to generalize across tasks. In the commu-
Furthermore, Dosovitskiy et al. [63] propose to use standard nity of multimodal affective computing, researchers can lever-
Transformer directly to images, which split an image into age these parameter-efficient transfer learning methods (e.g.,
patches and provide the sequence of linear embeddings of adapter, prompt and instruction tuning) to transfer knowledge
these patches as an input to a Transformer. CLIP [64] jointly from pre-trained models (e.g., unimodal pre-trained model or
trained image and its caption with the contrastive learning, multimodal pre-trained model) to downstream affective tasks,
thereby extraction vision features that correspond to texts. further tune the pre-trained model with the affective dataset.
d) Multimodal Feature Extractor: The emergence of Considering that multimodal affective computing involves
multimodal pre-trained model (MPM) marks a significant ad- multimodal learning, therefore, we analyze multimodal affec-
vancement in integrating multimodal signals, as demonstrated tive computing works from multimodal fusion and multimodal
by groundbreaking developments like GPT-4 [65] and Gem- alignment, as shown in Fig. 1.
ini [66]. Among the open-source innovations, Flamingo [67]
represents an early effort to integrate visual features with
B. Multimodal Fusion
LLMs using cross-attention layers. BLIP-2 [68] introduces a
trainable adaptor module (Q-Former) that efficiently connects Multimodal signals are heterogeneous and derived from
a pre-trained image encoder with a pre-trained LLM, ensuring various information sources, making integrating multimodal
precise alignment of visual and textual information. Similarly, signals into one representation essential. Tasi et al. [74]
MiniGPT-4 [69] achieves visual and textual alignment through summarize multimodal fusion into early, late or intermediate
a linear projection layer. InstructBLIP [70] advances the field fusion based on the fusion stage. Early fusion combines
by focusing on vision-language instruction tuning, building features from different modalities at the input level before
upon BLIP-2, and requiring a deeper understanding and larger the model processes them. Late fusion processes features
datasets for effective training. LLaVA [71] integrates CLIP’s from different modalities separately through individual sub-
image encoder with LLaMA’s language decoder to enhance networks, and the outputs of these sub-networks are combined
instruction tuning capabilities. Akbari et al. [30] train VATT at a later stage, typically just before making the final decision.
end-to-end from scratch using multimodal contrastive losses Late fusion uses unimodal decision values and combines them
and evaluate its performance by the downstream tasks of using mechanisms such as averaging [121], voting schemes
video action recognition, audio event classification, image [122], weighting based on channel noise [123] and signal
classification, and text-to-video retrieval. Based on multimodal variance [124], or a learned model [6], [125]. The two fusion
pre-trained model, raw modal signals can be used to extract strategies face some problems. For example, early fusion at
modal features. the feature level can underrate intra-modal dynamics after
the fusion operation, while late fusion at the decision level
V. M ULTIMODAL L EARNING ON M ULTIMODAL A FFECTIVE may struggle to capture inter-modal dynamics before the
C OMPUTING fusion operation. Different from the previous two methods by
combining features from different modalities at intermediate
Multimodal learning involves learning representations from layers of the model learner, Intermediate fusion allows for
different modalities. Generally, the multimodal model should more interaction between the modalities at different processing
first align the modalities based on their semantics before fusing stages, potentially leading to richer representations [37], [126],
multimodal signals. After alignment, the model combines [127]. Based on these fusion strategies, we review multimodal
multiple modalities into one representation vector. fusion from three aspects: cross-modality learning, modal
consistency and difference, and multi-stage modal fusion. Fig.
A. Preliminary 2 illustrates the three aspects of modal fusion.
With the scaling of the pre-trained model, parameter- 1) Cross-modality Learning: Cross-modality learning fo-
efficient transfer learning emerges such as adapter [31], cuses on the incorporation of inter-modality dependencies and
prompt [32], instruction-tuning [33] and in-context learn- interactions for better modal fusion in representation learn-
ing [34], [35]. In this paradigm, instead of adapting pre- ing. Early works of multimodal fusion [73] mainly operate
trained LMs to downstream tasks via objective engineering, geometric manipulation in the feature spaces to fuse multiple
downstream tasks are reformulated to look more like those modalities. The recent common way of cross-modality learn-
solved during the original LM training with the help of prompt, ing is to introduce attention-based learning method to model
instruction-tuning and in-context learning. The use of prompts inter-modality and intra-modality interactions. For example,
MuLT [74] proposes multimodal Transformer to learn inter-
3 https://www.robots.ox.ac.uk/ vgg/software/vgg face/. modal interaction. Chen et al. [75] augment the inter-intra
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 5

TFN [73], MuLT [74], TCDN [75], CM-BERT [76], HGraph-CL [77], BAFN [78], TeFNA [79],
Cross-modal Learning
CMCF-SRNet [80], MultiEMO [81], MM-RBN [82], MAGDRA [83], AMuSE [84].
Multimodal Learning on

Multimodal Modal Consistency MMIM [85], MPT [86], MMMIE [87], MISA [88], CoolNet [89], ModalNet [90], MAN [84], TAILOR [91],
Fusion (§V-B) and Difference AMP [92], STCN [93].
Affective Computing

Multi-stage TSCL-FHFN [94], HFFN [95], CLMLF [96], RMFN [97], CTFN [98], MCM [99], FmlMSN [100],
Modal Fusion ScaleVLAD [101], MUG [102], HFCE [103], MTAG [104], CHFusion [105].

MMIN [106], CMAL [107], M2R2 [108], EMMR [109], TFR-Net [110], MRAN [111], VIGAN [112],
Miss Modality
TATE [113], IF-MMIN [114], CTFN [98], MTMSA [115], FGR [116], MMTE+AMMTD [117].
Multimodal
Alignment (§V-C)
Semantic Alignment MuLT [74], ScaleVLAD [101], Robust-MSA [118], HGraph-CL [77], SPIM [119], MA-CMU-SGRNet [120].

Fig. 1. Taxonomy of multimodal affective computing from multimodal fusion and multimodal alignment.

image encoder
vision audio
vision
audio encoder vision
text
audio audio
..felt a bit text encoder
common
frustrated text modal consistency
text cross-modality and difference multi-stage fusion

Fig. 2. Illustration of multimodal fusion from following aspects: 1) cross-modality modal fusion, 2) modal fusion based on modal consistency and difference
and 3) multi-stage modal fusion.

modal features with trimodal collaborative interaction and ponents. Modal consistency helps handle missing modalities,
unifies the characteristics of the three modals (inter-modal). while modal difference leverages complementary informa-
Yang et al. [76] propose the cross-modal BERT (CM-BERT), tion from each modality to improve overall data understand-
aiming to model the interaction of text and audio modality ing. For example, several works [86], [87] have explored
based on pre-trained BERT model. Lin et al. [77] explore the learning modal consistency and difference using contrastive
intricate relations of intra- and inter-modal representations for learning. Han et al. [85] maximized the mutual information
sentiment extraction. More recently, Tang et al. [78] propose between modalities and between each modality to explore the
the multimodal dynamic enhanced block to capture the intra- modal consistency. Another study [86] proposes a hybrid con-
modality sentiment context, which decrease the intra-modality trastive learning framework that performs intra-/inter-modal
redundancy of auxiliary modalities. Huang et al. [79] propose contrastive learning and semi-contrastive learning simultane-
a Text-centered fusion network with cross-modal attention ously, models cross-modal interactions, preserves inter-class
(TeFNA), a multimodal fusion network that uses crossmodal relationships, and reduces the modality gap. Additionally,
attention to model unaligned multimodal timing information. Zheng et al. [87] combined mutual information maximization
In the community of emotion recognition, CMCF-SRNet [80] between modal pairs with mutual information minimization
is a cross-modality context fusion and semantic refinement net- between input data and corresponding features. This method
work, which contains a cross-modal locality-constrained trans- aims to extract modal-invariant and task-related information.
former and a graph-based semantic refinement transformer, Modal consistency can also be viewed as the process of
aiming to explore the multimodal interaction and dependencies projecting multiple modalities into a common latent space
among utterances. Shi et al. [81] propose an attention-based (modality-invariant representation), while modal difference
correlation-aware multimodal fusion framework MultiEMO, refers to projecting modalities into modality-specific repre-
which captures cross-modal mapping relationships across sentation spaces. For example, Hazarika et al. [88] propose
textual, audio and visual modalities based on bidirectional a method that projects each modality into both a modality-
multi-head cross attention layers. In summary, cross-modality invariant and a modality-specific space. They implemented
learning mainly focuses on modeling the relation between a decoder to reconstruct the original modal representation
modalities. using both modality-invariant and modality-specific features.
2) Modal Consistency and Difference: Modal consistency AMuSE [84] proposes a multimodal attention network to
refers to the shared feature space across different modali- capture cross-modal interactions at various levels of spatial
ties for the same sample, while modal difference highlights abstraction by jointly learning its interactive bunch of mode-
the unique information each modality provides. Most multi- specific peripheral and central networks. For the fine-grain
modal fusion approaches separate representations into modal- sentiment analysis, Xiao et al. [89] present CoolNet to boost
invariant (consistency) and modal-specific (difference) com- the performance of visual-language models in seamlessly
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 6

multi-level correlation mining and self-supervised multi-task

learning, Peng et al. [100] propose a fine-grained modal label-
based multi-stage network (FmlMSN), which utilize seven
mask
sentiment labels in unimodal, bimodal and trimodal infor-
mation at different granularities from text, audio, image and
the combinations of them. Researchers generally focus on the
scale-level modal alignment and modal fusion before model’
mask
decision. Sharafi et al. [93] design a new fusion method
was proposed for multimodal emotion recognition utilizing
..felt a bit frustrated ..felt a bitscales.
different frustrated

C. Multimodal Alignment
(a)
Multimodal alignment involves synchronizing modal se-
mantics before fusing multimodal data. A key challenge is
handling missing modalities, which can occur due to issues
like a camera being turned off, a user being silent, or device
mask errors affecting both voice and text. Since the assumption of
always having all modalities is often unrealistic, multimodal
alignment must address these gaps. Additionally, it involves
aligning objects across images, text, and audio through se-
mask mantic alignment. Thus, we discuss multimodal alignment in
terms of managing missing modalities and achieving semantic
rustrated ..felt a bit frustrated alignment. Fig. 3 illustrates the multimodal alignment.
1) Alignment for Missing Modality: In real-world scenar-
ios, data collection can sometimes result in the simultaneous
(b)
loss of certain modalities due to unforeseen events. While mul-
Fig. 3. Illustration multimodal alignment:(a) semantic alignment and (b) timodal affective computing typically assumes the availability
alignment with missing modal fragments.
of all modalities, this assumption often fails in practice, which
can cause issues in modal fusion and alignment models when
some modalities are missing. We classify existing methods for
integrating vision and language information. Zhang et al. [90] handling missing modalities into four groups.
propose an aspect-level sentiment classification model by The first group features the data augmentation approach,
exploring modal consistency with fusion discriminant attention which randomly ablates the inputs to mimic missing modality
network. cases. Parthasarathy et al. [107] propose a strategy to randomly
3) Multi-stage Modal Fusion: Multi-stage multimodal fu- ablate visual inputs during training at the clip or frame level
sion [128], [129] refers to combine modal information ex- to mimic real world scenarios. Wang et al. [108] deal with
tracted from multiple stages or multiple scales to fuse modal the utterance-level modalities missing problem by training
representation. Li et al. [94] design a two-stage contrastive emotion recognition model with iterative data augmentation by
learning task, which learns similar features for data with the learned common representation. The second group is based on
same emotion category and learns distinguishable features for generative methods to directly predict the missing modalities
data with different emotion categories. HFFN [95] divides given the available modalities [131]. For example, Zhao et
the procession of multimodal fusion into divide, conquer and al. [106] propose a missing modality imagination network
combine, which learns local interactions at each local chunk (MMIN), which can predict the representation of any missing
and explores global interactions by conveying information modality given available modalities under different missing
across local interactions. Different from the work of HFFN, modality conditions, so as to to deal with the uncertain missing
Li et al. [96] align and fused the token-level features of text modality problem. Zeng et al. [109] propose an ensemble-
and image and designed label based contrastive learning and based missing modality reconstruction (EMMR) network to
data based contrastive learning to capture common features detect and recover semantic features of the key missing
related to sentiment in multimodal data. There are some modality. Yuan et al. [110] propose a transformer-based fea-
work [97] decomposed the fusion procession into multiple ture reconstruction network (TFR-Net), which improves the
stages, each of them focused on a subset of multimodal signals robustness of models for the random missing in non-aligned
for specialized, effective fusion. Also, CTFN [130] presents a modality sequences. Luo et al. [111] propose the multimodal
novel feature fusion strategy that proceeds in a hierarchical reconstruction and align net (MRAN) to tackle the missing
fashion, first fusing the modalities two in two and only then modality problem, especially to relieve the decline caused by
fusing all three modalities. Moreover, the modal fusion at the text modality’s absence.
multiple levels has made progress, such as Li et al. [99] The third group aims to learn the joint multimodal rep-
propose a multimodal sentiment analysis method based on resentations that can contain related information from these
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 7

modalities [132]. For example, Ma et al. [133] propose a sentations for the final extraction to align text and image.
unified deep learning framework to efficiently handle missing Lai et al. [119] propose a deep modal shared information
labels and missing modalities for audio-visual emotion recog- learning module based on the covariance matrix to capture the
nition through correlation analysis. Zeng et al. [113] propose shared information between modalities. Additionally, we use
a tag-assisted Transformer encoder (TATE) network to handle a label generation module based on a self- supervised learning
the problem of missing uncertain modalities, which designs strategy to capture the private information of the modalities.
a tag encoding module to cover both the single modality and Our module is plug-and-play in multimodal tasks, and by
multiple modalities missing cases, so as to guide the network’s changing the parameterization, it can adjust the information
attention to those missing modalities. Zuo et al. [114] propose exchange relationship between the modes and learn the private
to use invariant features for a missing modality imagination or shared information between the specified modes. We also
network (IF-MMIN), which includes an invariant feature learn- employ a multi-task learning strategy to help the model focus
ing strategy and an invariant feature based imagination module its attention on the modal differentiation training data. For
(IF-IM). Through the two strategies, IF-MMIN can alleviate model robustness, Robust-MSA [118] present an interactive
the modality gap during the missing modalities prediction, thus platform that visualizes the impact of modality noise to help
improving the robustness of multimodal joint representation. researchers improve model capacity.
Zhou et al. [116] propose a novel brain tumor segmentation
network in the case of missing one or more modalities. The VI. M ODELS ACROSS M ULTIMODAL A FFECTIVE
proposed network consists of three sub-networks: a feature- C OMPUTING
enhanced generator, a correlation constraint block and a seg- In the community of multimodal affective computing, the
mentation network. The last group is translation-base methods. works appear to significant consistency in term of development
Tang et al. [98] propose the coupled-translation fusion network technical route. For clarity, we group the these works based
(CTFN) to model bi-direction interplay via couple learning, on multitask learning, pre-trained model, enhanced knowledge,
ensuring the robustness in respect to missing modalities. Liu contextual information. Meanwhile, we briefly summarized the
et al. [115] propose a modality translation-based MSA model advancements of MSA, MERC, MABSA and MMER tasks
(MTMSA), which is robust to uncertain missing modalities. In through the above four aspects. Fig. 4 summarizes the typical
summary, the works about alignment for miss modality focus works of multimodal affective computing from these aspects
on miss modality reconstruction and learning based on the and Table II shows the taxonomy of multimodal affective
available modal information. computing.
2) Alignment for Cross-modal Semantics: Semantic align-
ment aims to find the connection between multiple modalities
in one sample, which refers to searching one modal informa- A. Multitask Learning
tion through another modal information and vice versa. In the Multitask learning trains a model on multiple related tasks
filed of MSA, Tsai et al. [74] leverage cross-modality and simultaneously, using shared information to enhance perfor-
multi-scale modal alignment to implement the modal consis- mance. The loss function combines losses from all tasks,
tency in the semantic aspects, respectively. ScaleVLAD [200] with model parameters updated via gradient descent. In mul-
proposes a fusion model to gather multi-Scale representation timodal affective computing, multitask learning helps distin-
from text, video, and audio with shared vectors of locally guish between modal-invariant and modal-specific features and
aggregated descriptors to improve unaligned multimodal senti- integrates emotion-related sub-tasks into a unified framework.
ment analysis. Yang et al. [104] convert unaligned multimodal Fig. 5 shows the learning paradigm of multitask learning in
sequence data into a graph with heterogeneous nodes and multimodal affective learning task.
edges that captures the rich interactions across modalities and 1) Multimodal Sentiment Analysis: In filed of multi-
through time. Lee et al. [201] segment the audio and the under- modal sentiment analysis, Self-MM [134] generates a pseudo-
lying text signals into equal number of steps in an aligned way label [205]–[207] for single modality and then jointly train
so that the same time steps of the sequential signals cover the unimodal and multimodal representations based on the gener-
same time span in the signals. Zong et al. [202] exploit multi- ated and original labels. Furthermore, a translation framework
ple bi-direction translations, leading to double multimodal fus- ARGF between modalities, i.e., translating from one modality
ing embeddings compared with traditional translation methods. to another is used as an auxiliary task to regualize the
Wang et al. [203] propose a multimodal encoding–decoding multimodal representation learning [135]. Akhtar et al. [136]
translation network with a transformer and adopted a joint en- leverage the interdependence of the tasks sentiment and emo-
coding–decoding method with text as the primary information tion to improve the model performance on two tasks. Chen et
and sound and image as the secondary information. Zhang al. [137] propose a video-based cross-modal auxiliary network
et al. [120] propose a novel multi-level alignment to bridge (VCAN), which is comprised of an audio features map module
the gap between acoustic and lexical modalities, which can and a cross-modal selection module to make use of auxiliary
effectively contrast both the instance-level and prototype-level information. Zheng et al. [138] propose a disentanglement
relationships, separating the multimodal features in the latent translation network (DTN) with slack reconstruction to cap-
space. Yu et al. [204] propose an unsupervised approach which ture desirable information properties, obtain a unified feature
minimizes the Wasserstein distance between both modalities, distribution and reduce redundancy. Zheng et al. [87] com-
forcing both encoders to produce more appropriate repre- bine mutual information maximization (MMMIE) between
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 8

MSA (§VI-A1) Self-MM [134], ARGF [135], MultiSE [136], VCAN [137], DTN [138], MMMIE [87], MMIM [85], MISA [88],

FacialMMT [24], MMMIE [87], AuxEmo [139], TDFNet [140], MALN [141], LGCCT [142], MultiEMO [81],
MERC (§VI-A2)
Multitask RLEMO [143],
Learning (§VI-A)
MABSA (§VI-A3) CMMT [144], AbCoRD [145], JML [146], MPT [36], MMRBN [82],

MMER (§VI-A4) AMP [92], MEGLN-LDA [147], MultiSE [136], AMP [92].

MAG-XLNet [21], UniMSE [37], AOBERT [148], SKESL [149], TEASAL [150], TO-BERT [151], SPT [152],
MSA (§VI-B1)
ALMT [153]
Multimodal Affective Computing

MERC (§VI-B2) FacialMMT [24], QAP [19], UniMSE [37], GraphSmile [154],
Pre-trained
Model (§VI-B)
MIMN [24], GMP [17], ERUP [155], VLP-MABSA [156], DR-BERT [157], DTCA [158], MSRA [159],
MABSA (§VI-B3)
AOF-ABSA [160], AD-GCFN [161], MOCOLNet [162],

MMER (§VI-B4) TAILOR [91],

MSA (§VI-C1) TETFN [163], ITP [18], SKEAFN [164], SAWFN [165], MTAG [104],

Enhanced MERC (§VI-C2) ConSK-GCN [166], DMD [167], MRST [168], SF [169], TGMFN [170], RLEMO [143], DEAN [171],
Knowledge
(§VI-C) MABSA (§VI-C3) KNIT [172], FITE [173], CoolNet [174], HIMT [175],

MMER (§VI-C4) UniVA-RoBERTa [176], CARAT [177], M3TR [178], MAGDRA [83], HHMPN [179],

MuLT [74], CIA [180], CAT-LSTM [181], CAMFNet [182], MTAG [104], CTNet [183], ScaleVLAD [101],
MSA (§VI-D1)
MMML [184], GFML [184], CHFusion [105],

CMCF-SRNet [80], MMGCN [185], MM-DFN [186], SAMGN [187], M3Net [188], M3GAT [185], RL-EMO [143],
Contextual MERC (§VI-D2)
SCMFN [189], EmoCaps [190], GA2MIF [191], MALN [141], COGMEN [46],
Information
(§VI-D)
MABSA (§VI-D3) DTCA [158], MCPR [192], Elbphilharmonie [193], M2DF [194], AoM [195], FGSN [196], MIMN [15],

MMER (§VI-D4) MMS2S [197], MESGN [198], MDI [199],

Fig. 4. Taxonomy of multimodal affective computing works from aspects multitask learning, pre-trained model, enhanced knowledge and contextual information.

cialMMT), which jointly trains multimodal face recognition,

unsupervised face clustering, and face matching in a unified
architecture, so as to leverages the frame-level facial emotion
Task A
knowledge sharing

distributions to help improve utterance-level emotion recogni-

tion based on multi-task learning. Zhang et al. [208] design
audio two kinds of multitask learning (MTL) decoders, i.e., single-
shared level and multi-level decoders, to explore their potential. More
encoder Task B specifically, the core of a single-level decoder is a masked
vision outer-modal self-attention mechanism. Sun et al. [139] design
two auxiliary tasks to alleviate the insufficient fusion between
modalities and guide the network to capture and align emotion-
text Task C related features. Zhao et al. [140] propose a transformer-
based deep-scale fusion network (TDFNet) for multimodal
emotion recognition, solving the aforementioned problems.
The multimodal embedding (ME) module in TDFNet uses
Fig. 5. Illustration of multitask learning in multimodal affective computing pre-trained models to alleviate the data scarcity problem by
tasks. providing a prior knowledge of multimodal information to the
model with the help of a large amount of unlabeled data. Ren
et al. [141] propose a novel multimodal adversarial learning
modal pairs with mutual information minimization between network (MALN), which first mines the speaker’s charac-
input data and corresponding features, jointly extract modal- teristics from context sequences and then incorporate them
invariant and task-related information in a single architecture. with the unimodal features. Liu et al. [142] propose LGCCT,
a light gated and crossed complementation transformer for
2) Multimodal Emotion Recognition in Conversation:
multimodal speech emotion recognition.
In community of multimodal emotion recognition, Zheng
et al. [24] propose a two-stage framework named Fa- 3) Multimodal Aspect-based Sentiment Analysis: Yang et
cial expression-aware multimodal multi-task learning (Fa- al. [144] propose a multi-task learning framework named
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 9

cross-modal multitask Transformer (CMMT), which incor-

porates two auxiliary tasks to learn the aspect/sentiment-
aware intra-modal representations and introduces a text-guided
cross-modal interaction module to dynamically control the
contributions of the visual information to the representation audio PLM
of each word in the inter-modal interaction. Jain et al. [145] encoder
propose a hierarchical multimodal generative approach (Ab-
CoRD) for aspect-based complaint and rationale detection that
reframes the multitasking problem as a multimodal text-to-text
vision
generation task. Ju et al. [146] is the first to jointly perform
multimodal ATE (MATE) and multimodal ASC (MASC), probe fusion
and propose a joint framework JML with auxiliary cross-
modal relation detection for multimodal aspect-level sentiment
text
analysis (MALSA) to control the proper exploitation of visual
information. Zou et al. [36] design a multimodal prompt Trans- Fig. 6. An illustration of pre-trained model in multimodal affective computing
tasks. PLM denotes pre-trained language model.
former (MPT) to perform cross-modal information fusion.
Meanwhile, this work used the hybrid contrastive learning
(HCL) strategy to optimize the model’s ability to handle
labels with few samples. Chen et al. [82] design that audio AOBERT [148] introduces a single-stream transformer struc-
module should be more expressive than the text module, ture, which integrates all modalities into one BERT model.
and the single-modality emotional representation should be Qian et al. [149] embed sentiment information at the word
dynamically fused into the multimodal emotion representation, level into pre-trained multimodal representation to facilitate
and proposes corresponding rule-based multimodal multi-task further learning on limited labeled data. TEASAL [150] is
network (MMRBN) to restrict the representation learning. a Transformer-Based speech-prefixed language model, which
4) Multimodal Multi-label Emotion Recognition: For mul- exploits a conventional pre-trained language model as a cross-
timodal multi-label emotion recognition, Ge et al. [92] design modal Transformer encoder. Yu et al. [151] study target-
adversarial temporal masking strategy and adversarial param- oriented multimodal sentiment classification (TMSC) and pro-
eter perturbation strategy to jointly enhance the encoding of pose a multimodal BERT architecture for multimodal senti-
other modalities and generalization of the model respectively. ment analysis task. Cheng et al. [152] set layer-wise parame-
MER-MULTI [147] is label distribution adaptation which ter sharing and factorized co-attention that share parameters
adapts the label distribution between the training set and between cross attention blocks, so as to allow multimodal
testing set to remove training samples that do not match signal to interact within every layer. ALMT [153] incorporates
the features of the testing set. Akhtar et al. [209] present an adaptive hyper-modality learning (AHL) module to learn
a deep multi-task learning framework that jointly performs an irrelevance/conflict-suppressing representation from visual
sentiment and emotion analysis both, which leverage the inter- and audio features under the guidance of language features at
dependence of two related tasks (i.e. sentiment and emotion) different scales.
in improving each others performance using an effective 2) Multimodal Emotion Recognition in Conversation: In
multimodal framework. the domain of multimodal emotion recognition in conversa-
tion, FacialMMT [24] is a two-stage framework, which takes
B. Pre-trained Model RoBERTa [214] and Swin Transformer as the backbone for
In recent years, large language model (LLM) [56], [210] representation learning. Qiu et al. [215] adopt VATT [30]
and multimodal pre-trained model [21], [26], [211], [212] has to encode vision, text and audio respectively, and makes an
achieved significant progress [25], [210], [213]. Compared alignment between the learned modal representation. QAP [19]
with non-pretrained model, pre-trained model contains massive is a quantum-inspired adaptive-priority-learning model, which
transferred knowledge [27], [31], which can be introduced employs ALBERT as the text encoder and introduces quantum
into multimodal represention learning to probe the richer theory (QT) to learn modal priority adaptively. UniMSE [37]
information. Fig. 6 shows the use of pre-trained model in proposes a multimodal fusion method based on pre-trained
multimodal affective learning task. model T5, aiming to fuse the modal information with pre-
1) Multimodal Sentiment Analysis: In filed of multimodal trained knowledge. GraphSmile [154] adopts Roberta [214]
sentiment analysis, Rahman et al. [21] propose an attachment to track intricate emotional cues in multimodal dialogues,
to pre-trained model BERT and XLNet called multimodal alternately assimilating inter-modal and intra-modal emotional
adaptation gate (MAG), which allows BERT and XL-Net to dependencies layer by layer, adequately capturing cross-modal
accept multimodal nonverbal data by generating a shift that is cues while effectively circumventing fusion conflicts.
conditioned on the visual and acoustic modalities to internal 3) Multimodal Aspect-based Sentiment Analysis: In the
representation of BERT and XLNet. UniMSE [37] is a unified study of multimodal aspect-based sentiment analysis, Xu et
sentiment-sharing framework based on T5 model [57], which al. [47] are the first to put forward the task, multimodal aspect-
injects the non-verbal signals into pre-trained Transformer- based sentiment analysis, and propose a novel multi-interactive
based model for probing the knowledge stored in LLM. memory network (MIMN), which includes two interactive
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 10

TABLE II
TAXONOMY OF MULTIMODAL AFFECTIVE COMPUTING TASK FROM TASKS OF MSA, MERC, MABSA AND MMER.

Task Model Model architecture Miss Modality Datasets

TFN [73] LSTM % MOSI,MOSEI
MuLT [74] cross-attention % MOSI,MOSEI
MISA [88] Transformer % MOSI,MOSEI,UR FUNNY
Self-MM [134] BiLSTM % MOSI,MOSEI,SIMS
MTAG [104] Graph-based neural network % MOSI, IEMOCAP
ScaleVLAD [101] VLAD,CNN % MOSI,MOSEI,IEMOCAP
MMIM [85] BERT, CPC % MOSI,MOSEI
UniMSE [37] Adapter, T5, contrastive learning % MOSI,MOSEI,IEMOCAP,MELD
MSA
CHFusion [105] CNN % MOSI,IEMOCAP
MMHA [216] Multi-head Attention % MOUD,MOSI
QMR [217] Quantum Language Mode % Getty Images,Flickr
RMFN [97] LSTHM % MOSI
HMM-BLSTM [218] HMM,BiLSTM % IEMOCAP
HALCB [219] Cognitive brain limbic system % MOSI,YouTube,MOSEI
CSFC [220] CNN,fuzzy logic % Alh,Mos,Sag,MOUD
SFNN [221] CNN,attention % vista-net
GFML [184] Multi-head attention % MOSEI,MOSI
MMML [184] cross-attention % MOSEI,MOSI,CH-SIMS
TATE [113] Transformer ! IEMOCAP,MOSI
FacialMMT-RoBERTa [222] RoBERTa % MELD, Aff-Wild2
MultiEMO [81] Multi-head attention % IEMOCAP, MELD
MMGCN [185] Graph convolutional network % IEMOCAP, MELD
MM-DFN [186] GRU,Graph conventional network % IEMOCAP,MELD
EmoCaps [190] Multi-head attention % IEMOCAP,MELD
GA2MIF [191] Graph attention networks,Multi-head attention % IEMOCAP,MELD
MALN [141] LSTM, cross-attention, Transformer-based model % IEMOCAP, MELD
M2R2 [108] GRU ! IEMOCAP, MELD
TDF-Net [140] CNN,GRU,Transformer ! IEMOCAP
SDT [223] CNN,Transformer % IEMOCAP,MELD
HCT-DMG [224] Cross-modal Transformer % CMU-MOSI, MOSEI,IEMOCAP
MERC COGMEN [46] Graph neural network, Transformer % CMU-MOSI, IEMOCAP
Qiu et al. [215] VATT(Video-Audio-Text Transformer) % CMU-MOSEI
QAP [19] VGG,AlBERT % IEMOCAP,CMU-MOSEI
MPT-HCL [36] Transformer % IEMOCAP,MELD
CMCF-SRNet [80] Transformer,RGCN % IEMOCAP,MELD
SAMGN [187] Graph neural network % IEMOCAP,MELD
M3Net [188] Graph neural network % IEMOCAP,MELD
M3GAT [225] Graph attention network % MELD,MEISD,MESD
IF-MMIN [114] Imagination Module ! IEMOCAP
AMuSE [84] Attention-based model ! IEMOCAP,MELD
DEAN [171] Transformer % MOSI,MOSEI,IEMOCAP
MMIN [106] Transformer ! IEMOCAP,MSP-IMPROV
RLEMO [143] Graph Convolutional Network,GRU % IEMOCAP,MELD
Yao et al. [189] Graph convolutional network % IEMOCAP,MELD
BCFN [226] CNN % CHERMA,CH-SIMS
MM-RBN [82] Transformer % IEMOCAP
AoM [195] BART,CNN % Twitter2015, Twitter2017
JML [146] BERT, ResNet % TRC, Twitter2015, Twitter2017
VLP-MABSA [156] Vision-Language pre-trained model % Twitter2015, Twitter2017
MABSA CMMT [144] cross-attention,self-attention % Twitter-2015,Twitter-2017,Political-Twitter
DTCA [158] RoBERTa, ViT % Twitter2015, Twitter2017
M2DF [227] pre-trained model CLIP % Twitter2015, Twitter2017
MIMN [15] Memory Network % ZOL
KNIT [172] Transformer, % Twitter2015, Twitter2017
MMS2S [197] cross-attention, multi-head attention % MOSEI
MESGN [198] cross-modal transformer % MOSEI
TAILOR [228] Transformer, cross-attention % MOSEI
MMER HHMPN [229] MPNN, multi-head attention % MOSEI
AMP [92] Transformer-based encoder-decoder, mask learning % CMU-MOSEI,NEMu
M3TR [230] CNN, Transformer, cross-attention % MS-COCO,VOC 2007
UniVA [176] VAD model, contrastive learning % MOSEI, M 3 ED
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 11

memory networks to supervise the textual and visual informa-

tion with the given aspect, and learns not only the interactive
influences between cross-modality data but also the self influ-
ences in single-modality data. Yang et al. [17] propose a novel audio
generative multimodal prompt (GMP) model for MABSA, external
which includes the multimodal encoder module and the N- knowledge
Stream decoders module and perform three MABSA-related vision
tasks with quite a small number of labeled multimodal sam- fusion representation
ples. Liu et al. [155] propose an entity-related unsupervised Fusion
pre-training with visual prompts for MABSA. Instead of using
text
sentiment-related supervised pre-training, two entity-related
unsupervised pre-training tasks are applied and compared,
which are targeted at locating the entities in text with the sup-
Fig. 7. Illustration of enhanced knowledge in multimodal affective computing
port of visual prompts. Ling et al. [156] propose a task-specific tasks.
Vision-Language Pre-training framework for MABSA (VLP-
MABSA), which is a unified multimodal encoder-decoder ar-
chitecture for all the pre-training and downstream tasks. Zhang representations in a granularity descent way.
et al. [157] construct a dynamic re-weighting BERT (DR-
BERT) based BERT and designed to learn dynamic aspect- C. Enhanced Knowledge
oriented semantics for ABSA. Jin et al. [159] propose a multi- External knowledge in machine learning and AI refers
aspect semantic relevance model that takes into account the to information from outside the training dataset, including
match between search queries and the title, attribute and image knowledge bases, text corpora, knowledge graphs, pre-trained
information of items simultaneously. Wang et al. [160] propose models, and expert insights. Integrating this knowledge can
an end-to-end MABSA framework with image conversion improve performance, generalization, interpretability, and ro-
and noise filtration, which bridges the representation gap in bustness to noisy or limited data. Fig. 7 shows the common
different modalities by translating images into the input space way of incorporating external knowledge into multimodal
of a pre-trained language model (PLM). Wang et al. [161] affective learning task.
propose an adaptive dual graph convolution fusion network 1) Multimodal Sentiment Analysis: In the area of study
(AD-GCFN) for aspect-based sentiment analysis. This model focused on multimodal sentiment analysis, Rahmani et al.
uses two graph convolution networks: one for the semantic [18] construct an adaptive tree by hierarchically dividing users
layer to learn semantic correlations by an attention mecha- and utilizes an attention-based fusion to transfer cognitive-
nism, and the other for the syntactic layer to learn syntactic oriented knowledge within the tree. TETFN [163] is a novel
structure by dependency parsing. Mu et al. [162] propose a method named text enhanced Transformer fusion network,
novel momentum contrastive learning network (MOCOLNet) which learns text-oriented pairwise cross-modal mappings for
which is an unified multimodal encoder-decoder framework obtaining effective unified multimodal representations. Zhu et
for multimodal aspect-level sentiment analysis task. An end- al. [164] propose the sentiment knowledge enhanced attention
to-end training manner is proposed to optimize MOCOLNet fusion network (SKEAFN), a novel end-to-end fusion network
parameters, which alleviates the problem of scarcity of la- that enhances multimodal fusion by incorporating additional
belled pre-training data. To fully explore multimodal contents sentiment knowledge representations from an external knowl-
especially the location information, Zhang et al. [231] design edge base. Chen et al. [165] try to incorporate sentimental
a multimodal interactive network to model the textual and words knowledge into the fusion network to guide the learning
visual information of each aspect. Two memory network-based of joint representation of multimodal features.
modules are used to capture intra-modality features, including 2) Multimodal Emotion Recognition in Conversation: In
text to aspect alignment and image to aspect alignment. the discipline of research related to multimodal emotion
Yang et al. [232] propose multi-grained fusion network with recognition in conversation, Fu et al. [166] integrate context
self-distillation (MGFN-SD) to analyze aspect-based senti- modeling, knowledge enrichment, and multimodal (text and
ment polarity, which can effectively integrate multi-grained audio) learning into a GCN-based architecture. Li et al. [167]
representation learning with self-distillation to obtain more propose a decoupled multimodal distillation (DMD) approach
representative multimodal features. that facilitates flexible and adaptive crossmodal knowledge
4) Multimodal Multi-label Emotion Recognition: A few distillation, aiming to enhance the discriminative features of
works on multimodal multi-label emotion recognition leverage each modality. Sun et al. [168] investigate a multimodal fusion
pre-trained model to improve model performance. To our transformer network based on rough set theory, which facil-
best known, TAILOR [91] is a novel framework of versa- itates the interaction of multimodal information and feature
tile multimodal learning for multi-labeL emotion recogni- guidance through rough set cross-attention. Wang et al. [169]
tion, which adversarially depicts commonality and diversity devise a novel label-guided attentive fusion module to fuse the
among multiple modalities. TAILOR adversarially extracts label-aware text and speech representations, which learns the
private and common modality representations. Then a BERT- label-enhanced text/speech representations for each utterance
like transformer encoder is devised to gradually fuse these via label-token and label-frame interactions.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 12

3) Multimodal Aspect-based Sentiment Analysis: In the

field of research on multimodal aspect-based sentiment anal- context
ysis, Xu et al. [172] introduce external knowledge, including u1 u1
textual syntax and cross-modal relevancy knowledge to Trans-
former layer, which cut off the irrelevant connections among inter-utterance
textual or cross-modal modalities by using a knowledge- dependency
induced matrix. Yang et al. [173] distill visual emotional u2 u2
cues and align them with the textual content to selectively target inter-modality target
match and fuse with the target aspect in textual modality.
CoolNet [174] is a cross-modal fine-grained alignment and
fusion network, aims to boost the performance of visual- u3 context u3
language models in seamlessly integrating vision and language
information, which transforms an image into a textual caption
and a graph structure, then dynamically aligns the semantic Fig. 8. Illustration of context information in multimodal affective computing
tasks.
and syntactic information from both the input sentence and
the generated caption, as well as models the object-level
visual features. To strengthen the semantic meanings of image
representations, Yu et al. [175] propose to detect a set of salient 1) Multimodal Sentiment Analysis: In the community of
objects in each image based on a pre-trained Faster R-CNN multimodal sentiment analysis, Chauhan et al. [180] employ a
model, and represent each object by concatenating its hidden context-aware attention module to learn intra-modality inter-
representation and associated semantic concepts, followed by action among particating modalities through encoder-decoder
an Aspect-Guided Attention layer to learn the relevance of structure. Multimodal context integrate the unimodal context,
each semantic concept with the guidance of given aspects. Poria et al. [181] propose a recurrent model with multi-level
4) Multimodal Multi-label Emotion Recognition: In the multiple attentions to capture contextual information among
area of study focused on multimodal multi-label emotion utterances, and design a recurrent model to capture contextual
recognition, Zheng et al. [176] propose to represent each information among utterances and introduced attention-based
emotion category with the valence-arousal (VA) space to networks for improving both context learning and dynamic
capture the correlation between emotion categories and de- feature fusion. Huang et al. [182] propose a novel context-
sign a unimodal VA-driven contrastive learning algorithm. based adaptive multimodal fusion network (CAMFNet) for
CARAT [177] presents contrastive feature reconstruction and consecutive frame-level sentiment prediction. Li et al. [183]
aggregation for the MMER task. Specifically, CARAT devises propose a spatial context extraction block to explore the spatial
a reconstruction-based fusion mechanism to better model context by calculating the relationships between feature maps
fine-grained modality-to-label dependencies by contrastively and the higher-level semantic representation in images.
learning modal-separated and label-specific features. Zhao et 2) Multimodal Emotion Recognition in Conversation: In
al. [178] propose a novel multimodal multi-label TRansformer the realm of research concerning multimodal emotion recog-
(M3TR) learning framework, which embeds the high-level nition in conversation, Hu et al. [185] make use of multimodal
semantics, visual structures and label-wise co-occurrences of dependencies effectively, and leverages speaker information to
multiple modalities into one unified encoding. Li et al. [83] model inter-speaker and intra-speaker dependency. Zhang et
propose a novel multimodal attention graph network with dy- al. [80] propose a cross-modality context fusion and semantic
namic routing-by-agreement (MAGDRA) for MMER. MAG- refinement network (CMCF-SRNet) to solve the limitation of
DRA is able to efficiently fuse graph data with various insufficient semantic relationship information between utter-
node and edge types as well as properly learn the cross- ances. Zhang et al. [187] construct multiple modality-specific
modal and temporal interactions between multimodal data graphs to model the heterogeneity of the multimodal context.
without pre-aligning. Zhang et al. [179] propose heterogeneous Chen et al. [188] propose a GNN-based model that explores
hierarchical message passing network (HHMPN), which can multivariate relationships and captures the varying importance
simultaneously model the feature-to-label, label-to-label and of emotion discrepancy and commonality by valuing multi-
modality-to-label dependencies via graph message passing. frequency signals. Zhang et al. [225] propose a multimodal,
multi-task interactive graph attention network, termed M3GAT,
D. Contextual Information to simultaneously solve conversational context dependency,
Context refers to the surrounding words, sentences, or multimodal interaction, and multi-task correlation in a unified
paragraphs that give meaning to a particular word or phrase. framework. RL-EMO [143] is a novel reinforcement Learn-
Understanding context is crucial for tasks like dialogue sys- ing framework for the multimodal emotion recognition task,
tems or sentiment analysis. In a conversation, context includes which combines reinforcement learning (RL) module to model
the history of previous utterances, while for news, it refers context at both the semantic and emotional levels respectively.
to the overall description provided by the entire document. Yao et al. [189] propose a speaker-centric multimodal fusion
Overall, contextual information helps machines make more network for emotion recognition in a conversation, to model
accurate predictions. Fig. 8 shows the significance of context intra-modal feature fusion and speaker-centric cross-modal
information to multimodal affective learning task. feature fusion.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 13

3) Multimodal Aspect-based Sentiment Analysis: In the video segments in the wild with both multimodal and
stduy of multimodal aspect-based sentiment analysis, Yu et independent unimodal annotations.
al. [158] propose an unsupervised approach which minimizes • CH-SIMS v2.0 [238] is an extended version of CH-
the Wasserstein distance between both modalities, forcing both SIMS that includes more data instances, spanning text,
encoders to produce more appropriate representations for the audio and visual modalities. Each modality of sample
final extraction. Xu et al. [192] design and construct a multi- is annotated with sentiment polarity, and then sample is
modal Chinese product review dataset (MCPR) to support the annotated with a concluded sentiment.
research of MABSA. Anschutz et al. [193] report the results of • CMU-MOSEAS [236] is the first large-scale multimodal
an empirical study on how semantic computing can provide in- language dataset for Spanish, Portuguese, German and
sights into user-generated content for domain experts. In addi- French, and it is collect from YouTube and its samples
tion, this work discussed different image-based aspect retrieval are 4,000 in total.
and aspect-based sentiment analysis approaches to handle and • ICT-MMMO [237] is collected from online social review
structure large datasets. Zhao et al. [194] borrow the idea videos that encompass a strong diversity in how people
of Curriculum Learning and propose a multi-grained multi- express opinions about movies and include a real-world
curriculum denoising Framework (M2DF) to adjust the order variability in video recording quality4 .
of training data, so as to obtain more contextual information. • YouTube [43] collects 47 videos from the social media
Zhou et al. [195] propose an aspect-oriented method (AoM) web site YouTube. Each video contains 3-11 utterances
to detect aspect-relevant semantic and sentiment information. with most videos having 5-6 utterances in the extracted
Specifically, an aspect-aware attention module is designed to 30 seconds.
simultaneously select textual tokens and image blocks that are
semantically related to the aspects. Zhao et al. [196] propose B. Multimodal Emotion Recognition in Conversation
a fusion with GCN and SE ResNeXt Network (FGSN), which • MELD [239] contains 13,707 video clips of multi-party
constructs a graph convolution network on the dependency tree conversations, with labels following Ekman’s six univer-
of sentences to obtain the context representation and aspects sal emotions, including joy, sadness, fear, anger, surprise
words representation by using syntactic information and word and disgust.
dependency. • IEMOCAP [240] has 13,707 video clips of multi-party
4) Multimodal Multi-label Emotion Recognition: conversations, with labels following Ekman’s six univer-
MMS2S [197] is a multimodal sequence-to-set approach to sal emotions, including joy, sadness, fear, anger, surprise
effectively model label dependence and modality dependence. and disgust.
MESGN [198] firstly proposes this task, which simultaneously • HED [241] contains happy, sad, disgust, angry and scared
models the modality-to-label and label-to-label dependencies. emotion-aligned face, body and text samples, which are
Many works consider the dependencies of multi-label based much larger than existing datasets. Moreover, the emotion
on the characteristics of co-occurrence labels. Zhao et labels were correspondingly attached to those samples by
al. [199] propose a general multimodal dialogue-aware strictly following a standard psychological paradigm.
interaction framework, named by MDI, to model the impacts • RML [242] collects video samples from eight subjects,
of dialogue context on emotion recognition. speaking six different languages. The six languages are
English, Mandarin, Urdu, Punjabi, Persian, and Italian.
VII. DATASETS OF M ULTIMODAL A FFECTIVE C OMPUTING This dataset contains 500 video samples, each delivered
with one of the six particular emotions.
In this section, we introduce the benchmark datasets of • BAUM-1 [243] contains two sets: BAUM-1a and BAUM-
MSA, MERC, MABSA, and MMER tasks. To facilitate easy 1s databases. BAUM-1a database contains clips con-
navigation and reference, the details of datasets are shown in taining expressions of five basic emotions (happiness,
Table III with a comprehensive overview of the studies that sadness, anger, disgust, fear) along with expressions
we cover. of boredom, confusion (unsure) and interest (curiosity).
BAUM-1s database contains clips reflecting six basic
A. Multimodal Sentiment Analysis emotions and also expressions of boredom, contempt,
confusion, thinking, concentrating, bothered, and neutral.
• MOSI [233] contains 2,199 utterance video segments, • MAHNOB-HCI [244] includes 527 facial video record-
and each segment is manually annotated with a sentiment ings of 27 participants engaged in various tasks and
score ranging from -3 to +3 to indicate the sentiment interactions, while their physiological signals such as 32-
polarity and relative sentiment strength of the segment. channel electroencephalogram (EEG), 3-channel electro-
• MOSEI [234] is an upgraded version of MOSI, anno- cardiogram.
tated with both sentiment and emotion. MOSEI contains • Deap [245] contains data from 32 participants, aged
22,856 movie review clips from YouTube. Each sample between 19 and 37 (50% female), who were recorded
in MOSEI includes sentiment annotations ranging from watching 40 one-minute music videos. Each participant
-3 to +3 and multi-label emotion annotations. was asked to evaluate each video by assigning values
• CH-SIMS [235] is a Chinese single- and multimodal
sentiment analysis dataset, which contains 2,281 refined 4 http://multicomp.ict.usc.edu
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 14

TABLE III
L IST OF MULTIMODAL AFFECTIVE COMPUTING DATASETS . T,A,V DENOTE THE TEXT, AUDIO AND VISION MODALITIES , RESPECTIVELY. E MOTION
DENOTES THE SAMPLE IN DATASET IS LABELED WITH EMOTION CATEGORY AND SENTIMENT DENOTES THE SAMPLE IN DATESET IS LABELED WITH
SENTIMENT POLARITY.

Task Dataset Modalities Source Emotion Sentiment Language Datasize

CMU-MOSI [233] T,A,V Video Blog, YouTube % ! English 2,199
CMU-MOSEI [234] T,A,V YouTube ! ! English 22,856
CH-SIMS [235] T,A,V Movies,TVs % ! Chinese 2,281
MSA
CMU-MOSEAS [236] T,A,V YouTube ! ! Spanish,Portuguese,German,French 4,000
ICT-MMMO [237] T,A,V reviews - - - -
YouTube [43] T,A,V YouTube - - English 300
CH-SIMS v2.0 [238] T,A,V TV series,Shows,Movies % ! Chinese 14,402
MELD [239] T,A,V Friends TV ! % English 7,532
IEMOCAP [240] T,A,V Act ! % English 13,707
HED [241] T,V Movies, TVs ! % English 17,441
MERC RML [242] A,V Video ! % English, Mandarin, Urdu, Punjabi, Persian, and Italian -
BAUM-1 [243] A,V Data collection ! % Turkish 1,184
MAHNOB-HCI [244] V, EEG Data collection ! % - -
Deap [245] EEG Act, Data collection ! ! physiological signal -
MuSe-CaR [246] T,A,V Car reviews,YouTube ! ! English
CHEAVD [247] A,V Movies,TVs - - Mandarin 7,030
MSP-IMPROV [248] T,A,V Act ! % English 8,438
MEISD [249] T,A,V TVs ! ! English -
MESD [250] T,A,V TVs ! ! English 9,190
Ulm-TSST [251] A,V,EEG job interviews ! ! English -
CHERMA [252] T,A,V TV series,Shows,Movies ! ! Chinese 28,717
AMIGOS [253] EEG, ECG, GSR Data collection % ! - -
Twitter2015 [254] T,V Twitter % ! English -
Twitter2017 [254] T,V Twitter % ! English -
MABSA MCPR [255] T,V Product reviews % ! Chinese 15,000
Multi-ZOL [47] T,V Product reviews % ! Chinese 5,288
MACSA [256] T,V Hotel service reviews % ! Chinese 21,108
MASAD [257] T,V Visual sentiment ontology datasets % ! English 38,532
PanoSent [258] T,A,V Social media % ! English,Chinese, Spanish 10,000
CMU-MOSEI [234] T,A,V YouTube ! ! English 22,856
MMER
M 3 ED [199] T,A,V 56 TVs ! % Mandarin 24,449

from 1 to 9 for arousal, valence, dominance, like/dislike, • AMIGOS [253] is collected in two experimental set-
and familiarity. tings. In the first setting, 40 participants viewed 16
• MuSe-CaR [246] focuses on the tasks of emotion, short emotional videos. In the second setting, participants
emotion-target engagement, and trustworthiness recogni- watched 4 longer videos, some individually and others
tion by means of comprehensively integrating the audio- in groups. During these sessions, participants’ phys-
visual and language modalities. iological signals—Electroencephalogram (EEG), Elec-
• CHEAVD 2.0 [247] is selected from Chinese movies, trocardiogram (ECG), and Galvanic Skin Response
soap operas and TV shows, which contains noise in the (GSR)—were recorded using wearable sensors.
background to mimic real-world conditions.
• MSP-IMPROV [248] is a multimodal emotional
database comprised of spontaneous dyadic interactions, C. Multimodal Aspect-based Sentiment Analysis
designed to study audiovisual perception of expressive • Twitter2015 and Twitter2017 are originally provided by
behaviors. the work [254] for multimodal named entity recognition
• MEISD [249] is a large-scale balanced multimodal multi- and annotated with the sentiment polarity for each aspect
label emotion, intensity, and sentiment dialogue dataset by the work [13].
(MEISD) collected from different TV series that has • MCPR [255] has 2,719 text-image pairs and 610 distinct
textual, audio, and visual features. aspects in total, which collects 1.5k product reviews
• MESD [250] is the first multimodal and multi-task senti- involving clothing and furniture departments, from the e-
ment, emotion, and desire dataset, which contains 9,190 commercial platform JD.com. It is the first aspect-based
text-image pairs, with English text. multimodal Chinese product review dataset.
• Ulm-TSST [251] is a multimodal dataset, where partic- • Multi-ZOL [47] consists of reviews of mobile phones
ipants were recorded in a stressful situation emulating a collected from ZOL.com. It contains 5,288 sets of multi-
job interview, following the TSST protocol. modal data points that cover various models of mobile
• CHERMA [252] provides uni-modal labels for each indi- phones from multiple brands. These data points are
vidual modality, and multi-modal labels for all modalities annotated with a sentiment intensity rating from 1 to 10
jointly observed. It is collected from various source, for six aspects.
including 148 TV series, 7 variety shows, and 2 movies. • MACSA [256] contains more than 21K text-image pairs,
and provides fine-grained annotations for both textual and
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 15

visual content and firstly uses the aspect category as the IX. D ISCUSS
pivot to align the fine-grained elements between the two In this section, we briefly discuss the works of multimodal
modalities. affective computing based on facial expression, acoustic sig-
• MASAD [257] selects 38,532 samples from a partial nal, physiological signals, and emotion cause. Furthermore, we
VSO visual dataset [259] (approximately 120,000 sam- discuss the technical routes across multiple multimodal affec-
ples) that can clearly express sentiments and categorized tive computing tasks to track their consistency and difference.
them into seven domains: food, goods, buildings, animal,
human, plant, scenery, with a total of 57 predefined
aspects. A. Other Multimodal Affective Computing
• PanoSent [258] is annotated both manually and au- a) Multimodal Affective Computing Based on Facial
tomatically, featuring high quality, large scale (10,000 Expression Recognition: Facial expression recognition has
dialogues), multimodality (text, image, audio and video), significantly evolved over the years, progressing from static to
multilingualism (English, Chinese and Spanish), multi- dynamic methods. Initially, static facial expression recognition
scenarios (over 100 domains), and covering both im- (SFER) relied on single-frame images, utilizing traditional
plicit&explicit sentiment elements. image processing techniques such as Local Binary Patterns
(LBP) and Gabor filters to extract features for classification.
The advent of deep learning brought Convolutional Neural
D. Multimodal Multi-label Emotion Recognition Networks (CNNs), which markedly improved the accuracy
• CMU-MOSEI [234] contains 22,856 movie review clips of SFER [260]–[263]. However, static methods were lim-
from Youtube videos. Each video intrinsically contains ited in capturing the temporal dynamics of facial expres-
three modalities: text, audio, and visual, and each movie sions [264]. Some methods attempt to approach the problem
review clip is annotated with at least one emotion cate- from a local-global feature perspective, extracting more fine-
gory of the set: angry, disgust, fear, happy, sad, surprise. grained visual representations and identifying key informative
• M3 ED [199] is a multimodal emotional dialogue dataset segments [265]–[270]. These approaches enhance robustness
in Chinese, which contains a total of 9,082 turns and against noisy frames, enabling uncertainty-aware inference. To
24,449 utterances, and each utterance is annotated with further enhance accuracy, recent advancements in DFER focus
the seven emotion categories (happy, surprise, sad, dis- on integrating multimodal data and employing parameter-
gust, anger, fear, and neutral). efficient fine-tuning (PEFT) to adapt large pre-trained models
for enhanced performance. Liu et al. [271] introduces the
concept of expression reenactment (i.e. normalization), har-
VIII. E VALUATION M ETRICS nessing generative AI to mitigate noise in in-the-wild datasets.
Moreover, the burgeoning evidential deep learning (EDL) has
In this section, we report the mainstream evaluation metrics shown considerable promise by enabling explicit uncertainty
for each multimodal affective computing task. quantification through the distributional measurement in la-
a) Multimodal Sentiment Analysis: Previous works adopt tent spaces for improved interpretability, with demonstrated
mean absolute error (MAE), Pearson correlation (Corr), seven- efficacy in zero-shot learning [272], multi-view classifica-
class classification accuracy (ACC-7), binary classification tion [273]–[275], video understanding [276]–[278] and multi-
accuracy (ACC-2) and F1 score computed for positive/negative modal named entity recognition.
and non-negative/negative classification as evaluation metrics. b) Multimodal Affective Computing Based on Acoustic
b) Multimodal Emotion Recognition in a conversation: Signal: The model based on single-sentence single-task is
Accuracy (ACC) and weighted F1 (WF1) are used for evalua- the most common model in speech emotion recognition. For
tion. Additionally, the imbalance label distribution results in a example, Aldeneh et al. [279] use CNN to perform convolu-
phenomenon that the trained model performs better on some tions in the time direction of handcrafted temporal features
categories and perform poorly on others. In order to verify the (40-dimensional MFSC) to identify emotion-salient regions
impacts of data distribution on model performance, researchers and used global max pooling to capture important temporal
also provide ACC and F1 on each emotion category to measure areas. Li et al. [280] apply two different convolution kernels
the model performance. on spectrograms to extract temporal and frequency domain
c) Multimodal Aspect-based Sentiment Analysis: With features, concatenated them, and input them into a CNN
the previous methods, for multimodal aspect term extrac- for learning, followed by attention mechanism pooling for
tion (MATE) and joint multimodal aspect sentiment analysis classification. Trigeorgis et al. [281] use CNN for end-to-end
(JMASA) tasks, researchers use precision (P), recall (R) and learning directly on speech signals, avoiding the problem of
micro-F1 (F1) as the evaluation metrics. For the multimodal feature extraction not being robust for all speakers. Mirsamadi
aspect sentiment classification(MASC) task, accuracy (ACC) et al. [282] combine Bidirectional LSTM (Bi-LSTM) with
and macro-F1 are as evaluation metrics. a novel pooling strategy, utilizing attention mechanisms to
d) Multimodal Multi-label Emotion Recognition: Ac- enable the network to focus on emotionally prominent parts of
cording to the prior work, multi-label classification works sentences. Zhao et al. [283] consider the temporal and spatial
mostly adopt accuracy (ACC), micro-F1, precision (P) and characteristics of the spectrum in the attention mechanism to
recall (R) as evaluation metrics. learn time-related features in spectrograms, and using CNN
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 16

to learn frequency-related features in spectrograms. Luo et tasks such as image captioning [7], [299], the impact of vision
al. [284] propose a dual-channel speech emotion recognition is more significant than language. In contrast, multimodal
model that uses CNN and RNN to learn from spectrograms affective computing tasks place a greater emphasis on lan-
on one hand, and separately learns HSFs features on the other, guage [37], [300].
finally concatenating the obtained features for classification. b) Pre-trained model: Generally, pre-trained models are
c) Multimodal Affective Computing Based on Physiolog- used to encode raw modal information into vectors. From this
ical Signals: In medical measurements and health monitoring, perspective, multimodal affective computing tasks adopt pre-
EEG-based emotion recognition (EER) is one of the most trained models as the backbone and then fine-tune them for
promising directions within emotion recognition and has at- downstream tasks. For example, UniMSE [37] uses T5 as the
tracted substantial research attention [285]–[287]. Notably, the backbone, while GMP [17] utilizes BART. These approaches
field of affective computing has seen nearly 1,000 publica- aim to transfer the general knowledge embedded in pre-trained
tions related to EER since 2010 [288]. Numerous EEG-based language models to the field of affective computing.
multimodal emotion recognition (EMER) methods have been c) Enhanced knowledge: Commonsense knowledge en-
proposed [289]–[293], leveraging the complementarity and re- compasses facts and judgments about our natural world. In
dundancy between EEG and other physiological signals in ex- the field of affective computing, this knowledge is crucial for
pressing emotions. For example, Vazquez et al. [294] address enabling machines to understand human emotions and their
the problem of multimodal emotion recognition from multiple underlying causes. Researchers enhance affective computing
physiological signal, which demonstrates Transformer-based by integrating external knowledge sources such as sentiment
approach is suitable for emotion recognition based on physi- lexicons [301], English knowledge bases [302]–[306], and
ological signal. Chinese knowledge bases [307] as the external knowledge to
d) Multimodal Affective Computing Based on Emotion enhance affective computing.
Cause: Apart from focusing on the emotions themselves, the d) Contextual information: Affective computing tasks
capacity of machine for understanding the cause that triggers require an understanding of contextual information. In MERC,
an emotion is essential for comprehending human behaviors, contextual information encompasses the entire conversation,
which makes emotion-cause pair extraction (ECPE) crucial. including both previous and subsequent utterances relative to
Over the years, text-based ECPE has made significant progress the current utterance. For MABSA, contextual information
[295], [296]. Based on ECPE, Li et al. [297] propose multi- refers to the full sentence containing customer opinions. Re-
modal emotion-cause pair extraction (MECPE), which aims searchers integrate contextual information using hierarchical
to extract emotion-cause pairs with multimodal information. approaches [308], [309], self-attention mechanisms [55], and
Initially, Li et al. [297] construct a joint training architecture, graph-based dependency modeling [310], [311]. Additionally,
which contains the main task, i.e., multimodal emotion-cause affective computing tasks can enhance understanding by incor-
pair extraction and two subtasks, i.e., multimodal emotion porating non-verbal cues such as facial expressions and vocal
detection and cause detection. To solve MECPE, researchers tone, alongside textual information.
borrowed the multitask learning framework to train the model
using multiple training objectives of sub-tasks, aiming to
C. Difference among Multimodal Affective Computing
enhance the knowledge sharing among them. For example,
Li et al. [298] propose a novel model that captures holistic We examine the differences among multimodal affective
interaction and label constraint (HiLo) features for the MECPE computing tasks by considering the type of downstream tasks,
task. HiLo enables cross-modality and cross-utterance feature sentiment granularity, and application contexts to identify the
interactions through various attention mechanisms, providing unique characteristics of each task.
a strong foundation for accurate cause extraction. For downstream tasks, MSA predicts sentiment strength as
a regression task. MERC is a multi-class classification task for
identifying emotion categories. MMER performs multilabel
B. Consistency among Multimodal Affective Computing emotion recognition, detecting multiple emotions simultane-
We categorize the multimodal affective computing tasks ously. MABSA involves extracting aspects and opinions to
into several key areas: multimodal alignment and fusion, determine sentiment polarity, categorizing it as information ex-
multi-task learning, pre-trained models, enhanced knowledge, traction. In terms of analysis granularity, MERC and MECPE
and contextual information. To ensure clarity, we discuss the focus on utterances and speakers within a conversation, while
consistencies across these aspects. MSA and MMER concentrate on sentence-level information
a) Multimodal alignment and fusion: Among MSA, within a document. MABSA, on the other hand, focuses
MERC, MABSA and MMER tasks, each is fundamentally a on aspects within comments. Some studies infer fine-grained
multimodal task that involves considering and combining at sentiment from coarse-grained sentiment [207], [312] or in-
least two modalities to make decisions. This process includes tegrate tasks of different granularities into a unified training
extracting features from each modality and integrating them framework [300]. Due to these differences in granularity,
into a unified representation vector. In multimodal represen- the contextual information varies as well. For instance, in
tation learning, modal alignment and fusion are two critical MABSA, the context includes the comment along with any
issues that must be addressed to advance the field of multi- associated images and short descriptions of aspects, whereas
modal affective computing. For vision-dominated multimodal in MERC, the context encompasses the entire conversation
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 17

and speaker information. In terms of application scenarios, XI. C ONCLUSION

MSA, MMER, and MABSA are used for public opinion Multimodal Affective Computing (MAC) has emerged as a
analysis and mining user experiences related to products or crucial research direction in artificial intelligence with signifi-
services. MERC and MECPE help machines understand and cant progress in understanding and interpreting emotions. This
mimic human behaviors, generating empathetic responses in survey provides a comprehensive overview of the diverse tasks
dialogue agents. While many tasks are context-specific, there associated with multimodal affective computing, covering
is a growing trend toward unified frameworks for analyzing its research background, definitions, related work, technical
human emotions across diverse settings (e.g., task type and approaches, benchmark dataset, and evaluation metrics. We
emotion granularity) [37], [136]. group multimodal affective computing across MSA, MERC,
MABSA and MMER tasks into four categories: multi-task
X. F UTURE WORK learning, pre-trained modal, enhanced knowledge and context
information. Additionally, we summarize the consistency and
We outline directions for future work in multimodal af-
differences among various affective computing tasks. Also,
fective computing from Transfer Learning with Multimodal
we report the inherent challenges in multimodal sentiment
Pre-trained Model, Unification of Multimodal Affective Com-
analysis and explore potential directions for future research
puting tasks, Model with External Knowledge Distill, and
and development.
Affective Computing with Less-studied Modalities.
a) Unification of Multimodal Affective Computing tasks:
R EFERENCES
Recent advances have made significant strides by unifying
related yet distinct tasks into a single framework [25], [210], [1] Z. Zhu, X. Zhuang, Y. Zhang, D. Xu, G. Hu, X. Wu, and Y. Zheng,
“Tfcd: Towards multi-modal sarcasm detection via training-free coun-
[213]. For example, T5 [57] integrates various NLP tasks by terfactual debiasing,” in Proceedings of the Thirty-Third International
representing all text-based problems in a text-to-text format, Joint Conference on Artificial Intelligence, IJCAI-24, K. Larson, Ed.
achieving state-of-the-art results across numerous benchmarks. International Joint Conferences on Artificial Intelligence Organization,
2024, pp. 6687–6695.
These studies highlight the effectiveness of such unified [2] Z. Zhu, X. Cheng, G. Hu, Y. Li, Z. Huang, and Y. Zou, “Towards multi-
frameworks in enhancing model performance and generaliza- modal sarcasm detection via disentangled multi-grained multi-modal
tion [212], [212], [313]. Meanwhile, the progress also suggests distilling,” in Proceedings of the 2024 Joint International Conference
on Computational Linguistics, Language Resources and Evaluation,
the potential for unifying multimodal affective computing LREC/COLING 2024, 20-25 May, 2024, Torino, Italy, N. Calzolari,
tasks across diverse application scenarios. First, unification M. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue, Eds. ELRA and
across different granularity–from fine to coarse—has been ICCL, 2024, pp. 16 581–16 591.
[3] A. Ben-Ze’ev, The subtlety of emotions. MIT press, 2001.
employed to train models effectively [207]. Second, an in- [4] R. K. Shelly, “Emotions, sentiments, and performance expectations,” in
creasing number of pre-trained models now handle language, Theory and research on human emotions. Emerald Group Publishing
vision, and audio simultaneously, enabling end-to-end process- Limited, 2004.
[5] R. J. Davidson, K. R. Sherer, and H. H. Goldsmith, Handbook of
ing across single, dual, and multiple modalities [314]. Third, affective sciences. Oxford University Press, 2009.
integrating emotion-cause analysis with emotion recognition in [6] G. A. Ramı́rez, T. Baltrusaitis, and L. Morency, “Modeling latent
a multimodal setting within a single architecture can enhance discriminative dynamic of multi-dimensional affective signals,” in
Affective Computing and Intelligent Interaction - Fourth International
their mutual indications and improve overall performance. Conference, ACII 2011, Memphis, TN, USA, October 9-12, 2011,
b) Transfer Learning with External Knowledge Distill: Proceedings, Part II, 2011, pp. 396–406.
In the field of affective computing, incorporating external [7] J. Li, R. R. Selvaraju, A. Gotmare, S. R. Joty, C. Xiong, and S. C. Hoi,
“Align before fuse: Vision and language representation learning with
knowledge such as sentiment lexicons and commonsense momentum distillation,” in Advances in Neural Information Processing
knowledge is crucial for a deeper understanding of emotional Systems 34: Annual Conference on Neural Information Processing Sys-
expressions within the context of social norms and cultural tems 2021, NeurIPS 2021, December 6-14, 2021, virtual, M. Ranzato,
A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan, Eds.,
backgrounds [315]. For example, The expression and percep- 2021, pp. 9694–9705.
tion of emotion also varies across cultures, both in text and [8] N. C. Garcia, P. Morerio, and V. Murino, “Modality distillation with
in face-to-face communication [316]. These differences are multiple stream networks for action recognition,” in Computer Vision -
ECCV 2018 - 15th European Conference, Munich, Germany, September
critical for cross-cultural sentiment analysis. 8-14, 2018, Proceedings, Part VIII, ser. Lecture Notes in Computer
c) Affective Computing with Less-studied Modalities: Science, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds.,
Natural language (spoken and written), visual data (images vol. 11212. Springer, 2018, pp. 106–121.
[9] Y. Ji, H. Liu, B. He, X. Xiao, H. Wu, and Y. Yu, “Diversified
and videos), and auditory signals (speech, sound, and music) multiple instance learning for document-level multi-aspect sentiment
have long been central to multimodal affective computing. classification,” in Proceedings of the 2020 conference on empirical
Recently, new sensing data types, like haptic and ECG signals, methods in natural language processing (EMNLP), 2020, pp. 7012–
7023.
are gaining attention. Haptic signals, which involve touch [10] P. J. Donnelly and A. Prestwich, “Identifying sentiment from crowd
and convey sensory and emotional attributes, enhance user audio,” in 7th International Conference on Frontiers of Signal Pro-
experiences in areas such as gaming, virtual reality, and cessing, ICFSP 2022, Paris, France, September 7-9, 2022, 2022, pp.
64–69.
mobile apps [317]. These signals offer immediate feedback [11] Z. Sun, P. K. Sarma, W. A. Sethares, and Y. Liang, “Learning relation-
and can improve user engagement. As research progresses, ships between text, audio, and video via deep canonical correlation for
less-studied modalities like haptic will likely become crucial, multimodal language analysis,” in The Thirty-Fourth AAAI Conference
on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative
complementing established methods and advancing the field Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth
of affective computing. AAAI Symposium on Educational Advances in Artificial Intelligence,
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 18

EAAI 2020, New York, NY, USA, February 7-12, 2020, 2020, pp. 8992– [28] H. Bao, W. Wang, L. Dong, Q. Liu, O. K. Mohammed, K. Aggarwal,
8999. S. Som, S. Piao, and F. Wei, “Vlmo: Unified vision-language pre-
[12] A. Zadeh, C. Mao, K. Shi, Y. Zhang, P. P. Liang, S. Poria, and training with mixture-of-modality-experts,” in NeurIPS, 2022.
L. Morency, “Factorized multimodal transformer for multimodal se- [29] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and
quential learning,” CoRR, vol. abs/1911.09826, 2019. Y. Wu, “Coca: Contrastive captioners are image-text foundation mod-
[13] D. Lu, L. Neves, V. Carvalho, N. Zhang, and H. Ji, “Visual attention els,” Trans. Mach. Learn. Res., vol. 2022, 2022.
model for name tagging in multimodal social media,” in Proceedings [30] H. Akbari, L. Yuan, R. Qian, W. Chuang, S. Chang, Y. Cui, and
of the 56th Annual Meeting of the Association for Computational B. Gong, “VATT: transformers for multimodal self-supervised learning
Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume from raw video, audio and text,” in Advances in Neural Information
1: Long Papers, 2018, pp. 1990–1999. Processing Systems 34: Annual Conference on Neural Information
[14] A. Gandhi, K. Adhvaryu, S. Poria, E. Cambria, and A. Hussain, “Mul- Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual,
timodal sentiment analysis: A systematic review of history, datasets, 2021, pp. 24 206–24 221.
multimodal fusion methods, applications, challenges and future direc- [31] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe,
tions,” Inf. Fusion, vol. 91, pp. 424–444, 2023. A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer
[15] N. Xu, W. Mao, and G. Chen, “Multi-interactive memory network learning for NLP,” in Proceedings of the 36th International Conference
for aspect based multimodal sentiment analysis,” in The Thirty-Third on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach,
AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty- California, USA, 2019, pp. 2790–2799.
First Innovative Applications of Artificial Intelligence Conference, [32] X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts
IAAI 2019, The Ninth AAAI Symposium on Educational Advances in for generation,” in Proceedings of the 59th Annual Meeting of the
Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 Association for Computational Linguistics and the 11th International
- February 1, 2019, 2019, pp. 371–378. Joint Conference on Natural Language Processing, ACL/IJCNLP 2021,
[16] F. Wang, Z. Ding, R. Xia, Z. Li, and J. Yu, “Multimodal emotion-cause (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, C. Zong,
pair extraction in conversations,” CoRR, vol. abs/2110.08020, 2021. F. Xia, W. Li, and R. Navigli, Eds. Association for Computational
Linguistics, 2021, pp. 4582–4597.
[17] X. Yang, S. Feng, D. Wang, Q. Sun, W. Wu, Y. Zhang, P. Hong, and
[33] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du,
S. Poria, “Few-shot joint multimodal aspect-sentiment analysis based
A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot
on generative multimodal prompt,” in Findings of the Association for
learners,” arXiv preprint arXiv:2109.01652, 2021.
Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14,
2023, 2023, pp. 11 575–11 589. [34] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhari-
wal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal,
[18] S. Rahmani, S. Hosseini, R. Zall, M. R. Kangavari, S. Kamran,
A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh,
and W. Hua, “Transfer-based adaptive tree for multimodal sentiment
D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler,
analysis based on user latent aspects,” Knowl. Based Syst., vol. 261, p.
M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish,
110219, 2023.
A. Radford, I. Sutskever, and D. Amodei, “Language models are few-
[19] Z. Li, Y. Zhou, Y. Liu, F. Zhu, C. Yang, and S. Hu, “QAP: shot learners,” in Advances in Neural Information Processing Systems
A quantum-inspired adaptive-priority-learning model for multimodal 33: Annual Conference on Neural Information Processing Systems
emotion recognition,” in Findings of the Association for Computational 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle,
Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, 2023, pp. M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., 2020.
12 191–12 204. [35] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun,
[20] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, J. Xu, and Z. Sui, “A survey on in-context learning,” arXiv preprint
“Multimodal deep learning,” in Proceedings of the 28th International arXiv:2301.00234, 2022.
Conference on Machine Learning, ICML 2011, Bellevue, Washington, [36] S. Zou, X. Huang, and X. Shen, “Multimodal prompt transformer with
USA, June 28 - July 2, 2011, L. Getoor and T. Scheffer, Eds. hybrid contrastive learning for emotion recognition in conversation,”
Omnipress, 2011, pp. 689–696. CoRR, vol. abs/2310.04456, 2023.
[21] W. Rahman, M. K. Hasan, S. Lee, A. B. Zadeh, C. Mao, L. Morency, [37] G. Hu, T. Lin, Y. Zhao, G. Lu, Y. Wu, and Y. Li, “Unimse: Towards
and M. E. Hoque, “Integrating multimodal information in large pre- unified multimodal sentiment analysis and emotion recognition,” in
trained transformers,” in Proceedings of the 58th Annual Meeting of Proceedings of the 2022 Conference on Empirical Methods in Nat-
the Association for Computational Linguistics, ACL 2020, Online, July ural Language Processing, EMNLP 2022, Abu Dhabi, United Arab
5-10, 2020, 2020, pp. 2359–2369. Emirates, December 7-11, 2022, 2022, pp. 7837–7851.
[22] Y. Zhang and Q. Yang, “A survey on multi-task learning,” IEEE Trans. [38] Y. Zhang, X. Yang, X. Xu, Z. Gao, Y. Huang, S. Mu, S. Feng, D. Wang,
Knowl. Data Eng., vol. 34, no. 12, pp. 5586–5609, 2022. Y. Zhang, K. Song et al., “Affective computing in the era of large
[23] Y. Xie, K. Yang, C. Sun, B. Liu, and Z. Ji, “Knowledge-interactive language models: A survey from the nlp perspective,” arXiv preprint
network with sentiment polarity intensity-aware multi-task learning arXiv:2408.04638, 2024.
for emotion recognition in conversations,” in Findings of the Asso- [39] B. Pan, K. Hirota, Z. Jia, and Y. Dai, “A review of multimodal emotion
ciation for Computational Linguistics: EMNLP 2021, Virtual Event / recognition from datasets, preprocessing, features, and fusion methods,”
Punta Cana, Dominican Republic, 16-20 November, 2021, M. Moens, Neurocomputing, vol. 561, p. 126866, 2023.
X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computa- [40] K. Ezzameli and H. Mahersia, “Emotion recognition from unimodal to
tional Linguistics, 2021, pp. 2879–2889. multimodal analysis: A review,” Inf. Fusion, vol. 99, p. 101847, 2023.
[24] W. Zheng, J. Yu, R. Xia, and S. Wang, “A facial expression-aware [41] L. Zhu, Z. Zhu, C. Zhang, Y. Xu, and X. Kong, “Multimodal sentiment
multimodal multi-task learning framework for emotion recognition in analysis based on fusion methods: A survey,” Inf. Fusion, vol. 95, pp.
multi-party conversations,” in Proceedings of the 61st Annual Meeting 306–325, 2023.
of the Association for Computational Linguistics (Volume 1: Long [42] T. Thongtan and T. Phienthrakul, “Sentiment classification using doc-
Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, 2023, pp. ument embeddings trained with cosine similarity,” in Proceedings of
15 445–15 459. the 57th Conference of the Association for Computational Linguistics,
[25] Z. Chen, L. Chen, B. Chen, L. Qin, Y. Liu, S. Zhu, J. Lou, and ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 2: Student
K. Yu, “Unidu: Towards A unified generative dialogue understanding Research Workshop, F. Alva-Manchego, E. Choi, and D. Khashabi, Eds.
framework,” CoRR, vol. abs/2204.04637, 2022. Association for Computational Linguistics, 2019, pp. 407–414.
[26] H. Luo, L. Ji, B. Shi, H. Huang, N. Duan, T. Li, X. Chen, and M. Zhou, [43] L. Morency, R. Mihalcea, and P. Doshi, “Towards multimodal sentiment
“Univilm: A unified video and language pre-training model for mul- analysis: harvesting opinions from the web,” in Proceedings of the
timodal understanding and generation,” CoRR, vol. abs/2002.06353, 13th International Conference on Multimodal Interfaces, ICMI 2011,
2020. Alicante, Spain, November 14-18, 2011, 2011, pp. 169–176.
[27] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, [44] A. G. A. and V. Vetriselvi, “Survey on multimodal approaches to
G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, emotion recognition,” Neurocomputing, vol. 556, p. 126693, 2023.
“Learning transferable visual models from natural language supervi- [45] Y. Sun, N. Yu, and G. Fu, “A discourse-aware graph neural network
sion,” in Proceedings of the 38th International Conference on Machine for emotion recognition in multi-party conversation,” in Findings of
Learning, ICML 2021, 18-24 July 2021, Virtual Event, 2021, pp. 8748– the Association for Computational Linguistics: EMNLP 2021, Virtual
8763. Event / Punta Cana, Dominican Republic, 16-20 November, 2021,
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 19

M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan,
Computational Linguistics, 2021, pp. 2949–2958. B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov,
[46] A. Joshi, A. Bhat, A. Jain, A. V. Singh, and A. Modi, “COGMEN: Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic,
contextualized GNN based multimodal emotion recognition,” CoRR, S. Edunov, and T. Scialom, “Llama 2: Open foundation and fine-tuned
vol. abs/2205.02455, 2022. chat models,” CoRR, vol. abs/2307.09288, 2023.
[47] N. Xu, W. Mao, and G. Chen, “Multi-interactive memory network for [60] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with
aspect based multimodal sentiment analysis,” in Proceedings of the selective state spaces,” CoRR, vol. abs/2312.00752, 2023.
AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. [61] Y. Gong, Y. Chung, and J. R. Glass, “AST: audio spectrogram
371–378. transformer,” in 22nd Annual Conference of the International Speech
[48] Z. Chen and T. Qian, “Transfer capsule network for aspect level Communication Association, Interspeech 2021, Brno, Czechia, August
sentiment classification,” in Proceedings of the 57th Conference of 30 - September 3, 2021, H. Hermansky, H. Cernocký, L. Burget,
the Association for Computational Linguistics, ACL 2019, Florence, L. Lamel, O. Scharenborg, and P. Motlı́cek, Eds. ISCA, 2021, pp.
Italy, July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen, 571–575.
D. R. Traum, and L. Màrquez, Eds. Association for Computational [62] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for
Linguistics, 2019, pp. 547–556. convolutional neural networks,” in Proceedings of the 36th Interna-
[49] H. Yan, J. Dai, T. Ji, X. Qiu, and Z. Zhang, “A unified generative tional Conference on Machine Learning, ICML 2019, 9-15 June 2019,
framework for aspect-based sentiment analysis,” in Proceedings of the Long Beach, California, USA, ser. Proceedings of Machine Learning
59th Annual Meeting of the Association for Computational Linguistics Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR,
and the 11th International Joint Conference on Natural Language 2019, pp. 6105–6114.
Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual [63] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli, Eds. T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly,
Association for Computational Linguistics, 2021, pp. 2416–2429. J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words:
[50] C. Li, F. Gao, J. Bu, L. Xu, X. Chen, Y. Gu, Z. Shao, Q. Zheng, Transformers for image recognition at scale,” in 9th International
N. Zhang, Y. Wang, and Z. Yu, “Sentiprompt: Sentiment knowledge Conference on Learning Representations, ICLR 2021, Virtual Event,
enhanced prompt-tuning for aspect-based sentiment analysis,” CoRR, Austria, May 3-7, 2021. OpenReview.net, 2021.
vol. abs/2109.08306, 2021. [64] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal,
[51] P. Yang, X. Sun, W. Li, S. Ma, W. Wu, and H. Wang, “SGM: sequence G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable
generation model for multi-label classification,” in Proceedings of the visual models from natural language supervision,” in International
27th International Conference on Computational Linguistics, COLING conference on machine learning, 2021, pp. 8748–8763.
2018, Santa Fe, New Mexico, USA, August 20-26, 2018, 2018, pp. [65] “Gpt-4v(ision) system card,” 2023. [Online]. Available: https:
3915–3926. //api.semanticscholar.org/CorpusID:263218031
[52] Q. Ma, C. Yuan, W. Zhou, and S. Hu, “Label-specific dual graph [66] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut,
neural network for multi-label text classification,” in Proceedings of the J. Schalkwyk, A. M. Dai, A. Hauth et al., “Gemini: a family of highly
59th Annual Meeting of the Association for Computational Linguistics capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.
and the 11th International Joint Conference on Natural Language [67] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson,
Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo:
Event, August 1-6, 2021, 2021, pp. 3855–3864. a visual language model for few-shot learning,” Advances in Neural
[53] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, Information Processing Systems, vol. 35, pp. 23 716–23 736, 2022.
“Distributed representations of words and phrases and their compo- [68] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-
sitionality,” in Advances in Neural Information Processing Systems 26: image pre-training with frozen image encoders and large language
27th Annual Conference on Neural Information Processing Systems models,” arXiv preprint arXiv:2301.12597, 2023.
2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, [69] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: En-
Nevada, United States, 2013, pp. 3111–3119. hancing vision-language understanding with advanced large language
[54] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors models,” arXiv preprint arXiv:2304.10592, 2023.
for word representation,” in Proceedings of the 2014 Conference on [70] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li,
Empirical Methods in Natural Language Processing, EMNLP 2014, P. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision-
October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special language models with instruction tuning,” 2023.
Interest Group of the ACL, 2014, pp. 1532–1543. [71] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,”
[55] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Advances in neural information processing systems, vol. 36, 2024.
Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” [72] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li,
in Advances in Neural Information Processing Systems 30: Annual X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned
Conference on Neural Information Processing Systems 2017, December language models,” Journal of Machine Learning Research, vol. 25,
4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, no. 70, pp. 1–53, 2024.
H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, Eds., [73] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L. Morency, “Tensor
2017, pp. 5998–6008. fusion network for multimodal sentiment analysis,” in Proceedings
[56] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, of the 2017 Conference on Empirical Methods in Natural Language
V. Stoyanov, and L. Zettlemoyer, “BART: denoising sequence-to- Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11,
sequence pre-training for natural language generation, translation, and 2017, M. Palmer, R. Hwa, and S. Riedel, Eds. Association for
comprehension,” in Proceedings of the 58th Annual Meeting of the Computational Linguistics, 2017, pp. 1103–1114.
Association for Computational Linguistics, ACL 2020, Online, July 5- [74] Y. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L. Morency, and
10, 2020, 2020, pp. 7871–7880. R. Salakhutdinov, “Multimodal transformer for unaligned multimodal
[57] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, language sequences,” in Proceedings of the 57th Conference of the
Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning Association for Computational Linguistics, ACL 2019, Florence, Italy,
with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen,
pp. 140:1–140:67, 2020. D. R. Traum, and L. Màrquez, Eds. Association for Computational
[58] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, Linguistics, 2019, pp. 6558–6569.
B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, [75] C. Chen, H. Hong, J. Guo, and B. Song, “Inter-intra modal representa-
E. Grave, and G. Lample, “Llama: Open and efficient foundation tion augmentation with trimodal collaborative disentanglement network
language models,” CoRR, vol. abs/2302.13971, 2023. for multimodal sentiment analysis,” IEEE ACM Trans. Audio Speech
[59] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, Lang. Process., vol. 31, pp. 1476–1488, 2023.
N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, [76] K. Yang, H. Xu, and K. Gao, “CM-BERT: cross-modal BERT for text-
C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, audio sentiment analysis,” in MM ’20: The 28th ACM International
J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, Conference on Multimedia, Virtual Event / Seattle, WA, USA, October
S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, 12-16, 2020, 2020, pp. 521–528.
I. Kloumann, A. Korenev, P. S. Koura, M. Lachaux, T. Lavril, J. Lee, [77] Z. Lin, B. Liang, Y. Long, Y. Dang, M. Yang, M. Zhang, and
D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, R. Xu, “Modeling intra- and inter-modal relations: Hierarchical graph
I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, contrastive learning for multimodal sentiment analysis,” in Proceedings
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 20

of the 29th International Conference on Computational Linguistics, timodal affective computing,” in Proceedings of the 57th Conference
COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, of the Association for Computational Linguistics, ACL 2019, Florence,
2022, pp. 7124–7135. Italy, July 28- August 2, 2019, Volume 1: Long Papers, 2019, pp. 481–
[78] J. Tang, D. Liu, X. Jin, Y. Peng, Q. Zhao, Y. Ding, and W. Kong, 492.
“BAFN: bi-direction attention based fusion network for multimodal [96] Z. Li, B. Xu, C. Zhu, and T. Zhao, “CLMLF: A contrastive learning
sentiment analysis,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, and multi-layer fusion method for multimodal sentiment detection,”
no. 4, pp. 1966–1978, 2023. in Findings of the Association for Computational Linguistics: NAACL
[79] C. Huang, J. Zhang, X. Wu, Y. Wang, M. Li, and X. Huang, “Tefna: 2022, Seattle, WA, United States, July 10-15, 2022, 2022, pp. 2282–
Text-centered fusion network with crossmodal attention for multimodal 2294.
sentiment analysis,” Knowl. Based Syst., vol. 269, p. 110502, 2023. [97] P. P. Liang, Z. Liu, A. Zadeh, and L. Morency, “Multimodal language
[80] X. Zhang and Y. Li, “A cross-modality context fusion and analysis with recurrent multistage fusion,” in Proceedings of the 2018
semantic refinement network for emotion recognition in conversation,” Conference on Empirical Methods in Natural Language Processing,
in Proceedings of the 61st Annual Meeting of the Association for Brussels, Belgium, October 31 - November 4, 2018, 2018, pp. 150–
Computational Linguistics (Volume 1: Long Papers). Toronto, Canada: 161.
Association for Computational Linguistics, Jul. 2023, pp. 13 099– [98] J. Tang, K. Li, X. Jin, A. Cichocki, Q. Zhao, and W. Kong, “CTFN:
13 110. [Online]. Available: https://aclanthology.org/2023.acl-long.732 hierarchical learning for multimodal sentiment analysis using coupled-
[81] T. Shi and S. Huang, “Multiemo: An attention-based correlation-aware translation fusion network,” in Proceedings of the 59th Annual Meeting
multimodal fusion framework for emotion recognition in conversa- of the Association for Computational Linguistics and the 11th Interna-
tions,” in Proceedings of the 61st Annual Meeting of the Association tional Joint Conference on Natural Language Processing, ACL/IJCNLP
for Computational Linguistics (Volume 1: Long Papers), ACL 2023, 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, 2021,
Toronto, Canada, July 9-14, 2023, 2023, pp. 14 752–14 766. pp. 5301–5311.
[82] X. Chen, “Mmrbn: Rule-based network for multimodal emotion recog- [99] Z. Li, Q. Guo, Y. Pan, W. Ding, J. Yu, Y. Zhang, W. Liu, H. Chen,
nition,” in ICASSP 2024-2024 IEEE International Conference on H. Wang, and Y. Xie, “Multi-level correlation mining framework with
Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. self-supervised label generation for multimodal sentiment analysis,” Inf.
8200–8204. Fusion, vol. 99, p. 101891, 2023.
[83] X. Li, J. Liu, Y. Xie, P. Gong, X. Zhang, and H. He, “Magdra: a multi- [100] J. Peng, T. Wu, W. Zhang, F. Cheng, S. Tan, F. Yi, and Y. Huang,
modal attention graph network with dynamic routing-by-agreement for “A fine-grained modal label-based multi-stage network for multimodal
multi-label emotion recognition,” Knowledge-Based Systems, vol. 283, sentiment analysis,” Expert Syst. Appl., vol. 221, p. 119721, 2023.
p. 111126, 2024. [101] H. Luo, L. Ji, Y. Huang, B. Wang, S. Ji, and T. Li, “Scalevlad:
[84] N. K. Devulapally, S. Anand, S. D. Bhattacharjee, J. Yuan, and Improving multimodal sentiment analysis via multi-scale fusion of
Y. Chang, “Amuse: Adaptive multimodal analysis for speaker emotion locally descriptors,” CoRR, vol. abs/2112.01368, 2021.
recognition in group conversations,” CoRR, vol. abs/2401.15164, 2024. [102] S. Mai, Y. Zhao, Y. Zeng, J. Yao, and H. Hu, “Meta-learn unimodal
[85] W. Han, H. Chen, and S. Poria, “Improving multimodal fusion with signals with weak supervision for multimodal sentiment analysis,”
hierarchical mutual information maximization for multimodal senti- arXiv preprint arXiv:2408.16029, 2024.
ment analysis,” in Proceedings of the 2021 Conference on Empirical
[103] S. Minglong, O. Chunping, L. Yongbin, and R. Lin, “Multimodal emo-
Methods in Natural Language Processing, EMNLP 2021, Virtual Event
tion recognition based on hierarchical fusion strategy and contextual
/ Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens,
information embedding,” Beijing Da Xue Xue Bao, vol. 60, no. 3, pp.
X. Huang, L. Specia, and S. W. Yih, Eds., pp. 9180–9192.
393–402, 2024.
[86] S. Mai, Y. Zeng, S. Zheng, and H. Hu, “Hybrid contrastive learning
[104] J. Yang, Y. Wang, R. Yi, Y. Zhu, A. Rehman, A. Zadeh, S. Poria, and
of tri-modal representation for multimodal sentiment analysis,” CoRR,
L. Morency, “MTAG: modal-temporal attention graph for unaligned
vol. abs/2109.01797, 2021.
human multimodal language sequences,” in Proceedings of the 2021
[87] J. Zheng, S. Zhang, X. Wang, and Z. Zeng, “Multimodal representa-
Conference of the North American Chapter of the Association for
tions learning based on mutual information maximization and mini-
Computational Linguistics: Human Language Technologies, NAACL-
mization and identity embedding for multimodal sentiment analysis,”
HLT 2021, Online, June 6-11, 2021, 2021, pp. 1009–1021.
arXiv preprint arXiv:2201.03969, 2022.
[88] D. Hazarika, R. Zimmermann, and S. Poria, “MISA: modality-invariant [105] N. Majumder, D. Hazarika, A. Gelbukh, E. Cambria, and S. Poria,
and -specific representations for multimodal sentiment analysis,” in “Multimodal sentiment analysis using hierarchical fusion with context
MM ’20: The 28th ACM International Conference on Multimedia, modeling,” Knowledge-based systems, vol. 161, pp. 124–133, 2018.
Virtual Event / Seattle, WA, USA, October 12-16, 2020, C. W. Chen, [106] J. Zhao, R. Li, and Q. Jin, “Missing modality imagination network for
R. Cucchiara, X. Hua, G. Qi, E. Ricci, Z. Zhang, and R. Zimmermann, emotion recognition with uncertain missing modalities,” in Proceedings
Eds. ACM, 2020, pp. 1122–1131. of the 59th Annual Meeting of the Association for Computational
[89] L. Xiao, X. Wu, S. Yang, J. Xu, J. Zhou, and L. He, “Cross-modal Linguistics and the 11th International Joint Conference on Natural
fine-grained alignment and fusion network for multimodal aspect-based Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers),
sentiment analysis,” Inf. Process. Manag., vol. 60, no. 6, p. 103508, Virtual Event, August 1-6, 2021, 2021, pp. 2608–2618.
2023. [107] S. Parthasarathy and S. Sundaram, “Training strategies to handle miss-
[90] Z. Zhang, Z. Wang, X. Li, N. Liu, B. Guo, and Z. Yu, “Modalnet: an ing modalities for audio-visual expression recognition,” in Compan-
aspect-level sentiment classification model by exploring multimodal ion Publication of the 2020 International Conference on Multimodal
data with fusion discriminant attentional network,” World Wide Web, Interaction, ICMI Companion 2020, Virtual Event, The Netherlands,
vol. 24, no. 6, pp. 1957–1974, 2021. October, 2020, 2020, pp. 400–404.
[91] Y. Zhang, M. Chen, J. Shen, and C. Wang, “Tailor versatile multi- [108] N. Wang, H. Cao, J. Zhao, R. Chen, D. Yan, and J. Zhang, “M2R2:
modal learning for multi-label emotion recognition,” in Proceedings of missing-modality robust emotion recognition framework with iterative
the AAAI Conference on Artificial Intelligence, vol. 36, no. 8, 2022, data augmentation,” IEEE Trans. Artif. Intell., vol. 4, no. 5, pp. 1305–
pp. 9100–9108. 1316, 2023.
[92] S. Ge, Z. Jiang, Z. Cheng, C. Wang, Y. Yin, and Q. Gu, “Learning [109] J. Zeng, J. Zhou, and T. Liu, “Mitigating inconsistencies in multimodal
robust multi-modal representation for multi-label emotion recognition sentiment analysis under uncertain missing modalities,” in Proceedings
via adversarial masking and perturbation,” in Proceedings of the ACM of the 2022 Conference on Empirical Methods in Natural Language
Web Conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023 - Processing, 2022, pp. 2924–2934.
4 May 2023, 2023, pp. 1510–1518. [110] Z. Yuan, W. Li, H. Xu, and W. Yu, “Transformer-based feature
[93] M. Sharafi, M. Yazdchi, R. Rasti, and F. Nasimi, “A novel spatio- reconstruction network for robust multimodal sentiment analysis,” in
temporal convolutional neural framework for multimodal emotion Proceedings of the 29th ACM International Conference on Multimedia,
recognition,” Biomed. Signal Process. Control., vol. 78, p. 103970, 2021, pp. 4400–4407.
2022. [111] W. Luo, M. Xu, and H. Lai, “Multimodal reconstruct and align net for
[94] Y. Li, W. Weng, and C. Liu, “Tscl-fhfn: two-stage contrastive learning missing modality problem in sentiment analysis,” in MultiMedia Mod-
and feature hierarchical fusion network for multimodal sentiment eling - 29th International Conference, MMM 2023, Bergen, Norway,
analysis,” Neural Computing and Applications, pp. 1–15, 2024. January 9-12, 2023, Proceedings, Part II, 2023, pp. 411–422.
[95] S. Mai, H. Hu, and S. Xing, “Divide, conquer and combine: Hierarchi- [112] C. Shang, A. Palmer, J. Sun, K. Chen, J. Lu, and J. Bi, “VIGAN:
cal feature fusion network with local and global perspectives for mul- missing view imputation with generative adversarial networks,” in 2017
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 21

IEEE International Conference on Big Data (IEEE BigData 2017), [131] C. Du, C. Du, H. Wang, J. Li, W.-L. Zheng, B.-L. Lu, and H. He,
Boston, MA, USA, December 11-14, 2017, 2017, pp. 766–775. “Semi-supervised deep generative modelling of incomplete multi-
[113] J. Zeng, T. Liu, and J. Zhou, “Tag-assisted multimodal sentiment modality emotional data,” in Proceedings of the 26th ACM interna-
analysis under uncertain missing modalities,” in SIGIR ’22: The 45th tional conference on Multimedia, 2018, pp. 108–116.
International ACM SIGIR Conference on Research and Development [132] Z. Wang, Z. Wan, and X. Wan, “Transmodality: An end2end fusion
in Information Retrieval, Madrid, Spain, July 11 - 15, 2022, 2022, pp. method with transformer for multimodal sentiment analysis,” in WWW
1545–1554. ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020,
[114] H. Zuo, R. Liu, J. Zhao, G. Gao, and H. Li, “Exploiting modality- 2020, pp. 2514–2520.
invariant feature for robust multimodal emotion recognition with miss- [133] F. Ma, S. Huang, and L. Zhang, “An efficient approach for audio-visual
ing modalities,” in IEEE International Conference on Acoustics, Speech emotion recognition with missing labels and missing modalities,” in
and Signal Processing ICASSP 2023, Rhodes Island, Greece, June 4- 2021 IEEE International Conference on Multimedia and Expo, ICME
10, 2023, 2023, pp. 1–5. 2021, Shenzhen, China, July 5-9, 2021, 2021, pp. 1–6.
[115] Z. Liu, B. Zhou, D. Chu, Y. Sun, and L. Meng, “Modality translation- [134] W. Yu, H. Xu, Z. Yuan, and J. Wu, “Learning modality-specific
based multimodal sentiment analysis under uncertain missing modali- representations with self-supervised multi-task learning for multimodal
ties,” Inf. Fusion, vol. 101, p. 101973, 2024. sentiment analysis,” in Thirty-Fifth AAAI Conference on Artificial
[116] T. Zhou, S. Canu, P. Vera, and S. Ruan, “Feature-enhanced generation Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Appli-
and multi-modality fusion based deep neural network for brain tumor cations of Artificial Intelligence, IAAI 2021, The Eleventh Symposium
segmentation with missing MR modalities,” Neurocomputing, vol. 466, on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual
pp. 102–112, 2021. Event, February 2-9, 2021, pp. 10 790–10 797.
[117] J. Vazquez-Rodriguez, G. Lefebvre, J. Cumin, and J. L. Crowley, [135] S. Mai, H. Hu, and S. Xing, “Modality to modality translation:
“Accommodating missing modalities in time-continuous multimodal An adversarial representation learning and graph fusion network for
emotion recognition,” in 11th International Conference on Affective multimodal fusion,” in The Thirty-Fourth AAAI Conference on Artificial
Computing and Intelligent Interaction, ACII 2023, Cambridge, MA, Intelligence, AAAI 2020, The Thirty-Second Innovative Applications
USA, September 10-13, 2023. IEEE, 2023, pp. 1–8. of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI
[118] H. Mao, B. Zhang, H. Xu, Z. Yuan, and Y. Liu, “Robust-msa: Symposium on Educational Advances in Artificial Intelligence, EAAI
Understanding the impact of modality noise on multimodal sentiment 2020, New York, NY, USA, February 7-12, 2020. AAAI Press, 2020,
analysis,” in Thirty-Seventh AAAI Conference on Artificial Intelligence, pp. 164–172.
AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Ar- [136] M. S. Akhtar, D. S. Chauhan, D. Ghosal, S. Poria, A. Ekbal, and
tificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational P. Bhattacharyya, “Multi-task learning for multi-modal emotion recog-
Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, nition and sentiment analysis,” in Proceedings of the 2019 Conference
February 7-14, 2023, 2023, pp. 16 458–16 460. of the North American Chapter of the Association for Computa-
[119] S. Lai, X. Hu, Y. Li, Z. Ren, Z. Liu, and D. Miao, “Shared and tional Linguistics: Human Language Technologies, NAACL-HLT 2019,
private information learning in multimodal sentiment analysis with Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short
deep modal alignment and self-supervised multi-task learning,” arXiv Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for
preprint arXiv:2305.08473, 2023. Computational Linguistics, 2019, pp. 370–379.
[120] X. Zhang, W. Cui, B. Hu, and Y. Li, “A multi-level alignment and cross- [137] R. Chen, W. Zhou, Y. Li, and H. Zhou, “Video-based cross-modal
modal unified semantic graph refinement network for conversational auxiliary network for multimodal sentiment analysis,” IEEE Trans.
emotion recognition,” IEEE Transactions on Affective Computing, Circuits Syst. Video Technol., vol. 32, no. 12, pp. 8703–8716, 2022.
2024. [138] Y. Zeng, W. Yan, S. Mai, and H. Hu, “Disentanglement translation
[121] E. Shutova, D. Kiela, and J. Maillard, “Black holes and white rabbits: network for multimodal sentiment analysis,” Inf. Fusion, vol. 102, p.
Metaphor identification with visual features,” in NAACL HLT 2016, The 102031, 2024.
2016 Conference of the North American Chapter of the Association for
[139] D. Sun, Y. He, and J. Han, “Using auxiliary tasks in multimodal fusion
Computational Linguistics: Human Language Technologies, San Diego
of wav2vec 2.0 and BERT for multimodal emotion recognition,” CoRR,
California, USA, June 12-17, 2016, 2016, pp. 160–170.
vol. abs/2302.13661, 2023.
[122] S. Moon, S. Kim, and H. Wang, “Multimodal transfer deep learning
for audio visual recognition,” CoRR, vol. abs/1412.3121, 2014. [140] Z. Zhao, Y. Wang, G. Shen, Y. Xu, and J. Zhang, “Tdfnet: Transformer-
[123] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, based deep-scale fusion network for multimodal emotion recognition,”
“Recent advances in the automatic recognition of audiovisual speech,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 31, pp. 3771–
Proceedings of the IEEE, vol. 91, no. 9, pp. 1306–1326, 2003. 3782, 2023.
[124] G. Evangelopoulos, A. Zlatintsi, A. Potamianos, P. Maragos, K. Ra- [141] M. Ren, X. Huang, J. Liu, M. Liu, X. Li, and A. Liu, “MALN:
pantzikos, G. Skoumas, and Y. Avrithis, “Multimodal saliency and multimodal adversarial learning network for conversational emotion
fusion for movie summarization based on aural, visual, and textual recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 11,
attention,” IEEE Trans. Multim., vol. 15, no. 7, pp. 1553–1568, 2013. pp. 6965–6980, 2023.
[125] M. Glodek, S. Tschechne, G. Layher, M. Schels, T. Brosch, S. Scherer, [142] F. Liu, S. Shen, Z. Fu, H. Wang, A. Zhou, and J. Qi, “LGCCT: A light
M. Kächele, M. Schmidt, H. Neumann, G. Palm, and F. Schwenker, gated and crossed complementation transformer for multimodal speech
“Multiple classifier systems for the classification of audio-visual emo- emotion recognition,” Entropy, vol. 24, no. 7, p. 1010, 2022.
tional states,” in Affective Computing and Intelligent Interaction - [143] C. Zhang, Y. Zhang, and B. Cheng, “Rl-emo: A reinforcement learning
Fourth International Conference, ACII 2011, Memphis, TN, USA, framework for multimodal emotion recognition,” in ICASSP 2024 -
October 9-12, 2011, Proceedings, Part II, 2011, pp. 359–368. 2024 IEEE International Conference on Acoustics, Speech and Signal
[126] S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, and L.- Processing (ICASSP), 2024, pp. 10 246–10 250.
P. Morency, “Context-dependent sentiment analysis in user-generated [144] L. Yang, J. Na, and J. Yu, “Cross-modal multitask transformer for
videos,” in Proceedings of the 55th annual meeting of the association end-to-end multimodal aspect-based sentiment analysis,” Inf. Process.
for computational linguistics (volume 1: Long papers), 2017, pp. 873– Manag., vol. 59, no. 5, p. 103038, 2022.
883. [145] R. Jain, A. Singh, V. K. Gangwar, and S. Saha, “Abcord: Exploit-
[127] A. I. Middya, B. Nag, and S. Roy, “Deep learning based multimodal ing multimodal generative approach for aspect-based complaint and
emotion recognition using model-level fusion of audio-visual modali- rationale detection,” in Proceedings of the 31st ACM International
ties,” Knowl. Based Syst., vol. 244, p. 108580, 2022. Conference on Multimedia, MM 2023, Ottawa, ON, Canada, 29
[128] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine October 2023- 3 November 2023, 2023, pp. 8571–8579.
learning: A survey and taxonomy,” IEEE transactions on pattern [146] X. Ju, D. Zhang, R. Xiao, J. Li, S. Li, M. Zhang, and G. Zhou,
analysis and machine intelligence, vol. 41, no. 2, pp. 423–443, 2018. “Joint multi-modal aspect-sentiment analysis with auxiliary cross-
[129] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A survey of affect modal relation detection,” in Proceedings of the 2021 Conference on
recognition methods: Audio, visual, and spontaneous expressions,” Empirical Methods in Natural Language Processing, EMNLP 2021,
IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 1, pp. 39–58, Virtual Event / Punta Cana, Dominican Republic, 7-11 November,
2009. 2021, 2021, pp. 4395–4405.
[130] N. Majumder, D. Hazarika, A. F. Gelbukh, E. Cambria, and S. Poria, [147] H. Lian, C. Lu, S. Li, Y. Zhao, C. Tang, Y. Zong, and W. Zheng,
“Multimodal sentiment analysis using hierarchical fusion with context “Label distribution adaptation for multimodal emotion recognition with
modeling,” Knowl. Based Syst., vol. 161, pp. 124–133, 2018. multi-label learning,” in Proceedings of the 1st International Workshop
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 22

on Multimodal and Responsible Affective Computing, MRAC 2023, [165] M. Chen and X. Li, “Swafn: Sentimental words aware fusion network
Ottawa, ON, Canada, 29 October 2023, 2023, pp. 51–58. for multimodal sentiment analysis,” in Proceedings of the 28th interna-
[148] “Aobert: All-modalities-in-one bert for multimodal sentiment analysis,” tional conference on computational linguistics, 2020, pp. 1067–1077.
Information Fusion, vol. 92, pp. 37–45, 2023. [166] Y. Fu, S. Okada, L. Wang, L. Guo, Y. Song, J. Liu, and J. Dang,
[149] F. Qian, J. Han, Y. He, T. Zheng, and G. Zheng, “Sentiment knowledge “Context- and knowledge-aware graph convolutional network for mul-
enhanced self-supervised learning for multimodal sentiment analysis,” timodal emotion recognition,” IEEE Multim., vol. 29, no. 3, pp. 91–100,
in Findings of the Association for Computational Linguistics: ACL 2022.
2023, 2023, pp. 12 966–12 978. [167] Y. Li, Y. Wang, and Z. Cui, “Decoupled multimodal distilling for
[150] M. Arjmand, M. J. Dousti, and H. Moradi, “TEASEL: A transformer- emotion recognition,” in IEEE/CVF Conference on Computer Vision
based speech-prefixed language model,” CoRR, vol. abs/2109.05522, and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June
2021. 17-24, 2023, 2023, pp. 6631–6640.
[168] X. Sun, H. He, H. Tang, K. Zeng, and T. Shen, “Multimodal rough
[151] J. Yu and J. Jiang, “Adapting BERT for target-oriented multimodal
set transformer for sentiment analysis and emotion recognition,” in 9th
sentiment classification,” in Proceedings of the Twenty-Eighth Interna-
IEEE International Conference on Cloud Computing and Intelligent
tional Joint Conference on Artificial Intelligence, IJCAI 2019, Macao,
Systems, CCIS 2023, Dali, China, August 12-13, 2023, 2023, pp. 250–
China, August 10-16, 2019, 2019, pp. 5408–5414.
259.
[152] J. Cheng, I. Fostiropoulos, B. W. Boehm, and M. Soleymani, “Mul- [169] P. Wang, S. Zeng, J. Chen, L. Fan, M. Chen, Y. Wu, and X. He,
timodal phased transformer for sentiment analysis,” in Proceedings “Leveraging label information for multimodal emotion recognition,”
of the 2021 Conference on Empirical Methods in Natural Language CoRR, vol. abs/2309.02106, 2023.
Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican [170] P. Yuan, G. Cai, M. Chen, and X. Tang, “Topics guided multimodal fu-
Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and sion network for conversational emotion recognition,” in International
S. W. Yih, Eds. Association for Computational Linguistics, 2021, pp. Conference on Intelligent Computing. Springer, 2024, pp. 250–262.
2447–2458. [171] F. Zhang, X. Li, C. P. Lim, Q. Hua, C. Dong, and J. Zhai, “Deep
[153] H. Zhang, Y. Wang, G. Yin, K. Liu, Y. Liu, and T. Yu, “Learn- emotional arousal network for multimodal sentiment analysis and
ing language-guided adaptive hyper-modality representation for mul- emotion recognition,” Inf. Fusion, vol. 88, pp. 296–304, 2022.
timodal sentiment analysis,” CoRR, vol. abs/2310.05804, 2023. [172] Z. Xu, Q. Su, and J. Xiao, “Multimodal aspect-based sentiment clas-
[154] J. Li, X. Wang, and Z. Zeng, “Tracing intricate cues in dialogue: sification with knowledge-injected transformer,” in IEEE International
Joint graph structure and sentiment dynamics for multimodal emotion Conference on Multimedia and Expo, ICME 2023, Brisbane, Australia,
recognition,” arXiv preprint arXiv:2407.21536, 2024. July 10-14, 2023, 2023, pp. 1379–1384.
[155] K. Liu, J. Wang, and X. Zhang, “Entity-related unsupervised pretraining [173] H. Yang, Y. Zhao, and B. Qin, “Face-sensitive image-to-emotional-
with visual prompts for multimodal aspect-based sentiment analysis,” in text cross-modal translation for multimodal aspect-based sentiment
Natural Language Processing and Chinese Computing - 12th National analysis,” in Proceedings of the 2022 Conference on Empirical Methods
CCF Conference, NLPCC 2023, Foshan, China, October 12-15, 2023, in Natural Language Processing, EMNLP 2022, Abu Dhabi, United
Proceedings, Part II, 2023, pp. 481–493. Arab Emirates, December 7-11, 2022, 2022, pp. 3324–3335.
[156] Y. Ling, J. Yu, and R. Xia, “Vision-language pre-training for multi- [174] L. Xiao, X. Wu, S. Yang, J. Xu, J. Zhou, and L. He, “Cross-modal
modal aspect-based sentiment analysis,” in Proceedings of the 60th fine-grained alignment and fusion network for multimodal aspect-based
Annual Meeting of the Association for Computational Linguistics sentiment analysis,” Information Processing & Management, vol. 60,
(Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, no. 6, p. 103508, 2023.
2022, pp. 2149–2159. [175] J. Yu, K. Chen, and R. Xia, “Hierarchical interactive multimodal
[157] K. Zhang, K. Zhang, M. Zhang, H. Zhao, Q. Liu, W. Wu, and E. Chen, transformer for aspect-based multimodal sentiment analysis,” IEEE
“Incorporating dynamic semantics into pre-trained language model for Transactions on Affective Computing, 2022.
aspect-based sentiment analysis,” in Findings of the Association for [176] W. Zheng, J. Yu, and R. Xia, “A unimodal valence-arousal driven
Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, contrastive learning framework for multimodal multi-label emotion
2022, 2022, pp. 3599–3610. recognition,” in ACM Multimedia 2024.
[158] Z. Yu, J. Wang, L. Yu, and X. Zhang, “Dual-encoder transformers [177] C. Peng, K. Chen, L. Shou, and G. Chen, “Carat: Contrastive feature
with cross-modal alignment for multimodal aspect-based sentiment reconstruction and aggregation for multi-modal multi-label emotion
analysis,” in Proceedings of the 2nd Conference of the Asia-Pacific recognition,” in Proceedings of the AAAI Conference on Artificial
Chapter of the Association for Computational Linguistics and the Intelligence, vol. 38, no. 13, 2024, pp. 14 581–14 589.
12th International Joint Conference on Natural Language Processing, [178] J. Zhao, Y. Zhao, and J. Li, “M3tr: Multi-modal multi-label recognition
AACL/IJCNLP 2022 - Volume 1: Long Papers, Online Only, November with transformer,” in Proceedings of the 29th ACM international
20-23, 2022, 2022, pp. 414–423. conference on multimedia, 2021, pp. 469–477.
[159] H. Jin, J. Tan, L. Liu, L. Qiu, S. Yao, X. Chen, and X. Zeng, “MSRA: [179] D. Zhang, X. Ju, W. Zhang, J. Li, S. Li, Q. Zhu, and G. Zhou, “Multi-
A multi-aspect semantic relevance approach for e-commerce via mul- modal multi-label emotion recognition with heterogeneous hierarchical
timodal pre-training,” in Proceedings of the 32nd ACM International message passing,” in Proceedings of the AAAI Conference on Artificial
Conference on Information and Knowledge Management, CIKM 2023, Intelligence, vol. 35, no. 16, 2021, pp. 14 338–14 346.
Birmingham, United Kingdom, October 21-25, 2023, 2023, pp. 3988– [180] D. S. Chauhan, M. S. Akhtar, A. Ekbal, and P. Bhattacharyya, “Context-
3992. aware interactive attention for multi-modal sentiment and emotion
analysis,” in Proceedings of the 2019 Conference on Empirical Meth-
[160] Q. Wang, H. Xu, Z. Wen, B. Liang, M. Yang, B. Qin, and R. Xu,
ods in Natural Language Processing and the 9th International Joint
“Image-to-text conversion and aspect-oriented filtration for multimodal
Conference on Natural Language Processing, EMNLP-IJCNLP 2019,
aspect-based sentiment analysis,” IEEE Transactions on Affective Com-
Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and
puting, no. 01, pp. 1–15, 2023.
X. Wan, Eds. Association for Computational Linguistics, 2019, pp.
[161] C. Wang, Y. Luo, C. Meng, and F. Yuan, “An adaptive dual graph 5646–5656.
convolution fusion network for aspect-based sentiment analysis,” ACM [181] S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, and
Transactions on Asian and Low-Resource Language Information Pro- L. Morency, “Multi-level multiple attentions for contextual multimodal
cessing, 2024. sentiment analysis,” in 2017 IEEE International Conference on Data
[162] J. Mu, F. Nie, W. Wang, J. Xu, J. Zhang, and H. Liu, “Mocolnet: Mining, ICDM 2017, New Orleans, LA, USA, November 18-21, 2017,
A momentum contrastive learning network for multimodal aspect- 2017, pp. 1033–1038.
level sentiment analysis,” IEEE Transactions on Knowledge and Data [182] M. Huang, C. Qing, J. Tan, and X. Xu, “Context-based adaptive multi-
Engineering, 2023. modal fusion network for continuous frame-level sentiment prediction,”
[163] D. Wang, X. Guo, Y. Tian, J. Liu, L. He, and X. Luo, “TETFN: A IEEE ACM Trans. Audio Speech Lang. Process., vol. 31, pp. 3468–
text enhanced transformer fusion network for multimodal sentiment 3477, 2023.
analysis,” Pattern Recognit., vol. 136, p. 109259, 2023. [183] Z. Li, Y. Sun, L. Zhang, and J. Tang, “Ctnet: Context-based tandem
[164] C. Zhu, M. Chen, S. Zhang, C. Sun, H. Liang, Y. Liu, and J. Chen, network for semantic segmentation,” IEEE Trans. Pattern Anal. Mach.
“SKEAFN: sentiment knowledge enhanced attention fusion network Intell., vol. 44, no. 12, pp. 9904–9917, 2022.
for multimodal sentiment analysis,” Inf. Fusion, vol. 100, p. 101958, [184] X. Sun, X. Ren, and X. Xie, “A novel multimodal sentiment analysis
2023. model based on gated fusion and multi-task learning,” in ICASSP 2024-
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 23

2024 IEEE International Conference on Acoustics, Speech and Signal Event, August 1-6, 2021. Association for Computational Linguistics,
Processing (ICASSP). IEEE, 2024, pp. 8336–8340. 2021.
[185] J. Hu, Y. Liu, J. Zhao, and Q. Jin, “MMGCN: multimodal fusion via [203] F. Wang, S. Tian, L. Yu, J. Liu, J. Wang, K. Li, and Y. Wang,
deep graph convolution network for emotion recognition in conversa- “TEDT: transformer-based encoding-decoding translation network for
tion,” in Proceedings of the 59th Annual Meeting of the Association for multimodal sentiment analysis,” Cogn. Comput., vol. 15, no. 1, pp.
Computational Linguistics and the 11th International Joint Conference 289–303, 2023.
on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long [204] Z. Yu, J. Wang, L.-C. Yu, and X. Zhang, “Dual-encoder transformers
Papers), Virtual Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and with cross-modal alignment for multimodal aspect-based sentiment
R. Navigli, Eds. Association for Computational Linguistics, 2021, pp. analysis,” in Proceedings of the 2nd Conference of the Asia-Pacific
5666–5675. Chapter of the Association for Computational Linguistics and the
[186] D. Hu, X. Hou, L. Wei, L. Jiang, and Y. Mo, “MM-DFN: multimodal 12th International Joint Conference on Natural Language Processing
dynamic fusion network for emotion recognition in conversations,” (Volume 1: Long Papers), 2022, pp. 414–423.
in IEEE International Conference on Acoustics, Speech and Signal [205] Y. Ge, D. Chen, and H. Li, “Mutual mean-teaching: Pseudo label re-
Processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022, finery for unsupervised domain adaptation on person re-identification,”
2022, pp. 7037–7041. arXiv preprint arXiv:2001.01526, 2020.
[187] D. Zhang, F. Chen, J. Chang, X. Chen, and Q. Tian, “Structure
[206] H. Pham, Z. Dai, Q. Xie, and Q. V. Le, “Meta pseudo labels,” in IEEE
aware multi-graph network for multi-modal emotion recognition in
Conference on Computer Vision and Pattern Recognition, CVPR 2021,
conversations,” IEEE Trans. Multim., vol. 26, pp. 3987–3997, 2024.
virtual, June 19-25, 2021. Computer Vision Foundation / IEEE, 2021,
[188] F. Chen, J. Shao, S. Zhu, and H. T. Shen, “Multivariate, multi-frequency
pp. 11 557–11 568.
and multimodal: Rethinking graph neural networks for emotion recog-
nition in conversation,” in IEEE/CVF Conference on Computer Vision [207] Y. Zhang, M. Zhang, S. Wu, and J. Zhao, “Towards unifying the label
and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June space for aspect- and sentence-based sentiment analysis,” in Findings
17-24, 2023, 2023, pp. 10 761–10 770. of the Association for Computational Linguistics: ACL 2022, Dublin,
[189] B. Yao and W. Shi, “Speaker-centric multimodal fusion networks for Ireland, May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio,
emotion recognition in conversations,” in ICASSP 2024 - 2024 IEEE Eds. Association for Computational Linguistics, 2022, pp. 20–30.
International Conference on Acoustics, Speech and Signal Processing [208] Y. Zhang, J. Wang, Y. Liu, L. Rong, Q. Zheng, D. Song, P. Tiwari, and
(ICASSP), 2024, pp. 8441–8445. J. Qin, “A multitask learning model for multimodal sarcasm, sentiment
[190] Z. Li, F. Tang, M. Zhao, and Y. Zhu, “Emocaps: Emotion capsule and emotion recognition in conversations,” Inf. Fusion, vol. 93, pp.
based model for conversational emotion recognition,” arXiv preprint 282–301, 2023.
arXiv:2203.13504, 2022. [209] M. S. Akhtar, D. S. Chauhan, D. Ghosal, S. Poria, A. Ekbal, and
[191] J. Li, X. Wang, G. Lv, and Z. Zeng, “GA2MIF: graph and attention P. Bhattacharyya, “Multi-task learning for multi-modal emotion recog-
based two-stage multi-source information fusion for conversational nition and sentiment analysis,” arXiv preprint arXiv:1905.05812, 2019.
emotion detection,” IEEE Trans. Affect. Comput., vol. 15, no. 1, pp. [210] T. Xie, C. H. Wu, P. Shi, R. Zhong, T. Scholak, M. Yasunaga, C. Wu,
130–143, 2024. M. Zhong, P. Yin, S. I. Wang, V. Zhong, B. Wang, C. Li, C. Boyle,
[192] C. Xu, X. Luo, and D. Wang, “MCPR: A chinese product review A. Ni, Z. Yao, D. R. Radev, C. Xiong, L. Kong, R. Zhang, N. A. Smith,
dataset for multimodal aspect-based sentiment analysis,” in Cognitive L. Zettlemoyer, and T. Yu, “Unifiedskg: Unifying and multi-tasking
Computing - ICCC 2022 - 6th International Conference, Held as Part structured knowledge grounding with text-to-text language models,”
of the Services Conference Federation, SCF 2022, Honolulu, HI, USA, CoRR, vol. abs/2201.05966, 2022.
December 10-14, 2022, Proceedings, 2022, pp. 83–90. [211] J. Lu, D. Batra, D. Parikh, and S. Lee, “Vilbert: Pretraining task-
[193] M. Anschütz, T. Eder, and G. Groh, “Retrieving users’ opinions on agnostic visiolinguistic representations for vision-and-language tasks,”
social media with multimodal aspect-based sentiment analysis,” in 17th in Advances in Neural Information Processing Systems 32: Annual
IEEE International Conference on Semantic Computing, ICSC 2023, Conference on Neural Information Processing Systems 2019, NeurIPS
Laguna Hills, CA, USA, February 1-3, 2023, 2023, pp. 1–8. 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach,
[194] F. Zhao, C. Li, Z. Wu, Y. Ouyang, J. Zhang, and X. Dai, “M2DF: H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Gar-
multi-grained multi-curriculum denoising framework for multimodal nett, Eds., 2019, pp. 13–23.
aspect-based sentiment analysis,” CoRR, vol. abs/2310.14605, 2023. [212] W. Wang, H. Bao, L. Dong, and F. Wei, “Vlmo: Unified vision-
[195] R. Zhou, W. Guo, X. Liu, S. Yu, Y. Zhang, and X. Yuan, “Aom: language pre-training with mixture-of-modality-experts,” CoRR, vol.
Detecting aspect-oriented information for multimodal aspect-based abs/2111.02358, 2021.
sentiment analysis,” in Findings of the Association for Computational [213] Z. Zhang, X. Meng, Y. Wang, X. Jiang, Q. Liu, and Z. Yang, “Unims:
Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, 2023, pp. A unified framework for multimodal summarization with knowledge
8184–8196. distillation,” in Thirty-Sixth AAAI Conference on Artificial Intelligence,
[196] J. Zhao and F. Yang, “Fusion with gcn and se-resnext network for AAAI 2022, Thirty-Fourth Conference on Innovative Applications of
aspect based multimodal sentiment analysis,” in 2023 IEEE 6th In- Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Edu-
formation Technology, Networking, Electronic and Automation Control cational Advances in Artificial Intelligence, EAAI 2022 Virtual Event,
Conference (ITNEC), vol. 6, 2023, pp. 336–340. February 22 - March 1, 2022, 2022, pp. 11 757–11 764.
[197] D. Zhang, X. Ju, J. Li, S. Li, Q. Zhu, and G. Zhou, “Multi-modal [214] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis,
multi-label emotion detection with modality and label dependence,” in L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized BERT
Proceedings of the 2020 Conference on Empirical Methods in Natural pretraining approach,” CoRR, vol. abs/1907.11692, 2019.
Language Processing, EMNLP 2020, Online, November 16-20, 2020,
[215] S. Qiu, N. Sekhar, and P. Singhal, “Topic and style-aware transformer
2020, pp. 3584–3593.
for multimodal emotion recognition,” in Findings of the Association
[198] X. Ju, D. Zhang, J. Li, and G. Zhou, “Transformer-based label set
for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-
generation for multi-modal multi-label emotion detection,” in MM ’20:
14, 2023, 2023, pp. 2074–2082.
The 28th ACM International Conference on Multimedia, Virtual Event
/ Seattle, WA, USA, October 12-16, 2020, 2020, pp. 512–520. [216] C. Xi, G. Lu, and J. Yan, “Multimodal sentiment analysis based on
[199] J. Zhao, T. Zhang, J. Hu, Y. Liu, Q. Jin, X. Wang, and H. Li, “M3ED: multi-head attention mechanism,” in Proceedings of the 4th interna-
multi-modal multi-scene multi-label emotional dialogue database,” in tional conference on machine learning and soft computing, 2020, pp.
ACL 2022, 2022, pp. 5699–5710. 34–39.
[200] H. Luo, L. Ji, Y. Huang, B. Wang, S. Ji, and T. Li, “Scalevlad: [217] Y. Zhang, D. Song, P. Zhang, P. Wang, J. Li, X. Li, and B. Wang, “A
Improving multimodal sentiment analysis via multi-scale fusion of quantum-inspired multimodal sentiment analysis framework,” Theoret-
locally descriptors,” arXiv preprint arXiv:2112.01368, 2021. ical Computer Science, vol. 752, pp. 21–40, 2018.
[201] Y. Lee, S. Yoon, and K. Jung, “Multimodal speech emotion recog- [218] A. Metallinou, M. Wollmer, A. Katsamanis, F. Eyben, B. Schuller, and
nition using cross attention with aligned audio and text,” CoRR, vol. S. Narayanan, “Context-sensitive learning for enhanced audiovisual
abs/2207.12895, 2022. emotion classification,” IEEE Transactions on Affective Computing,
[202] C. Zong, F. Xia, W. Li, and R. Navigli, Eds., Proceedings of the vol. 3, no. 2, pp. 184–198, 2012.
59th Annual Meeting of the Association for Computational Linguistics [219] Y. Li, K. Zhang, J. Wang, and X. Gao, “A cognitive brain model for
and the 11th International Joint Conference on Natural Language multimodal sentiment analysis based on attention neural networks,”
Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Neurocomputing, vol. 430, pp. 159–173, 2021.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 24

[220] I. Chaturvedi, R. Satapathy, S. Cavallari, and E. Cambria, “Fuzzy [237] M. Wöllmer, F. Weninger, T. Knaup, B. W. Schuller, C. Sun, K. Sagae,
commonsense reasoning for multimodal sentiment analysis,” Pattern and L. Morency, “Youtube movie reviews: Sentiment analysis in an
Recognition Letters, vol. 125, pp. 264–270, 2019. audio-visual context,” IEEE Intell. Syst., vol. 28, no. 3, pp. 46–53,
[221] W. Wu, Y. Wang, S. Xu, and K. Yan, “Sfnn: semantic features 2013.
fusion neural network for multimodal sentiment analysis,” in 2020 [238] Y. Liu, Z. Yuan, H. Mao, Z. Liang, W. Yang, Y. Qiu, T. Cheng, X. Li,
5th International Conference on Automation, Control and Robotics H. Xu, and K. Gao, “Make acoustic and visual cues matter: Ch-sims
Engineering (CACRE), 2020, pp. 661–665. v2. 0 dataset and av-mixup consistent module,” in Proceedings of the
[222] W. Zheng, J. Yu, R. Xia, and S. Wang, “A facial expression-aware 2022 International Conference on Multimodal Interaction, 2022, pp.
multimodal multi-task learning framework for emotion recognition in 247–258.
multi-party conversations,” in Proceedings of the 61st Annual Meeting [239] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and
of the Association for Computational Linguistics (Volume 1: Long R. Mihalcea, “MELD: A multimodal multi-party dataset for emotion
Papers), 2023, pp. 15 445–15 459. recognition in conversations,” in Proceedings of the 57th Conference
[223] H. Ma, J. Wang, H. Lin, B. Zhang, Y. Zhang, and B. Xu, “A of the Association for Computational Linguistics, ACL 2019, Florence,
transformer-based model with self-distillation for multimodal emotion Italy, July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen,
recognition in conversations,” CoRR, vol. abs/2310.20494, 2023. D. R. Traum, and L. Màrquez, Eds., pp. 527–536.
[224] Y. Wang, Y. Li, P. Bell, and C. Lai, “Cross-attention is not enough: [240] C. Busso, M. Bulut, C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N.
Incongruity-aware multimodal sentiment analysis and emotion recog- Chang, S. Lee, and S. S. Narayanan, “IEMOCAP: interactive emotional
nition,” CoRR, vol. abs/2305.13583, 2023. dyadic motion capture database,” Lang. Resour. Evaluation, vol. 42,
[225] Y. Zhang, A. Jia, B. Wang, P. Zhang, D. Zhao, P. Li, Y. Hou, X. Jin, no. 4, pp. 335–359, 2008.
D. Song, and J. Qin, “M3GAT: A multi-modal, multi-task interactive [241] Z. Fang, A. He, Q. Yu, B. Gao, W. Ding, T. Zhang, and L. Ma, “FAF: A
graph attention network for conversational sentiment analysis and novel multimodal emotion recognition approach integrating face, body
emotion recognition,” ACM Trans. Inf. Syst., vol. 42, no. 1, pp. 13:1– and text,” CoRR, vol. abs/2211.15425, 2022.
13:32, 2024. [242] Y. Wang and L. Guan, “Recognizing human emotional state from
[226] Y.-P. Ruan, S. Han, T. Li, and Y. Wu, “Fusing modality-specific audiovisual signals,” IEEE Trans. Multim., vol. 10, no. 4, pp. 659–
representations and decisions for multimodal emotion recognition,” 668, 2008.
in ICASSP 2024-2024 IEEE International Conference on Acoustics, [243] S. Zhalehpour, O. Onder, Z. Akhtar, and C. E. Erdem, “BAUM-1: A
Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 7925–7929. spontaneous audio-visual face database of affective and mental states,”
[227] F. Zhao, C. Li, Z. Wu, Y. Ouyang, J. Zhang, and X. Dai, “M2DF: multi- IEEE Trans. Affect. Comput., vol. 8, no. 3, pp. 300–313, 2017.
grained multi-curriculum denoising framework for multimodal aspect- [244] M. Soleymani, J. Lichtenauer, T. Pun, and M. Pantic, “A multimodal
based sentiment analysis,” in Proceedings of the 2023 Conference on database for affect recognition and implicit tagging,” IEEE Trans.
Empirical Methods in Natural Language Processing, EMNLP 2023, Affect. Comput., vol. 3, no. 1, pp. 42–55, 2012.
Singapore, December 6-10, 2023, 2023, pp. 9057–9070. [245] S. Koelstra, C. Muhl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi,
[228] Y. Zhang, M. Chen, J. Shen, and C. Wang, “Tailor versatile multi- T. Pun, A. Nijholt, and I. Patras, “Deap: A database for emotion
modal learning for multi-label emotion recognition,” in Thirty-Sixth analysis; using physiological signals,” IEEE transactions on affective
AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth computing, vol. 3, no. 1, pp. 18–31, 2011.
Conference on Innovative Applications of Artificial Intelligence, IAAI [246] L. Stappen, A. Baird, L. Schumann, and B. W. Schuller, “The multi-
2022, The Twelveth Symposium on Educational Advances in Artificial modal sentiment analysis in car reviews (muse-car) dataset: Collection,
Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, insights and improvements,” IEEE Trans. Affect. Comput., vol. 14,
2022, pp. 9100–9108. no. 2, pp. 1334–1350, 2023.
[229] D. Zhang, X. Ju, W. Zhang, J. Li, S. Li, Q. Zhu, and G. Zhou, “Multi- [247] Y. Li, J. Tao, B. Schuller, S. Shan, D. Jiang, and J. Jia, “Mec
modal multi-label emotion recognition with heterogeneous hierarchical 2017: Multimodal emotion recognition challenge,” in 2018 First Asian
message passing,” in Thirty-Fifth AAAI Conference on Artificial Intelli- Conference on Affective Computing and Intelligent Interaction (ACII
gence, AAAI 2021, Thirty-Third Conference on Innovative Applications Asia), 2018, pp. 1–5.
of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on [248] C. Busso, S. Parthasarathy, A. Burmania, M. AbdelWahab,
Educational Advances in Artificial Intelligence, EAAI 2021, Virtual N. Sadoughi, and E. M. Provost, “Msp-improv: An acted corpus of
Event, February 2-9, 2021, 2021, pp. 14 338–14 346. dyadic interactions to study emotion perception,” IEEE Transactions
[230] J. Zhao, Y. Zhao, and J. Li, “M3TR: multi-modal multi-label recog- on Affective Computing, vol. 8, no. 1, pp. 67–80, 2017.
nition with transformer,” in MM ’21: ACM Multimedia Conference, [249] M. Firdaus, H. Chauhan, A. Ekbal, and P. Bhattacharyya, “Meisd:
Virtual Event, China, October 20 - 24, 2021, 2021, pp. 469–477. A multimodal multi-label emotion, intensity and sentiment dialogue
[231] Z. Zhang, Z. Wang, X. Li, N. Liu, B. Guo, and Z. Yu, “Modalnet: an dataset for emotion recognition and sentiment analysis in conversa-
aspect-level sentiment classification model by exploring multimodal tions,” in Proceedings of the 28th international conference on compu-
data with fusion discriminant attentional network,” World Wide Web, tational linguistics, 2020, pp. 4441–4453.
vol. 24, pp. 1957–1974, 2021. [250] A. Jia, Y. He, Y. Zhang, S. Uprety, D. Song, and C. Lioma, “Beyond
[232] J. Yang, Y. Xiao, and X. Du, “Multi-grained fusion network emotion: A multi-modal dataset for human desire understanding,” in
with self-distillation for aspect-based multimodal sentiment analysis,” Proceedings of the 2022 Conference of the North American Chapter
Knowledge-Based Systems, vol. 293, p. 111724, 2024. of the Association for Computational Linguistics: Human Language
[233] A. Zadeh, R. Zellers, E. Pincus, and L. Morency, “Multimodal senti- Technologies, NAACL 2022, Seattle, WA, United States, July 10-15,
ment intensity analysis in videos: Facial gestures and verbal messages,” 2022, 2022, pp. 1512–1522.
IEEE Intell. Syst., vol. 31, no. 6, pp. 82–88, 2016. [251] L. Stappen, A. Baird, L. Christ, L. Schumann, B. Sertolli, E. Meßner,
[234] A. Zadeh, P. P. Liang, S. Poria, E. Cambria, and L. Morency, E. Cambria, G. Zhao, and B. W. Schuller, “The muse 2021 multi-
“Multimodal language analysis in the wild: CMU-MOSEI dataset modal sentiment analysis challenge: Sentiment, emotion, physiological-
and interpretable dynamic fusion graph,” in Proceedings of the 56th emotion, and stress,” in MuSe ’21: Proceedings of the 2nd on Multi-
Annual Meeting of the Association for Computational Linguistics, ACL modal Sentiment Analysis Challenge, Virtual Event, China, 24 October
2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, 2021, 2021, pp. 5–14.
I. Gurevych and Y. Miyao, Eds. Association for Computational [252] J. Sun, S. Han, Y.-P. Ruan, X. Zhang, S.-K. Zheng, Y. Liu, Y. Huang,
Linguistics, 2018, pp. 2236–2246. and T. Li, “Layer-wise fusion with modality independence modeling for
[235] W. Yu, H. Xu, F. Meng, Y. Zhu, Y. Ma, J. Wu, J. Zou, and K. Yang, multi-modal emotion recognition,” in Proceedings of the 61st Annual
“Ch-sims: A chinese multimodal sentiment analysis dataset with fine- Meeting of the Association for Computational Linguistics (Volume 1:
grained annotation of modality,” in Proceedings of the 58th annual Long Papers), 2023, pp. 658–670.
meeting of the association for computational linguistics, 2020, pp. [253] J. A. M. Correa, M. K. Abadi, N. Sebe, and I. Patras, “AMIGOS: A
3718–3727. dataset for affect, personality and mood research on individuals and
[236] A. Zadeh, Y. S. Cao, S. Hessner, P. P. Liang, S. Poria, and L.-P. groups,” IEEE Trans. Affect. Comput., vol. 12, no. 2, pp. 479–493,
Morency, “Cmu-moseas: A multimodal language dataset for spanish, 2021.
portuguese, german and french,” in Proceedings of the Conference on [254] Q. Zhang, J. Fu, X. Liu, and X. Huang, “Adaptive co-attention network
Empirical Methods in Natural Language Processing. Conference on for named entity recognition in tweets,” in Proceedings of the Thirty-
Empirical Methods in Natural Language Processing, vol. 2020, 2020, Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th
p. 1801. innovative Applications of Artificial Intelligence (IAAI-18), and the 8th
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 25

AAAI Symposium on Educational Advances in Artificial Intelligence [277] K. Ma, H. Huang, J. Chen, H. Chen, P. Ji, X. Zang, H. Fang,
(EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, 2018, C. Ban, H. Sun, M. Chen et al., “Beyond uncertainty: Evidential
pp. 5674–5681. deep learning for robust video temporal grounding,” arXiv preprint
[255] C. Xu, X. Luo, and D. Wang, “Mcpr: A chinese product review dataset arXiv:2408.16272, 2024.
for multimodal aspect-based sentiment analysis,” in International Con- [278] J. Gao, M. Chen, and C. Xu, “Vectorized evidential learning for
ference on Cognitive Computing. Springer, 2022, pp. 83–90. weakly-supervised temporal action localization,” IEEE Transactions on
[256] H. Yang, Y. Zhao, J. Liu, Y. Wu, and B. Qin, “MACSA: A multi- Pattern Analysis and Machine Intelligence, vol. 45, no. 12, pp. 15 949–
modal aspect-category sentiment analysis dataset with multimodal fine- 15 963, 2023.
grained aligned annotations,” CoRR, vol. abs/2206.13969, 2022. [279] Z. Aldeneh and E. M. Provost, “Using regional saliency for speech
[257] J. Zhou, J. Zhao, J. X. Huang, Q. V. Hu, and L. He, “Masad: A emotion recognition,” in 2017 IEEE International Conference on
large-scale dataset for multimodal aspect-based sentiment analysis,” Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans,
Neurocomputing, vol. 455, pp. 47–58, 2021. LA, USA, March 5-9, 2017. IEEE, 2017, pp. 2741–2745.
[258] M. Luo, H. Fei, B. Li, S. Wu, Q. Liu, S. Poria, E. Cambria, M.-L. [280] P. Li, Y. Song, I. McLoughlin, W. Guo, and L. Dai, “An attention
Lee, and W. Hsu, “Panosent: A panoptic sextuple extraction benchmark pooling based representation learning method for speech emotion
for multimodal conversational aspect-based sentiment analysis,” arXiv recognition,” in 19th Annual Conference of the International Speech
preprint arXiv:2408.09481, 2024. Communication Association, Interspeech 2018, Hyderabad, India,
[259] D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang, “Large-scale September 2-6, 2018, B. Yegnanarayana, Ed. ISCA, 2018, pp. 3087–
visual sentiment ontology and detectors using adjective noun pairs,” in 3091.
Proceedings of the 21st ACM international conference on Multimedia,
[281] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou,
2013, pp. 223–232.
B. W. Schuller, and S. Zafeiriou, “Adieu features? end-to-end speech
[260] S. Li and W. Deng, “Deep facial expression recognition: A survey,”
emotion recognition using a deep convolutional recurrent network,”
IEEE transactions on affective computing, vol. 13, no. 3, pp. 1195–
in 2016 IEEE International Conference on Acoustics, Speech and
1215, 2020.
Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016.
[261] Y. Zhang, C. Wang, and W. Deng, “Relative uncertainty learning
IEEE, 2016, pp. 5200–5204.
for facial expression recognition,” Advances in Neural Information
Processing Systems, vol. 34, pp. 17 616–17 627, 2021. [282] S. Mirsamadi, E. Barsoum, and C. Zhang, “Automatic speech emotion
[262] J. Li, K. Jin, D. Zhou, N. Kubota, and Z. Ju, “Attention mechanism- recognition using recurrent neural networks with local attention,” in
based cnn for facial expression recognition,” Neurocomputing, vol. 411, 2017 IEEE International Conference on Acoustics, Speech and Signal
pp. 340–350, 2020. Processing, ICASSP 2017, New Orleans, LA, USA, March 5-9, 2017.
[263] A. H. Farzaneh and X. Qi, “Facial expression recognition in the wild IEEE, 2017, pp. 2227–2231.
via deep attentive center loss,” in Proceedings of the IEEE/CVF winter [283] Z. Zhao, Y. Zheng, Z. Zhang, H. Wang, Y. Zhao, and C. Li, “Ex-
conference on applications of computer vision, 2021, pp. 2402–2411. ploring spatio-temporal representations by integrating attention-based
[264] H. Wang, B. Li, S. Wu, S. Shen, F. Liu, S. Ding, and A. Zhou, bidirectional-lstm-rnns and fcns for speech emotion recognition,” in
“Rethinking the learning paradigm for dynamic facial expression 19th Annual Conference of the International Speech Communication
recognition,” in Proceedings of the IEEE/CVF conference on computer Association, Interspeech 2018, Hyderabad, India, September 2-6, 2018,
vision and pattern recognition, 2023, pp. 17 958–17 968. B. Yegnanarayana, Ed. ISCA, 2018, pp. 272–276.
[265] Z. Zhao and Q. Liu, “Former-dfer: Dynamic facial expression recog- [284] D. Luo, Y. Zou, and D. Huang, “Investigation on joint representation
nition transformer,” in Proceedings of the 29th ACM International learning for robust feature extraction in speech emotion recognition,”
Conference on Multimedia, 2021, pp. 1553–1561. in 19th Annual Conference of the International Speech Communication
[266] H. Li, M. Sui, Z. Zhu et al., “Nr-dfernet: Noise-robust network for dy- Association, Interspeech 2018, Hyderabad, India, September 2-6, 2018,
namic facial expression recognition,” arXiv preprint arXiv:2206.04975, B. Yegnanarayana, Ed. ISCA, 2018, pp. 152–156.
2022. [285] M. Jiménez-Guarneros and G. F. Pineda, “Cross-subject eeg-based
[267] Y. Wang, Y. Sun, W. Song, S. Gao, Y. Huang, Z. Chen, W. Ge, and emotion recognition via semisupervised multisource joint distribution
W. Zhang, “Dpcnet: Dual path multi-excitation collaborative network adaptation,” IEEE Trans. Instrum. Meas., vol. 72, pp. 1–11, 2023.
for facial expression representation learning in videos,” in Proceedings [286] Y. Peng, Y. Zhang, W. Kong, F. Nie, B. Lu, and A. Cichocki,
of the 30th ACM International Conference on Multimedia, 2022, pp. “S3 lrr: A unified model for joint discriminative subspace identification
101–110. and semisupervised EEG emotion recognition,” IEEE Trans. Instrum.
[268] Y. Liu, W. Wang, C. Feng, H. Zhang, Z. Chen, and Y. Zhan, “Ex- Meas., vol. 71, pp. 1–13, 2022.
pression snippet transformer for robust video-based facial expression [287] Y. Peng, W. Kong, F. Qin, F. Nie, J. Fang, B. Lu, and A. Cichocki,
recognition,” 2021. “Self-weighted semi-supervised classification for joint eeg-based emo-
[269] F. Ma, B. Sun, and S. Li, “Logo-former: Local-global spatio-temporal tion recognition and affective activation patterns mining,” IEEE Trans.
transformer for dynamic facial expression recognition,” in ICASSP Instrum. Meas., vol. 70, pp. 1–11, 2021.
2023-2023 IEEE International Conference on Acoustics, Speech and [288] X. Quan, Z. Zeng, J. Jiang, Y. Zhang, B. Lu, and D. Wu, “Physio-
Signal Processing (ICASSP). IEEE, 2023, pp. 1–5. logical signals based affective computing: A systematic review,” Acta
[270] H. Li, H. Niu, Z. Zhu, and F. Zhao, “Intensity-aware loss for dynamic Automatica Sinica, vol. 47, no. 8, pp. 1769–1784, 2021.
facial expression recognition in the wild,” 2022.
[289] V. Chaparro, A. Gomez, A. Salgado, O. L. Quintero, N. López, and
[271] H. Liu, R. An, Z. Zhang, B. Ma, W. Zhang, Y. Song, Y. Hu, W. Chen,
L. F. Villa, “Emotion recognition from EEG and facial expressions:
and Y. Ding, “Norface: Improving facial expression analysis by identity
a multimodal approach,” in 40th Annual International Conference of
normalization,” arXiv preprint arXiv:2407.15617, 2024.
the IEEE Engineering in Medicine and Biology Society, EMBC 2018,
[272] H. Huang, X. Qiao, Z. Chen, H. Chen, B. Li, Z. Sun, M. Chen, and
Honolulu, HI, USA, July 18-21, 2018. IEEE, 2018, pp. 530–533.
X. Li, “Crest: Cross-modal resonance through evidential deep learning
for enhanced zero-shot learning,” arXiv preprint arXiv:2404.09640, [290] Y. Huang, J. Yang, P. Liao, and J. Pan, “Fusion of facial expressions and
2024. EEG for multimodal emotion recognition,” Comput. Intell. Neurosci.,
[273] Z. Han, C. Zhang, H. Fu, and J. T. Zhou, “Trusted multi-view vol. 2017, pp. 2 107 451:1–2 107 451:8, 2017.
classification with dynamic evidential fusion,” IEEE transactions on [291] Q. Zhu, G. Lu, and J. Yan, “Valence-arousal model based emotion
pattern analysis and machine intelligence, vol. 45, no. 2, pp. 2551– recognition using eeg, peripheral physiological signals and facial
2566, 2022. expression,” in ICMLSC 2020: The 4th International Conference on
[274] H. Huang, Z. Liu, S. Letchmunan, M. Lin, M. Deveci, W. Pedrycz, Machine Learning and Soft Computing, Haiphong City, Viet Nam,
and P. Siarry, “Evidential deep partial multi-view classification with January 17-19, 2020. ACM, 2020, pp. 81–85.
discount fusion,” arXiv preprint arXiv:2408.13123, 2024. [292] H. Tang, W. Liu, W. Zheng, and B. Lu, “Multimodal emotion recog-
[275] H. Huang, C. Qin, Z. Liu, K. Ma, J. Chen, H. Fang, C. Ban, H. Sun, and nition using deep neural networks,” in Neural Information Processing
Z. He, “Trusted unified feature-neighborhood dynamics for multi-view - 24th International Conference, ICONIP 2017, Guangzhou, China,
classification,” arXiv preprint arXiv:2409.00755, 2024. November 14-18, 2017, Proceedings, Part IV, ser. Lecture Notes in
[276] J. Gao, M. Chen, and C. Xu, “Vectorized evidential learning for Computer Science, D. Liu, S. Xie, Y. Li, D. Zhao, and E. M. El-Alfy,
weakly-supervised temporal action localization,” IEEE Transactions on Eds., vol. 10637. Springer, 2017, pp. 811–819.
Pattern Analysis and Machine Intelligence, vol. 45, no. 12, pp. 15 949– [293] Y. Tan, Z. Sun, F. Duan, J. Solé-Casals, and C. F. Caiafa, “A mul-
15 963, 2023. timodal emotion recognition method based on facial expressions and
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 26

electroencephalography,” Biomed. Signal Process. Control., vol. 70, p. [309] J. Zhou, C. Ma, D. Long, G. Xu, N. Ding, H. Zhang, P. Xie, and
103029, 2021. G. Liu, “Hierarchy-aware global model for hierarchical text classifica-
[294] J. Vazquez-Rodriguez, G. Lefebvre, J. Cumin, and J. L. Crowley, tion,” in Proceedings of the 58th Annual Meeting of the Association
“Emotion recognition with pre-trained transformers using multimodal for Computational Linguistics, ACL 2020, Online, July 5-10, 2020,
signals,” in 10th International Conference on Affective Computing and D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault, Eds. Association
Intelligent Interaction, ACII 2022, Nara, Japan, October 18-21, 2022, for Computational Linguistics, 2020, pp. 1106–1117.
2022, pp. 1–8. [310] G. Hu, G. Lu, and Y. Zhao, “FSS-GCN: A graph convolutional
[295] G. Hu, G. Lu, and Y. Zhao, “Bidirectional hierarchical attention networks with fusion of semantic and structure for emotion cause
networks based on document-level context for emotion cause extrac- analysis,” Knowl. Based Syst., vol. 212, p. 106584, 2021.
tion,” in Findings of the Association for Computational Linguistics: [311] D. Ghosal, N. Majumder, S. Poria, N. Chhaya, and A. F. Gelbukh,
EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16- “Dialoguegcn: A graph convolutional neural network for emotion
20 November, 2021, 2021, pp. 558–568. recognition in conversation,” in Proceedings of the 2019 Conference on
[296] M. Li, H. Zhao, T. Gu, and D. Ying, “Experiencer-driven and Empirical Methods in Natural Language Processing and the 9th Inter-
knowledge-aware graph model for emotion-cause pair extraction,” national Joint Conference on Natural Language Processing, EMNLP-
Knowl. Based Syst., vol. 278, p. 110703, 2023. IJCNLP 2019, Hong Kong, China, November 3-7, 2019, K. Inui,
[297] W. Li, Y. Li, V. Pandelea, M. Ge, L. Zhu, and E. Cambria, “Ecpec: J. Jiang, V. Ng, and X. Wan, Eds. Association for Computational
Emotion-cause pair extraction in conversations,” IEEE Transactions on Linguistics, 2019, pp. 154–164.
Affective Computing, pp. 1–12, 2022. [312] M. Munezero, C. S. Montero, E. Sutinen, and J. Pajunen, “Are they
[298] B. Li, H. Fei, F. Li, T.-s. Chua, and D. Ji, “Multimodal emotion-cause different? affect, feeling, emotion, sentiment, and opinion detection in
pair extraction with holistic interaction and label constraint,” ACM text,” IEEE Trans. Affect. Comput., vol. 5, no. 2, pp. 101–111, 2014.
Trans. Multimedia Comput. Commun. Appl., aug 2024, just Accepted. [313] K. Cheng, Z. Yang, M. Zhang, and Y. Sun, “Uniker: A unified
[Online]. Available: https://doi.org/10.1145/3689646 framework for combining embedding and definite horn rule reasoning
[299] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image for knowledge graph inference,” in Proceedings of the 2021 Conference
pre-training for unified vision-language understanding and generation,” on Empirical Methods in Natural Language Processing, EMNLP 2021,
in International Conference on Machine Learning. PMLR, 2022, pp. Virtual Event / Punta Cana, Dominican Republic, 7-11 November,
12 888–12 900. 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds., pp. 9753–
9771.
[300] Y. Zeng, S. Mai, and H. Hu, “Which is making the contribution: Mod-
[314] W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal,
ulating unimodal and cross-modal dynamics for multimodal sentiment
O. K. Mohammed, S. Singhal, S. Som, and F. Wei, “Image as a foreign
analysis,” in Findings of the Association for Computational Linguistics:
language: Beit pretraining for all vision and vision-language tasks,”
EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20
CoRR, vol. abs/2208.10442, 2022.
November, 2021, 2021, pp. 1262–1274.
[315] D. Hershcovich, S. Frank, H. Lent, M. de Lhoneux, M. Abdou,
[301] C. Fan, H. Yan, J. Du, L. Gui, L. Bing, M. Yang, R. Xu, and
S. Brandl, E. Bugliarello, L. Cabello Piqueras, I. Chalkidis, R. Cui,
R. Mao, “A knowledge regularized hierarchical approach for emotion
C. Fierro, K. Margatina, P. Rust, and A. Søgaard, “Challenges and
cause analysis,” in Proceedings of the 2019 Conference on Empirical
strategies in cross-cultural NLP,” in Proceedings of the 60th Annual
Methods in Natural Language Processing and the 9th International
Meeting of the Association for Computational Linguistics (Volume 1:
Joint Conference on Natural Language Processing, EMNLP-IJCNLP
Long Papers), S. Muresan, P. Nakov, and A. Villavicencio, Eds., May
2019, Hong Kong, China, November 3-7, 2019, 2019, pp. 5613–5623.
2022.
[302] R. Speer, J. Chin, and C. Havasi, “Conceptnet 5.5: An open multilin- [316] S. Hareli, K. Kafetsios, and U. Hess, “A cross-cultural study on emotion
gual graph of general knowledge,” in Proceedings of the Thirty-First expression and the learning of social norms,” Frontiers in psychology,
AAAI Conference on Artificial Intelligence, February 4-9, 2017, San vol. 6, p. 1501, 2015.
Francisco, California, USA, 2017, pp. 4444–4451. [317] M. Obrist, S. A. Seah, and S. Subramanian, “Talking about tactile
[303] E. Cambria, Q. Liu, S. Decherchi, F. Xing, and K. Kwok, “Senticnet 7: experiences,” in Proceedings of Human Factors in Computing Systems,
A commonsense-based neurosymbolic AI framework for explainable 2013, pp. 1659–1668.
sentiment analysis,” in Proceedings of the Thirteenth Language Re-
sources and Evaluation Conference, LREC 2022, Marseille, France, 20-
25 June 2022, N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri,
T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo,
J. Odijk, and S. Piperidis, Eds. European Language Resources
Association, 2022, pp. 3829–3839.
[304] H. Zhang, D. Khashabi, Y. Song, and D. Roth, “Transomcs: From
linguistic graphs to commonsense knowledge,” in Proceedings of the
Twenty-Ninth International Joint Conference on Artificial Intelligence,
IJCAI 2020, 2020, pp. 4004–4010.
[305] A. Bosselut, H. Rashkin, M. Sap, C. Malaviya, A. Celikyilmaz, and
Y. Choi, “COMET: commonsense transformers for automatic knowl-
edge graph construction,” in Proceedings of the 57th Conference of the
Association for Computational Linguistics, ACL 2019, Florence, Italy,
July 28- August 2, 2019, Volume 1: Long Papers, 2019, pp. 4762–4779.
[306] M. Sap, R. L. Bras, E. Allaway, C. Bhagavatula, N. Lourie, H. Rashkin,
B. Roof, N. A. Smith, and Y. Choi, “ATOMIC: an atlas of machine
commonsense for if-then reasoning,” in The Thirty-Third AAAI Confer-
ence on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative
Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth
AAAI Symposium on Educational Advances in Artificial Intelligence,
EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019,
2019, pp. 3027–3035.
[307] D. Li, Y. Li, J. Zhang, K. Li, C. Wei, J. Cui, and B. Wang, “C3KG:
A chinese commonsense conversation knowledge graph,” CoRR, vol.
abs/2204.02549, 2022.
[308] Z. Yang, D. Yang, C. Dyer, X. He, A. J. Smola, and E. H. Hovy, “Hi-
erarchical attention networks for document classification,” in NAACL
HLT 2016, The 2016 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language
Technologies, San Diego California, USA, June 12-17, 2016, K. Knight,
A. Nenkova, and O. Rambow, Eds. The Association for Computational
Linguistics, 2016, pp. 1480–1489.