Recent Trends of Multimodal Affective Computing:
A Survey from an NLP Perspective

Guimin Hu♠, Yi Xin†, Weimin Lyu♡, Haojian Huang♣, Chang Sun⋄, Zhihong Zhu♢, Lin Gui△, Ruichu Cai▽, Erik CambriaL, Hasti Seifi‡
arXiv:2409.07388v2 [cs.CL] 30 Oct 2024

♠ University of Copenhagen, Denmark. Email: [email protected]. † Nanjing University, China. ♡ Stony Brook University, United States. ♣ University of Hong Kong, China. ⋄ University of Bologna, Italy. ♢ Peking University, China. △ King's College London, United Kingdom. ▽ Guangdong University of Technology, China. L Nanyang Technological University, Singapore. ‡ Arizona State University, United States.

Abstract—Multimodal affective computing has garnered increasing attention due to its broad applications in analyzing human behaviors and intentions, especially in the text-dominated multimodal affective computing field. This survey presents the recent trends of multimodal affective computing from an NLP perspective through four hot tasks: multimodal sentiment analysis, multimodal emotion recognition in conversation, multimodal aspect-based sentiment analysis, and multimodal multi-label emotion recognition. The goal of this survey is to explore the current landscape of multimodal affective research, identify development trends, and highlight the similarities and differences across various tasks, offering a comprehensive report on the recent progress in multimodal affective computing from an NLP perspective. This survey covers the formalization of tasks, provides an overview of relevant works, describes benchmark datasets, and details the evaluation metrics for each task. It also briefly discusses research in multimodal affective computing involving facial expressions, acoustic signals, physiological signals, and emotion causes, as well as the technical approaches, challenges, and future directions in multimodal affective computing. Finally, we release a repository that compiles related works in multimodal affective computing, providing detailed resources and references for the community1.

1 https://github.com/LeMei/Affective-Computing

I. INTRODUCTION

Affective computing combines expertise in computer science, psychology, and cognitive science, and its goal is to equip machines with the ability to recognize, interpret, and emulate human emotions [1]–[7]. Today, the world around us comprises various modalities: we perceive objects visually, hear sounds audibly, feel textures tangibly, smell odors olfactorily, and so forth. A modality refers to the way an experience is perceived or occurs, and is often associated with sensory modalities such as vision or touch, which are essential for communication and sensation. The significant advancements in multimodal learning across various fields [8], [9] have garnered increasing attention and accelerated the progress of multimodal affective computing. Multimodal affective computing seeks to develop models capable of interpreting and reasoning about sentiment or emotional states over multiple modalities.

In its early stages, affective computing research predominantly focused on unimodal tasks, examining text-based, audio-based, and vision-based affective computing separately. For instance, D-MILN [10] is a textual sentiment classification model, while the work in [11] utilizes BiLSTM models trained on raw audio to predict the average sentiment of crowd responses. Today, sentiment analysis is widely employed across various modalities for applications such as market research, brand monitoring, customer service analysis, and social media monitoring. Recent advancements in multimedia technology [12]–[15] have diversified the channels for information dissemination, with an influx of news, social media platforms like Weibo, and video content. These developments have integrated textual (spoken features), acoustic (rhythm, pitch), and visual (facial attributes) information to comprehensively analyze human emotions. For example, Xu et al. [16] introduced image modality data into traditional text-based aspect-level sentiment analysis, creating the new task of multimodal aspect-based sentiment analysis. Similarly, Wang et al. [17] extended textual emotion-cause pair extraction to a multimodal conversation setting, utilizing multimodal signals to enhance the model's ability to understand emotions and their causes.

Multimodal affective computing tasks are closely related to several learning paradigms in machine learning, including transfer learning [18]–[20], multimodal learning [21], [22], multi-task learning [23]–[25], and semantic understanding [26], [27]. Transfer learning allows affective analysis models trained in one domain to be adapted for effective performance in different domains: by fine-tuning pre-trained models on limited data from the target domain, these models can be transferred to new domains, thereby enhancing their performance in multimodal affective computing tasks. In multimodal learning, cross-modal attention dynamically aligns and focuses on relevant information from different modalities, enhancing the model's ability to capture sentiment by highlighting key features and their interactions. In multi-task learning, shared representations across affective computing tasks and modalities improve performance by capturing common sentiment-related features from text, audio, and video. More recently, studies of multimodal learning have advanced the field by pre-training multimodal models on extensive multimodal datasets, further improving the performance on downstream tasks such as multimodal sentiment analysis [28]–[31].

With the scaling of pre-trained models, parameter-efficient transfer learning has emerged, with techniques such as adapters [32], prompts [33], instruction-tuning [34], and in-context learning [35], [36]. More and more works in multimodal affective computing leverage these parameter-efficient transfer learning methods to transfer knowledge from pre-trained models (e.g., unimodal or multimodal pre-trained models) to downstream affective tasks, improving model performance by further fine-tuning the pre-trained model. For instance, Zou et al. [37] design a multimodal prompt Transformer (MPT) to perform cross-modal information fusion, and UniMSE [38] proposes an adapter-based modal fusion method that injects acoustic and visual signals into the T5 model to fuse them with multi-level textual information.

Multimodal affective computing encompasses tasks like sentiment analysis, opinion mining, and emotion recognition using modalities such as text, audio, images, video, physiological signals, and haptic feedback. This survey focuses mainly on three key modalities: natural language, visual signals, and vocal signals. We highlight four main tasks in this survey: Multimodal Sentiment Analysis (MSA), Multimodal Emotion Recognition in Conversation (MERC), Multimodal Aspect-Based Sentiment Analysis (MABSA), and Multimodal Multi-label Emotion Recognition (MMER). A considerable volume of studies exists in the field of multimodal affective computing, and several reviews have been published [15], [39]–[43]. However, these reviews primarily focus on specific affective computing tasks or a single modality, and they overlook an overview of multimodal affective computing across multiple tasks as well as the consistencies and differences among these tasks.

The goal of this survey is twofold. First, it aims to provide a comprehensive overview of multimodal affective computing for beginners exploring deep learning in emotion analysis, detailing tasks, inputs, outputs, and relevant datasets. Second, it offers insights for researchers to reflect on past developments, explore future trends, and examine technical approaches, challenges, and research directions in areas such as multimodal sentiment analysis and emotion recognition.

II. ORGANIZATION OF THIS SURVEY

Section III outlines task formalization and application scenarios for multimodal affective tasks. Section IV introduces feature extraction methods and recent multimodal pre-trained models (e.g., CLIP, BLIP, BLIP-2). Section V analyzes multimodal affective works from two perspectives, multimodal fusion and multimodal alignment, and briefly summarizes the parameter-efficient transfer methods used for further tuning pre-trained models. Section VI reviews the literature on MSA, MERC, MABSA, and MMER, focusing on multitask learning, pre-trained models, enhanced knowledge, and contextual information. Furthermore, Section VII summarizes multimodal datasets, and Section VIII covers evaluation metrics for each multimodal affective computing task. After the review of multimodal affective computing works, Section IX briefly surveys multimodal affective computing based on facial expressions, acoustic signals, physiological signals, and emotion causes, and highlights the consistency, differences, and recent trends of multimodal affective computing tasks from an NLP perspective. Section X looks ahead to future work from three aspects: the unification of multimodal affective computing tasks, the incorporation of external knowledge, and affective computing with less-studied modalities. Lastly, Section XI concludes this survey and its contribution to the multimodal affective computing community.

III. MULTIMODAL AFFECTIVE COMPUTING TASKS

In this section, we present the definition of each task and discuss its application scenarios. Table I presents basic information, including task input, output, type, and parent task, for each of the four tasks.

A. Multimodal Sentiment Analysis

Multimodal sentiment analysis (MSA) [44] originates from the sentiment analysis (SA) task [45] and extends SA with multimodal input. As a key research topic for enabling computers to understand human behaviors, the goal of MSA is to predict sentiment polarity and sentiment intensity based on multimodal signals [46]. The task is posed as binary classification and regression.

1) Task Formalization: Given a multimodal signal I_i = {I_i^t, I_i^a, I_i^v}, we use I_i^m, m ∈ {t, a, v}, to represent the unimodal raw sequence drawn from video fragment i, where {t, a, v} denote the three types of modalities: text, acoustic, and visual. Multimodal sentiment analysis aims to predict a real number y_i^r ∈ R, where y_i^r ∈ [-3, 3] reflects the sentiment strength. We feed I_i as the model input and train a model to predict y_i^r.

2) Application Scenarios: We categorize multimodal sentiment analysis applications into key areas: social media monitoring, customer feedback, market research, content creation, healthcare, and product reviews. For example, analyzing sentiment in text, images, and videos on social media helps gauge public opinion and monitor brand perception, while analyzing multimedia product reviews can improve personalized recommendations and user satisfaction.
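To make the MSA formalization above concrete, the sketch below shows a minimal late-fusion regressor in PyTorch. It assumes the three modalities have already been encoded into fixed-size feature vectors; the dimensions, module names, and fusion-by-concatenation design are illustrative rather than any specific published model, and the output is squashed into the [-3, 3] sentiment-strength range used by MOSI/MOSEI-style labels.

```python
import torch
import torch.nn as nn

class LateFusionMSARegressor(nn.Module):
    """Minimal MSA baseline: encode each modality separately,
    concatenate, and regress a sentiment score in [-3, 3]."""

    def __init__(self, d_text=768, d_audio=74, d_vision=35, d_hidden=128):
        super().__init__()
        # One small projection head per modality (dimensions are illustrative).
        self.text_enc = nn.Sequential(nn.Linear(d_text, d_hidden), nn.ReLU())
        self.audio_enc = nn.Sequential(nn.Linear(d_audio, d_hidden), nn.ReLU())
        self.vision_enc = nn.Sequential(nn.Linear(d_vision, d_hidden), nn.ReLU())
        self.regressor = nn.Linear(3 * d_hidden, 1)

    def forward(self, x_text, x_audio, x_vision):
        h = torch.cat([self.text_enc(x_text),
                       self.audio_enc(x_audio),
                       self.vision_enc(x_vision)], dim=-1)
        # tanh keeps the prediction inside the [-3, 3] label range.
        return 3.0 * torch.tanh(self.regressor(h)).squeeze(-1)

model = LateFusionMSARegressor()
y_hat = model(torch.randn(4, 768), torch.randn(4, 74), torch.randn(4, 35))
loss = nn.L1Loss()(y_hat, torch.tensor([1.8, -0.4, 2.6, -2.0]))  # MAE, as commonly reported
```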

B. Multimodal Emotion Recognition in Conversation

MERC [40], [41], [47] extends the emotion recognition in conversation (ERC) task [48], [49] and takes multimodal signals as the model input instead of a single modality. The goal of MERC is to automatically detect and monitor the emotional states of speakers in a dialogue using multimodal signals such as text, audio, and vision. In the multimodal emotion recognition community, MERC is treated as a multi-class classification task that categorizes a given utterance into one basic emotion from a pre-defined emotion set.

1) Task Formalization: Given a dialogue of k utterances, it can be formulated as U = {u_1, ..., u_k}, where u_i = <I_i^t, I_i^a, I_i^v> denotes the i-th utterance of a conversation containing text (transcript), audio (speech segment), and visual (video clip) modalities. We use Y as the label set of U, and each utterance can be formalized as follows:

<U, Y> = {(u_i, y_i) | i ∈ [1, k]}   (1)

Here, y_i indicates the i-th utterance's emotion category, which is predefined.
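The sketch below illustrates how a MERC conversation <U, Y> can be represented and classified with a context-aware model: each utterance carries pre-extracted text/audio/vision features, a bidirectional GRU propagates conversational context across utterances, and a softmax head assigns one emotion per utterance. All names, dimensions, and the emotion inventory are hypothetical; this is a schematic baseline, not any cited system.

```python
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "happy", "sad", "angry", "surprise", "fear", "disgust"]

class ContextualMERC(nn.Module):
    """Utterance-level emotion classifier with dialogue context (schematic)."""

    def __init__(self, d_text=768, d_audio=74, d_vision=35, d_hidden=128):
        super().__init__()
        self.fuse = nn.Linear(d_text + d_audio + d_vision, d_hidden)
        # Bi-GRU over the utterance sequence injects conversational context.
        self.context = nn.GRU(d_hidden, d_hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * d_hidden, len(EMOTIONS))

    def forward(self, text, audio, vision):
        # Each input: (batch, k utterances, feature_dim).
        u = torch.relu(self.fuse(torch.cat([text, audio, vision], dim=-1)))
        ctx, _ = self.context(u)
        return self.classifier(ctx)          # (batch, k, |EMOTIONS|) logits

logits = ContextualMERC()(torch.randn(2, 5, 768), torch.randn(2, 5, 74), torch.randn(2, 5, 35))
labels = torch.randint(0, len(EMOTIONS), (2, 5))
loss = nn.CrossEntropyLoss()(logits.flatten(0, 1), labels.flatten())
```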

TABLE I
THE DETAILS OF MULTIMODAL AFFECTIVE TASKS. T, A, V DENOTE TEXT, AUDIO AND VISUAL MODALITIES, RESPECTIVELY.

Task | Input | Output | Granularity | Task type | Parent task
MSA | T, A, V | score in [-3, 3] | sentence | binary classification, regression | sentiment analysis [45]
MERC | T, A, V | one basic emotion | utterance | multi-class classification | emotion recognition [48], [49]
MABSA | T, V | (aspect, sentiment polarity) | aspect | classification, tuple extraction, triple extraction | aspect-level sentiment analysis [51]–[53]
MMER | T, A, V | two or more basic emotions | utterance | multi-label multi-class classification | emotion recognition [54], [55]

2) Application Scenarios: Multimodal Emotion Recognition in Conversation (MERC) has broad applications across key areas: human-computer interaction, virtual assistants, healthcare, and customer service. (i) In human-computer interaction, MERC enhances user experience by enabling systems to recognize and respond to emotional states, leading to more personalized interactions. (ii) Virtual assistants and chatbots benefit from improved emotional understanding, making conversations more natural and engaging. (iii) In customer service, MERC helps agents better respond to customer emotions, enhancing satisfaction. Additionally, bio-sensing systems measuring physiological signals like ECG, PPG, EEG, and GSR expand MERC applications in robotics, healthcare, and virtual reality.

C. Multimodal Aspect-based Sentiment Analysis

Xu et al. [50] are among the first to put forward the new task of aspect-based multimodal sentiment analysis. Multimodal aspect-based sentiment analysis (MABSA) is constructed based on aspect-based sentiment analysis in texts [51], [52]. In contrast with MSA and ERC, multimodal aspect-based sentiment analysis operates on fine-grained multimodal signals. MABSA receives text and vision (image) modalities as inputs and outputs tuples consisting of an aspect and its sentiment polarity. The task can be viewed as classification, tuple extraction, or triple extraction. Recently, MABSA has attracted increasing attention. Given an image and its corresponding text, MABSA is defined as jointly extracting all aspect terms from image-text pairs and predicting their sentiment polarities, i.e., positive, negative, and neutral.

1) Task Formalization: Suppose the multimodal inputs include a textual content T = {w_1, w_2, ..., w_L} and an image set I = {I_1, I_2, ..., I_K}; the goal of MABSA is to predict the sentiment polarities with a given aspect phrase A = {a_1, a_2, ..., a_N}, where a_i denotes the i-th aspect (e.g., food), L is the length of the textual context, K is the number of images, and N is the length of the aspect phrase.

2) Application Scenarios: Multimodal aspect-based sentiment analysis (MABSA) focuses on improving products and services by analyzing reviews across text, images, and videos to identify customer opinions on specific aspects. For example, MABSA can assess dining experiences, like food quality or service, to enhance restaurant operations. It also applies to social media, where analyzing mixed content provides deeper insights into public opinion, aiding better decision-making and marketing strategies.

D. Multimodal Multi-label Emotion Recognition

Multimodal signals may exhibit more than one emotion label, which motivates a new task: multimodal multi-label emotion recognition (MMER). MMER inherits the characteristics of multimodal emotion recognition and multi-label classification [54], [55]. MMER is developed from multi-label emotion recognition, which predicts two or more basic emotion categories for the given multimodal information; it is therefore a multi-label, multi-class classification task.

1) Task Formalization: Given a multimodal signal I_i = {I_i^t, I_i^a, I_i^v}, I_i contains three types of modalities: text, audio, and visual. Formally, we use I_i^m ∈ R^(d_m × l_m), m ∈ {t, a, v}, to represent the raw sequence of the text, audio, and visual modalities of sample i, where d_m and l_m denote the feature dimension and sequence length of modality m. The goal of MMER is to recognize at least one emotion category from the |L| pre-defined label space Y = {y_1, y_2, ..., y_|L|} according to the multimodal signal I_i.

2) Application Scenarios: Multimodal multi-label emotion recognition seeks to create AI systems that can understand and categorize emotions expressed through various modalities simultaneously. This task is challenging due to the complexity and variability of human emotions, differences in emotional expression across individuals and cultures, and the need for effective integration of diverse modalities.

IV. MODAL FEATURE EXTRACTOR

For multimodal affective computing tasks, the model input typically includes at least two modalities. In this section, we introduce the common feature extractors that transform raw sequences into feature vectors.

a) Text Feature Extractor: For the text modality, researchers adopt static word embedding methods like Word2Vec [56] and GloVe [57] to initialize word representations. Text can also be encoded into feature vectors through pre-trained language models like BERT [58], BART [59], and T5 [60]. More recently, a collection of foundation language models such as LLaMA [61], [62] and Mamba [63] have emerged and are used for encoding the text modality.

b) Audio Feature Extractor: For the audio modality, the raw acoustic input needs to be processed into numerical sequential vectors. A common way is to use librosa (https://github.com/librosa/librosa) to extract the Mel-spectrogram as audio features; it is the short-term power spectrum of sound and is widely used in modern audio processing. The Transformer structure has achieved tremendous success in the fields of NLP and computer vision. Gong et

al. [64] propose audio spectrogram Transformer (AST), which Flamingo [70] allows the models to interpret and generate
converted waveform into a sequence of 128-dimensional log outputs based on combined visual and textual inputs. In con-
Mel filterbank (fbank) features to encode audio modality. trast with prompt, instruction-tuning belongs to the learning
c) Vision Feature Extractor: For image modality, re- paradigm of prompt. Also, models like InstructBLIP [73]
searchers can extract fixed T frames from each segment and FLAN [75] have demonstrated that instruction-tuning not
and use effecientNet [65] pre-trained (supervised) on VG- only improves the model’s adherence to instructions but also
Gface3 and AFEW dataset as vision initial representation. enhances its ability to generalize across tasks. In the commu-
Furthermore, Dosovitskiy et al. [66] propose to use standard nity of multimodal affective computing, researchers can lever-
Transformer directly to images, which split an image into age these parameter-efficient transfer learning methods (e.g.,
patches and provide the sequence of linear embeddings of adapter, prompt and instruction tuning) to transfer knowledge
these patches as an input to a Transformer. CLIP [67] jointly from pre-trained models (e.g., unimodal pre-trained model or
trained image and its caption with the contrastive learning, multimodal pre-trained model) to downstream affective tasks,
thereby extraction vision features that correspond to texts. further tune the pre-trained model with the affective dataset.
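As an illustration of the text and audio extractors discussed above, the snippet below pulls a sentence-level feature from a pre-trained BERT and a log-Mel spectrogram from a waveform with librosa; a ViT/CLIP image encoder can be used analogously for video frames. The audio file path and the [CLS]-pooling choice are placeholders, not prescriptions from the surveyed works.

```python
import librosa
import torch
from transformers import AutoModel, AutoTokenizer

# --- Text: [CLS] embedding from a pre-trained language model ---
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
batch = tokenizer("I felt a bit frustrated today.", return_tensors="pt")
with torch.no_grad():
    text_feat = bert(**batch).last_hidden_state[:, 0]    # shape (1, 768)

# --- Audio: log-Mel spectrogram (hypothetical clip path) ---
waveform, sr = librosa.load("clip_0001.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=64)
audio_feat = librosa.power_to_db(mel)                     # shape (64, n_frames)

print(text_feat.shape, audio_feat.shape)
```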
d) Multimodal Feature Extractor: The emergence of Considering that multimodal affective computing involves
multimodal pre-trained model (MPM) marks a significant ad- multimodal learning, therefore, we analyze multimodal affec-
vancement in integrating multimodal signals, as demonstrated tive computing works from multimodal fusion and multimodal
by groundbreaking developments like GPT-4 [68] and Gem- alignment, as shown in Fig. 1.
ini [69]. Among the open-source innovations, Flamingo [70]
represents an early effort to integrate visual features with
B. Multimodal Fusion
LLMs using cross-attention layers. BLIP-2 [71] introduces a
trainable adaptor module (Q-Former) that efficiently connects Multimodal signals are heterogeneous and derived from
a pre-trained image encoder with a pre-trained LLM, ensuring various information sources, making integrating multimodal
precise alignment of visual and textual information. Similarly, signals into one representation essential. Tasi et al. [77] sum-
MiniGPT-4 [72] achieves visual and textual alignment through marize multimodal fusion into early, late or intermediate fusion
a linear projection layer. InstructBLIP [73] advances the field based on the fusion stage. Early fusion combines features
by focusing on vision-language instruction tuning, building from different modalities at the input level before the model
upon BLIP-2, and requiring a deeper understanding and larger processes them. Late fusion processes features from different
datasets for effective training. LLaVA [74] integrates CLIP’s modalities separately through individual sub-networks, and the
image encoder with LLaMA’s language decoder to enhance outputs of these sub-networks are combined at a later stage,
instruction tuning capabilities. Akbari et al. [31] train VATT typically just before making the final decision. Late fusion uses
end-to-end from scratch using multimodal contrastive losses unimodal decision values and combines them using mecha-
and evaluate its performance by the downstream tasks of nisms such as averaging [124], voting schemes [125], weight-
video action recognition, audio event classification, image ing based on channel noise [126] and signal variance [127],
classification, and text-to-video retrieval. Based on multimodal or a learned model [6], [128]. The two fusion strategies face
pre-trained model, raw modal signals can be used to extract some problems. For example, early fusion at the feature level
modal features. can underrate intra-modal dynamics after the fusion operation,
while late fusion at the decision level may struggle to capture
V. M ULTIMODAL L EARNING ON M ULTIMODAL A FFECTIVE inter-modal dynamics before the fusion operation. Different
C OMPUTING from the previous two methods by combining features from
different modalities at intermediate layers of the model learner,
Multimodal learning involves learning representations from Intermediate fusion allows for more interaction between the
different modalities. Generally, the multimodal model should modalities at different processing stages, potentially leading to
first align the modalities based on their semantics before fusing richer representations [38], [129], [130]. Based on these fusion
multimodal signals. After alignment, the model combines strategies, we review multimodal fusion from three aspects:
multiple modalities into one representation vector. cross-modality learning, modal consistency and difference, and
multi-stage modal fusion. Fig. 2 illustrates the three aspects
A. Preliminary of modal fusion.
With the scaling of the pre-trained model, parameter- 1) Cross-modality Learning: Cross-modality learning fo-
efficient transfer learning emerges such as adapter [32], cuses on the incorporation of inter-modality dependencies and
prompt [33], instruction-tuning [34] and in-context learn- interactions for better modal fusion in representation learn-
ing [35], [36]. In this paradigm, instead of adapting pre- ing. Early works of multimodal fusion [76] mainly operate
trained LMs to downstream tasks via objective engineering, geometric manipulation in the feature spaces to fuse multiple
downstream tasks are reformulated to look more like those modalities. The recent common way of cross-modality learn-
solved during the original LM training with the help of prompt, ing is to introduce attention-based learning method to model
instruction-tuning and in-context learning. The use of prompts inter-modality and intra-modality interactions. For example,
in Vision Language Models (VLMs) like GPT-4V [68] and MuLT [77] proposes multimodal Transformer to learn inter-
modal interaction. Chen et al. [78] augment the inter-intra
3 https://www.robots.ox.ac.uk/ vgg/software/vgg face/. modal features with trimodal collaborative interaction and
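The block below sketches the core of attention-based cross-modality learning: queries come from a target modality (e.g., text) while keys and values come from a source modality (e.g., audio), so the target representation is enriched with cross-modal context. It is a schematic single block in the spirit of cross-modal Transformers such as MulT [77], not a reproduction of any specific cited model; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One cross-modal attention block: the target modality attends to the source."""

    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, target, source):
        # Queries from the target modality, keys/values from the source modality.
        attended, _ = self.attn(query=target, key=source, value=source)
        x = self.norm1(target + attended)
        return self.norm2(x + self.ffn(x))

text_seq = torch.randn(2, 50, 128)    # (batch, text length, d_model)
audio_seq = torch.randn(2, 200, 128)  # (batch, audio frames, d_model)
text_enriched_by_audio = CrossModalBlock()(text_seq, audio_seq)  # (2, 50, 128)
```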

Fig. 1. Taxonomy of multimodal affective computing from multimodal fusion and multimodal alignment:

Multimodal Learning on Affective Computing
- Multimodal Fusion (§V-B)
  - Cross-modal Learning: TFN [76], MuLT [77], TCDN [78], CM-BERT [79], HGraph-CL [80], BAFN [81], TeFNA [82], CMCF-SRNet [83], MultiEMO [84], MM-RBN [85], MAGDRA [86], AMuSE [87].
  - Modal Consistency and Difference: MMIM [88], MPT [89], MMMIE [90], MISA [91], CoolNet [92], ModalNet [93], MAN [87], TAILOR [94], AMP [95], STCN [96].
  - Multi-stage Modal Fusion: TSCL-FHFN [97], HFFN [98], CLMLF [99], RMFN [100], CTFN [101], MCM [102], FmlMSN [103], ScaleVLAD [104], MUG [105], HFCE [106], MTAG [107], CHFusion [108].
- Multimodal Alignment (§V-C)
  - Miss Modality: MMIN [109], CMAL [110], M2R2 [111], EMMR [112], TFR-Net [113], MRAN [114], VIGAN [115], TATE [116], IF-MMIN [117], CTFN [101], MTMSA [118], FGR [119], MMTE+AMMTD [120].
  - Semantic Alignment: MuLT [77], ScaleVLAD [104], Robust-MSA [121], HGraph-CL [80], SPIM [122], MA-CMU-SGRNet [123].
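For the "Modal Consistency and Difference" branch above, a common recipe is to project each modality into a shared (modality-invariant) space and a private (modality-specific) space, pulling the shared projections together while keeping shared and private parts apart. The sketch below shows one simple way to express those two losses; it is a schematic illustration of the idea behind MISA [91]-style decompositions, with all layer sizes and loss weights made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_in, d_lat = 128, 64
shared_proj = nn.Linear(d_in, d_lat)                          # one shared projector for all modalities
private_proj = {m: nn.Linear(d_in, d_lat) for m in ("t", "a", "v")}

feats = {m: torch.randn(8, d_in) for m in ("t", "a", "v")}    # unimodal features for one batch
shared = {m: shared_proj(x) for m, x in feats.items()}
private = {m: private_proj[m](x) for m, x in feats.items()}

# Consistency: shared projections of the same sample should agree across modalities.
consistency = (F.mse_loss(shared["t"], shared["a"])
               + F.mse_loss(shared["t"], shared["v"])
               + F.mse_loss(shared["a"], shared["v"]))

# Difference: shared and private subspaces of each modality should be (softly) orthogonal.
difference = sum((shared[m] * private[m]).sum(dim=-1).pow(2).mean() for m in ("t", "a", "v"))

loss = consistency + 0.3 * difference    # 0.3 is an arbitrary illustrative weight
```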

Fig. 2. Illustration of multimodal fusion from the following aspects: 1) cross-modality modal fusion, 2) modal fusion based on modal consistency and difference, and 3) multi-stage modal fusion.
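The two canonical fusion stages discussed in Section V-B can be contrasted in a few lines: early fusion concatenates features before a single predictor, while late fusion combines per-modality decisions afterwards. The sketch below, with arbitrary dimensions and a regression head for MSA-style scores, is only meant to make that distinction concrete; it is not drawn from any specific cited system.

```python
import torch
import torch.nn as nn

d_t, d_a, d_v = 768, 74, 35

# Early fusion: concatenate unimodal features, then one shared predictor.
early_head = nn.Linear(d_t + d_a + d_v, 1)
def early_fusion(t, a, v):
    return early_head(torch.cat([t, a, v], dim=-1))

# Late fusion: one predictor per modality, decisions combined afterwards (here: averaging).
heads = nn.ModuleDict({"t": nn.Linear(d_t, 1), "a": nn.Linear(d_a, 1), "v": nn.Linear(d_v, 1)})
def late_fusion(t, a, v):
    return (heads["t"](t) + heads["a"](a) + heads["v"](v)) / 3.0

t, a, v = torch.randn(4, d_t), torch.randn(4, d_a), torch.randn(4, d_v)
print(early_fusion(t, a, v).shape, late_fusion(t, a, v).shape)   # both (4, 1)
```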

unifies the characteristics of the three modals (inter-modal). while modal difference leverages complementary informa-
Yang et al. [79] propose the cross-modal BERT (CM-BERT), tion from each modality to improve overall data understand-
aiming to model the interaction of text and audio modality ing. For example, several works [89], [90] have explored
based on pre-trained BERT model. Lin et al. [80] explore the learning modal consistency and difference using contrastive
intricate relations of intra- and inter-modal representations for learning. Han et al. [88] maximized the mutual information
sentiment extraction. More recently, Tang et al. [81] propose between modalities and between each modality to explore the
the multimodal dynamic enhanced block to capture the intra- modal consistency. Another study [89] proposes a hybrid con-
modality sentiment context, which decrease the intra-modality trastive learning framework that performs intra-/inter-modal
redundancy of auxiliary modalities. Huang et al. [82] propose contrastive learning and semi-contrastive learning simultane-
a Text-centered fusion network with cross-modal attention ously, models cross-modal interactions, preserves inter-class
(TeFNA), a multimodal fusion network that uses crossmodal relationships, and reduces the modality gap. Additionally,
attention to model unaligned multimodal timing information. Zheng et al. [90] combined mutual information maximization
In the community of emotion recognition, CMCF-SRNet [83] between modal pairs with mutual information minimization
is a cross-modality context fusion and semantic refinement net- between input data and corresponding features. This method
work, which contains a cross-modal locality-constrained trans- aims to extract modal-invariant and task-related information.
former and a graph-based semantic refinement transformer, Modal consistency can also be viewed as the process of
aiming to explore the multimodal interaction and dependencies projecting multiple modalities into a common latent space
among utterances. Shi et al. [84] propose an attention-based (modality-invariant representation), while modal difference
correlation-aware multimodal fusion framework MultiEMO, refers to projecting modalities into modality-specific repre-
which captures cross-modal mapping relationships across sentation spaces. For example, Hazarika et al. [91] propose
textual, audio and visual modalities based on bidirectional a method that projects each modality into both a modality-
multi-head cross attention layers. In summary, cross-modality invariant and a modality-specific space. They implemented
learning mainly focuses on modeling the relation between a decoder to reconstruct the original modal representation
modalities. using both modality-invariant and modality-specific features.
2) Modal Consistency and Difference: Modal consistency AMuSE [87] proposes a multimodal attention network to
refers to the shared feature space across different modali- capture cross-modal interactions at various levels of spatial
ties for the same sample, while modal difference highlights abstraction by jointly learning its interactive bunch of mode-
the unique information each modality provides. Most multi- specific peripheral and central networks. For the fine-grain
modal fusion approaches separate representations into modal- sentiment analysis, Xiao et al. [92] present CoolNet to boost
invariant (consistency) and modal-specific (difference) com- the performance of visual-language models in seamlessly
ponents. Modal consistency helps handle missing modalities, integrating vision and language information. Zhang et al. [93]

multi-task learning, Peng et al. [103] propose a fine-grained


modal label-based multi-stage network (FmlMSN), which uti-
lize seven sentiment labels in unimodal, bimodal and trimodal
information at different granularities from text, audio, image
and the combinations of them. Researchers generally focus
on the scale-level modal alignment and modal fusion before
model’ decision. Sharafi et al. [96] design a new fusion method
was proposed for multimodal emotion recognition utilizing
different scales.
C. Multimodal Alignment
(a) Multimodal alignment involves synchronizing modal se-
mantics before fusing multimodal data. A key challenge is
handling missing modalities, which can occur due to issues
like a camera being turned off, a user being silent, or device
errors affecting both voice and text. Since the assumption of
mask always having all modalities is often unrealistic, multimodal
alignment must address these gaps. Additionally, it involves
aligning objects across images, text, and audio through se-
mantic alignment. Thus, we discuss multimodal alignment in
mask terms of managing missing modalities and achieving semantic
alignment. Fig. 3 illustrates the multimodal alignment.
rustrated ..felt a bit frustrated 1) Alignment for Missing Modality: In real-world scenar-
ios, data collection can sometimes result in the simultaneous
loss of certain modalities due to unforeseen events. While mul-
(b)
timodal affective computing typically assumes the availability
Fig. 3. Illustration multimodal alignment:(a) semantic alignment and (b) of all modalities, this assumption often fails in practice, which
alignment with missing modal fragments.
can cause issues in modal fusion and alignment models when
some modalities are missing. We classify existing methods for
handling missing modalities into four groups.
propose an aspect-level sentiment classification model by The first group features the data augmentation approach,
exploring modal consistency with fusion discriminant attention which randomly ablates the inputs to mimic missing modality
network. cases. Parthasarathy et al. [110] propose a strategy to randomly
3) Multi-stage Modal Fusion: Multi-stage multimodal fu- ablate visual inputs during training at the clip or frame level
sion [131], [132] refers to combine modal information ex- to mimic real world scenarios. Wang et al. [111] deal with
tracted from multiple stages or multiple scales to fuse modal the utterance-level modalities missing problem by training
representation. Li et al. [97] design a two-stage contrastive emotion recognition model with iterative data augmentation by
learning task, which learns similar features for data with the learned common representation. The second group is based on
same emotion category and learns distinguishable features for generative methods to directly predict the missing modalities
data with different emotion categories. HFFN [98] divides given the available modalities [133]. For example, Zhao et
the procession of multimodal fusion into divide, conquer and al. [109] propose a missing modality imagination network
combine, which learns local interactions at each local chunk (MMIN), which can predict the representation of any missing
and explores global interactions by conveying information modality given available modalities under different missing
across local interactions. Different from the work of HFFN, modality conditions, so as to to deal with the uncertain missing
Li et al. [99] align and fused the token-level features of text modality problem. Zeng et al. [112] propose an ensemble-
and image and designed label based contrastive learning and based missing modality reconstruction (EMMR) network to
data based contrastive learning to capture common features detect and recover semantic features of the key missing
related to sentiment in multimodal data. There are some modality. Yuan et al. [113] propose a transformer-based fea-
work [100] decomposed the fusion procession into multiple ture reconstruction network (TFR-Net), which improves the
stages, each of them focused on a subset of multimodal robustness of models for the random missing in non-aligned
signals for specialized, effective fusion. Also, CTFN [108] modality sequences. Luo et al. [114] propose the multimodal
presents a novel feature fusion strategy that proceeds in a reconstruction and align net (MRAN) to tackle the missing
hierarchical fashion, first fusing the modalities two in two and modality problem, especially to relieve the decline caused by
only then fusing all three modalities. Moreover, the modal the text modality’s absence.
fusion at multiple levels has made progress, such as Li et The third group aims to learn the joint multimodal rep-
al. [102] propose a multimodal sentiment analysis method resentations that can contain related information from these
based on multi-level correlation mining and self-supervised modalities [134]. For example, Ma et al. [135] propose a

unified deep learning framework to efficiently handle missing Lai et al. [122] propose a deep modal shared information
labels and missing modalities for audio-visual emotion recog- learning module based on the covariance matrix to capture the
nition through correlation analysis. Zeng et al. [116] propose shared information between modalities. Additionally, we use
a tag-assisted Transformer encoder (TATE) network to handle a label generation module based on a self- supervised learning
the problem of missing uncertain modalities, which designs strategy to capture the private information of the modalities.
a tag encoding module to cover both the single modality and Our module is plug-and-play in multimodal tasks, and by
multiple modalities missing cases, so as to guide the network’s changing the parameterization, it can adjust the information
attention to those missing modalities. Zuo et al. [117] propose exchange relationship between the modes and learn the private
to use invariant features for a missing modality imagination or shared information between the specified modes. We also
network (IF-MMIN), which includes an invariant feature learn- employ a multi-task learning strategy to help the model focus
ing strategy and an invariant feature based imagination module its attention on the modal differentiation training data. For
(IF-IM). Through the two strategies, IF-MMIN can alleviate model robustness, Robust-MSA [121] present an interactive
the modality gap during the missing modalities prediction, thus platform that visualizes the impact of modality noise to help
improving the robustness of multimodal joint representation. researchers improve model capacity.
Zhou et al. [119] propose a novel brain tumor segmentation
network in the case of missing one or more modalities. VI. M ODELS ACROSS M ULTIMODAL A FFECTIVE
The proposed network consists of three sub-networks: a C OMPUTING
feature-enhanced generator, a correlation constraint block and In the community of multimodal affective computing, the
a segmentation network. The last group is translation-base works appear to significant consistency in term of development
methods. Tang et al. [101] propose the coupled-translation technical route. For clarity, we group the these works based
fusion network (CTFN) to model bi-direction interplay via on multitask learning, pre-trained model, enhanced knowledge,
couple learning, ensuring the robustness in respect to missing contextual information. Meanwhile, we briefly summarized the
modalities. Liu et al. [118] propose a modality translation- advancements of MSA, MERC, MABSA and MMER tasks
based MSA model (MTMSA), which is robust to uncertain through the above four aspects. Fig. 4 summarizes the typical
missing modalities. In summary, the works about alignment works of multimodal affective computing from these aspects
for miss modality focus on miss modality reconstruction and and Table II shows the taxonomy of multimodal affective
learning based on the available modal information. computing.
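A minimal version of the reconstruction-style strategies summarized above is to train a small network that imagines the missing modality's representation from the available ones, so the downstream fusion model can still run when, say, audio is absent. The sketch below is a generic illustration of that idea under assumed feature dimensions, not an implementation of MMIN, TFR-Net, or any other cited method.

```python
import torch
import torch.nn as nn

class AudioImaginer(nn.Module):
    """Predict a (missing) audio representation from text and vision features."""

    def __init__(self, d_text=128, d_vision=128, d_audio=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_text + d_vision, 256), nn.ReLU(),
                                 nn.Linear(256, d_audio))

    def forward(self, h_text, h_vision):
        return self.net(torch.cat([h_text, h_vision], dim=-1))

imaginer = AudioImaginer()
h_t, h_v, h_a = torch.randn(16, 128), torch.randn(16, 128), torch.randn(16, 128)

# Training time: full-modality data supervises the imagination module.
recon_loss = nn.MSELoss()(imaginer(h_t, h_v), h_a)

# Test time with audio missing: substitute the imagined representation.
h_a_hat = imaginer(h_t, h_v)
fused = torch.cat([h_t, h_a_hat, h_v], dim=-1)   # fed to the usual fusion/prediction model
```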
2) Alignment for Cross-modal Semantics: Semantic align-
ment aims to find the connection between multiple modalities
in one sample, which refers to searching one modal informa- A. Multitask Learning
tion through another modal information and vice versa. In the Multitask learning trains a model on multiple related tasks
filed of MSA, Tsai et al. [77] leverage cross-modality and simultaneously, using shared information to enhance perfor-
multi-scale modal alignment to implement the modal consis- mance. The loss function combines losses from all tasks,
tency in the semantic aspects, respectively. ScaleVLAD [202] with model parameters updated via gradient descent. In mul-
proposes a fusion model to gather multi-Scale representation timodal affective computing, multitask learning helps distin-
from text, video, and audio with shared vectors of locally guish between modal-invariant and modal-specific features and
aggregated descriptors to improve unaligned multimodal senti- integrates emotion-related sub-tasks into a unified framework.
ment analysis. Yang et al. [107] convert unaligned multimodal Fig. 5 shows the learning paradigm of multitask learning in
sequence data into a graph with heterogeneous nodes and multimodal affective learning task.
edges that captures the rich interactions across modalities and 1) Multimodal Sentiment Analysis: In filed of multi-
through time. Lee et al. [203] segment the audio and the under- modal sentiment analysis, Self-MM [136] generates a pseudo-
lying text signals into equal number of steps in an aligned way label [207]–[209] for single modality and then jointly train
so that the same time steps of the sequential signals cover the unimodal and multimodal representations based on the gener-
same time span in the signals. Zong et al. [204] exploit multi- ated and original labels. Furthermore, a translation framework
ple bi-direction translations, leading to double multimodal fus- ARGF between modalities, i.e., translating from one modality
ing embeddings compared with traditional translation methods. to another is used as an auxiliary task to regualize the
Wang et al. [205] propose a multimodal encoding–decoding multimodal representation learning [137]. Akhtar et al. [138]
translation network with a transformer and adopted a joint en- leverage the interdependence of the tasks sentiment and emo-
coding–decoding method with text as the primary information tion to improve the model performance on two tasks. Chen et
and sound and image as the secondary information. Zhang al. [139] propose a video-based cross-modal auxiliary network
et al. [123] propose a novel multi-level alignment to bridge (VCAN), which is comprised of an audio features map module
the gap between acoustic and lexical modalities, which can and a cross-modal selection module to make use of auxiliary
effectively contrast both the instance-level and prototype-level information. Zheng et al. [140] propose a disentanglement
relationships, separating the multimodal features in the latent translation network (DTN) with slack reconstruction to cap-
space. Yu et al. [206] propose an unsupervised approach which ture desirable information properties, obtain a unified feature
minimizes the Wasserstein distance between both modalities, distribution and reduce redundancy. Zheng et al. [90] com-
forcing both encoders to produce more appropriate repre- bine mutual information maximization (MMMIE) between
sentations for the final extraction to align text and image. modal pairs with mutual information minimization between

Fig. 4. Taxonomy of multimodal affective computing works from the aspects of multitask learning, pre-trained model, enhanced knowledge, and contextual information:

Multimodal Affective Computing
- Multitask Learning (§VI-A)
  - MSA (§VI-A1): Self-MM [136], ARGF [137], MultiSE [138], VCAN [139], DTN [140], MMMIE [90], MMIM [88], MISA [91]
  - MERC (§VI-A2): FacialMMT [25], MMMIE [90], AuxEmo [141], TDFNet [142], MALN [143], LGCCT [144], MultiEMO [84], RLEMO [145]
  - MABSA (§VI-A3): CMMT [146], AbCoRD [147], JML [148], MPT [37], MMRBN [85]
  - MMER (§VI-A4): AMP [95], MEGLN-LDA [149], MultiSE [138]
- Pre-trained Model (§VI-B)
  - MSA (§VI-B1): MAG-XLNet [22], UniMSE [38], AOBERT [150], SKESL [151], TEASAL [152], TO-BERT [153], SPT [154], ALMT [155]
  - MERC (§VI-B2): FacialMMT [25], QAP [20], UniMSE [38], GraphSmile [156]
  - MABSA (§VI-B3): MIMN [25], GMP [18], ERUP [157], VLP-MABSA [158], DR-BERT [159], DTCA [160], MSRA [161], AOF-ABSA [162], AD-GCFN [163], MOCOLNet [164]
  - MMER (§VI-B4): TAILOR [94]
- Enhanced Knowledge (§VI-C)
  - MSA (§VI-C1): TETFN [165], ITP [19], SKEAFN [166], SAWFN [167], MTAG [107]
  - MERC (§VI-C2): ConSK-GCN [168], DMD [169], MRST [170], SF [171], TGMFN [172], RLEMO [145], DEAN [173]
  - MABSA (§VI-C3): KNIT [174], FITE [175], CoolNet [176], HIMT [177]
  - MMER (§VI-C4): UniVA-RoBERTa [178], CARAT [179], M3TR [180], MAGDRA [86], HHMPN [181]
- Contextual Information (§VI-D)
  - MSA (§VI-D1): MuLT [77], CIA [182], CAT-LSTM [183], CAMFNet [184], MTAG [107], CTNet [185], ScaleVLAD [104], MMML [186], GFML [186], CHFusion [108]
  - MERC (§VI-D2): CMCF-SRNet [83], MMGCN [187], MM-DFN [188], SAMGN [189], M3Net [190], M3GAT [187], RL-EMO [145], SCMFN [191], EmoCaps [192], GA2MIF [193], MALN [143], COGMEN [49]
  - MABSA (§VI-D3): DTCA [160], MCPR [194], Elbphilharmonie [195], M2DF [196], AoM [197], FGSN [198], MIMN [16]
  - MMER (§VI-D4): MMS2S [199], MESGN [200], MDI [201]
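Before turning to the individual tasks, the sketch below shows the generic multitask pattern that the following subsections build on: a shared encoder over the fused multimodal representation feeds several task-specific heads (here a sentiment regression head and an emotion classification head), and their losses are summed with illustrative weights. It is a schematic template with assumed dimensions, not any particular cited framework.

```python
import torch
import torch.nn as nn

class MultitaskAffectModel(nn.Module):
    """Shared fused representation with task-specific heads (schematic)."""

    def __init__(self, d_in=768 + 74 + 35, d_hidden=256, n_emotions=7):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.sentiment_head = nn.Linear(d_hidden, 1)          # MSA-style regression
        self.emotion_head = nn.Linear(d_hidden, n_emotions)   # MERC-style classification

    def forward(self, fused_features):
        h = self.shared(fused_features)
        return self.sentiment_head(h).squeeze(-1), self.emotion_head(h)

model = MultitaskAffectModel()
x = torch.randn(8, 768 + 74 + 35)
sent_pred, emo_logits = model(x)
loss = (nn.L1Loss()(sent_pred, torch.randn(8))                                  # sentiment task
        + 0.5 * nn.CrossEntropyLoss()(emo_logits, torch.randint(0, 7, (8,))))   # emotion task
```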

unsupervised face clustering, and face matching in a unified


architecture, so as to leverages the frame-level facial emotion
distributions to help improve utterance-level emotion recogni-
Task A
knowledge sharing

tion based on multi-task learning. Zhang et al. [210] design


two kinds of multitask learning (MTL) decoders, i.e., single-
audio level and multi-level decoders, to explore their potential. More
shared specifically, the core of a single-level decoder is a masked
encoder Task B outer-modal self-attention mechanism. Sun et al. [141] design
vision two auxiliary tasks to alleviate the insufficient fusion between
modalities and guide the network to capture and align emotion-
related features. Zhao et al. [142] propose a transformer-
text Task C based deep-scale fusion network (TDFNet) for multimodal
emotion recognition, solving the aforementioned problems.
The multimodal embedding (ME) module in TDFNet uses
pre-trained models to alleviate the data scarcity problem by
Fig. 5. Illustration of multitask learning in multimodal affective computing providing a prior knowledge of multimodal information to the
tasks. model with the help of a large amount of unlabeled data. Ren
et al. [143] propose a novel multimodal adversarial learning
network (MALN), which first mines the speaker’s charac-
input data and corresponding features, jointly extract modal- teristics from context sequences and then incorporate them
invariant and task-related information in a single architecture. with the unimodal features. Liu et al. [144] propose LGCCT,
a light gated and crossed complementation transformer for
2) Multimodal Emotion Recognition in Conversation:
multimodal speech emotion recognition.
In community of multimodal emotion recognition, Zheng
et al. [25] propose a two-stage framework named Fa- 3) Multimodal Aspect-based Sentiment Analysis: Yang et
cial expression-aware multimodal multi-task learning (Fa- al. [146] propose a multi-task learning framework named
cialMMT), which jointly trains multimodal face recognition, cross-modal multitask Transformer (CMMT), which incor-

porates two auxiliary tasks to learn the aspect/sentiment-


aware intra-modal representations and introduces a text-guided
cross-modal interaction module to dynamically control the
contributions of the visual information to the representation
of each word in the inter-modal interaction. Jain et al. [147] audio PLM
propose a hierarchical multimodal generative approach (Ab- encoder
CoRD) for aspect-based complaint and rationale detection that
reframes the multitasking problem as a multimodal text-to-text
generation task. Ju et al. [148] is the first to jointly perform
vision
multimodal ATE (MATE) and multimodal ASC (MASC),
and propose a joint framework JML with auxiliary cross- probe fusion
modal relation detection for multimodal aspect-level sentiment
analysis (MALSA) to control the proper exploitation of visual
text
information. Zou et al. [37] design a multimodal prompt Trans-
former (MPT) to perform cross-modal information fusion. Fig. 6. An illustration of pre-trained model in multimodal affective computing
tasks. PLM denotes pre-trained language model.
Meanwhile, this work used the hybrid contrastive learning
(HCL) strategy to optimize the model’s ability to handle
labels with few samples. Chen et al. [85] design that audio
module should be more expressive than the text module, AOBERT [150] introduces a single-stream transformer struc-
and the single-modality emotional representation should be ture, which integrates all modalities into one BERT model.
dynamically fused into the multimodal emotion representation, Qian et al. [151] embed sentiment information at the word
and proposes corresponding rule-based multimodal multi-task level into pre-trained multimodal representation to facilitate
network (MMRBN) to restrict the representation learning. further learning on limited labeled data. TEASAL [152] is
a Transformer-Based speech-prefixed language model, which
4) Multimodal Multi-label Emotion Recognition: For mul-
exploits a conventional pre-trained language model as a cross-
timodal multi-label emotion recognition, Ge et al. [95] design
modal Transformer encoder. Yu et al. [153] study target-
adversarial temporal masking strategy and adversarial param-
oriented multimodal sentiment classification (TMSC) and pro-
eter perturbation strategy to jointly enhance the encoding of
pose a multimodal BERT architecture for multimodal senti-
other modalities and generalization of the model respectively.
ment analysis task. Cheng et al. [154] set layer-wise parame-
MER-MULTI [149] is label distribution adaptation which
ter sharing and factorized co-attention that share parameters
adapts the label distribution between the training set and
between cross attention blocks, so as to allow multimodal
testing set to remove training samples that do not match
signal to interact within every layer. ALMT [155] incorporates
the features of the testing set. Akhtar et al. [211] present
an adaptive hyper-modality learning (AHL) module to learn
a deep multi-task learning framework that jointly performs
an irrelevance/conflict-suppressing representation from visual
sentiment and emotion analysis both, which leverage the inter-
and audio features under the guidance of language features at
dependence of two related tasks (i.e. sentiment and emotion)
different scales.
in improving each others performance using an effective
multimodal framework. 2) Multimodal Emotion Recognition in Conversation: In
the domain of multimodal emotion recognition in conversa-
tion, FacialMMT [25] is a two-stage framework, which takes
B. Pre-trained Model RoBERTa [216] and Swin Transformer as the backbone for
In recent years, large language model (LLM) [59], [212] representation learning. Qiu et al. [217] adopt VATT [31]
and multimodal pre-trained model [22], [27], [213], [214] has to encode vision, text and audio respectively, and makes an
achieved significant progress [26], [212], [215]. Compared alignment between the learned modal representation. QAP [20]
with non-pretrained model, pre-trained model contains massive is a quantum-inspired adaptive-priority-learning model, which
transferred knowledge [28], [32], which can be introduced employs ALBERT as the text encoder and introduces quantum
into multimodal representation learning to probe the richer theory (QT) to learn modal priority adaptively. UniMSE [38]
information. Fig. 6 shows the use of pre-trained model in proposes a multimodal fusion method based on pre-trained
multimodal affective learning task. model T5, aiming to fuse the modal information with pre-
1) Multimodal Sentiment Analysis: In filed of multimodal trained knowledge. GraphSmile [156] adopts Roberta [216]
sentiment analysis, Rahman et al. [22] propose an attachment to track intricate emotional cues in multimodal dialogues,
to pre-trained model BERT and XLNet called multimodal alternately assimilating inter-modal and intra-modal emotional
adaptation gate (MAG), which allows BERT and XL-Net to dependencies layer by layer, adequately capturing cross-modal
accept multimodal nonverbal data by generating a shift that is cues while effectively circumventing fusion conflicts.
conditioned on the visual and acoustic modalities to internal 3) Multimodal Aspect-based Sentiment Analysis: In the
representation of BERT and XLNet. UniMSE [38] is a unified study of multimodal aspect-based sentiment analysis, Xu et
sentiment-sharing framework based on T5 model [60], which al. [50] are the first to put forward the task, multimodal aspect-
injects the non-verbal signals into pre-trained Transformer- based sentiment analysis, and propose a novel multi-interactive
based model for probing the knowledge stored in LLM. memory network (MIMN), which includes two interactive

TABLE II
TAXONOMY OF MULTIMODAL AFFECTIVE COMPUTING MODELS ACROSS THE MSA, MERC, MABSA AND MMER TASKS.

Task Model Model architecture Miss Modality Datasets


TFN [76] LSTM % MOSI,MOSEI
MuLT [77] cross-attention % MOSI,MOSEI
MISA [91] Transformer % MOSI,MOSEI,UR FUNNY
Self-MM [136] BiLSTM % MOSI,MOSEI,SIMS
MTAG [107] Graph-based neural network % MOSI, IEMOCAP
ScaleVLAD [104] VLAD,CNN % MOSI,MOSEI,IEMOCAP
MMIM [88] BERT, CPC % MOSI,MOSEI
UniMSE [38] Adapter, T5, contrastive learning % MOSI,MOSEI,IEMOCAP,MELD
MSA
CHFusion [108] CNN % MOSI,IEMOCAP
MMHA [218] Multi-head Attention % MOUD,MOSI
QMR [219] Quantum Language Mode % Getty Images,Flickr
RMFN [100] LSTHM % MOSI
HMM-BLSTM [220] HMM,BiLSTM % IEMOCAP
HALCB [221] Cognitive brain limbic system % MOSI,YouTube,MOSEI
CSFC [222] CNN,fuzzy logic % Alh,Mos,Sag,MOUD
SFNN [223] CNN,attention % vista-net
GFML [186] Multi-head attention % MOSEI,MOSI
MMML [186] cross-attention % MOSEI,MOSI,CH-SIMS
TATE [116] Transformer ! IEMOCAP,MOSI
FacialMMT-RoBERTa [224] RoBERTa % MELD, Aff-Wild2
MultiEMO [84] Multi-head attention % IEMOCAP, MELD
MMGCN [187] Graph convolutional network % IEMOCAP, MELD
MM-DFN [188] GRU,Graph conventional network % IEMOCAP,MELD
EmoCaps [192] Multi-head attention % IEMOCAP,MELD
GA2MIF [193] Graph attention networks,Multi-head attention % IEMOCAP,MELD
MALN [143] LSTM, cross-attention, Transformer-based model % IEMOCAP, MELD
M2R2 [111] GRU ! IEMOCAP, MELD
TDF-Net [142] CNN,GRU,Transformer ! IEMOCAP
SDT [225] CNN,Transformer % IEMOCAP,MELD
HCT-DMG [226] Cross-modal Transformer % CMU-MOSI, MOSEI,IEMOCAP
MERC COGMEN [49] Graph neural network, Transformer % CMU-MOSI, IEMOCAP
Qiu et al. [217] VATT(Video-Audio-Text Transformer) % CMU-MOSEI
QAP [20] VGG,AlBERT % IEMOCAP,CMU-MOSEI
MPT-HCL [37] Transformer % IEMOCAP,MELD
CMCF-SRNet [83] Transformer,RGCN % IEMOCAP,MELD
SAMGN [189] Graph neural network % IEMOCAP,MELD
M3Net [190] Graph neural network % IEMOCAP,MELD
M3GAT [227] Graph attention network % MELD,MEISD,MESD
IF-MMIN [117] Imagination Module ! IEMOCAP
AMuSE [87] Attention-based model ! IEMOCAP,MELD
DEAN [173] Transformer % MOSI,MOSEI,IEMOCAP
MMIN [109] Transformer ! IEMOCAP,MSP-IMPROV
RLEMO [145] Graph Convolutional Network,GRU % IEMOCAP,MELD
Yao et al. [191] Graph convolutional network % IEMOCAP,MELD
BCFN [228] CNN % CHERMA,CH-SIMS
MM-RBN [85] Transformer % IEMOCAP
AoM [197] BART,CNN % Twitter2015, Twitter2017
JML [148] BERT, ResNet % TRC, Twitter2015, Twitter2017
VLP-MABSA [158] Vision-Language pre-trained model % Twitter2015, Twitter2017
MABSA CMMT [146] cross-attention,self-attention % Twitter-2015,Twitter-2017,Political-Twitter
DTCA [160] RoBERTa, ViT % Twitter2015, Twitter2017
M2DF [229] pre-trained model CLIP % Twitter2015, Twitter2017
MIMN [16] Memory Network % ZOL
KNIT [174] Transformer, % Twitter2015, Twitter2017
MMS2S [199] cross-attention, multi-head attention % MOSEI
MESGN [200] cross-modal transformer % MOSEI
TAILOR [230] Transformer, cross-attention % MOSEI
MMER HHMPN [231] MPNN, multi-head attention % MOSEI
AMP [95] Transformer-based encoder-decoder, mask learning % CMU-MOSEI,NEMu
M3TR [232] CNN, Transformer, cross-attention % MS-COCO,VOC 2007
UniVA [178] VAD model, contrastive learning % MOSEI, M 3 ED

memory networks to supervise the textual and visual information with the given aspect, and learns not only the interactive influences between cross-modality data but also the self-influences in single-modality data. Yang et al. [18] propose a novel generative multimodal prompt (GMP) model for MABSA, which includes the multimodal encoder module and the N-Stream decoders module and performs three MABSA-related tasks with quite a small number of labeled multimodal samples. Liu et al. [157] propose an entity-related unsupervised pre-training with visual prompts for MABSA. Instead of using sentiment-related supervised pre-training, two entity-related unsupervised pre-training tasks are applied and compared, which are targeted at locating the entities in text with the support of visual prompts. Ling et al. [158] propose a task-specific Vision-Language Pre-training framework for MABSA (VLP-MABSA), which is a unified multimodal encoder-decoder architecture for all the pre-training and downstream tasks. Zhang et al. [159] construct a dynamic re-weighting BERT (DR-BERT), built on BERT and designed to learn dynamic aspect-oriented semantics for ABSA. Jin et al. [161] propose a multi-aspect semantic relevance model that takes into account the match between search queries and the title, attribute and image information of items simultaneously. Wang et al. [162] propose an end-to-end MABSA framework with image conversion and noise filtration, which bridges the representation gap between different modalities by translating images into the input space of a pre-trained language model (PLM). Wang et al. [163] propose an adaptive dual graph convolution fusion network (AD-GCFN) for aspect-based sentiment analysis. This model uses two graph convolution networks: one for the semantic layer to learn semantic correlations by an attention mechanism, and the other for the syntactic layer to learn syntactic structure by dependency parsing. Mu et al. [164] propose a novel momentum contrastive learning network (MOCOLNet), which is a unified multimodal encoder-decoder framework for the multimodal aspect-level sentiment analysis task. An end-to-end training manner is proposed to optimize MOCOLNet parameters, which alleviates the problem of scarcity of labelled pre-training data. To fully explore multimodal contents, especially the location information, Zhang et al. [233] design a multimodal interactive network to model the textual and visual information of each aspect. Two memory network-based modules are used to capture intra-modality features, including text-to-aspect alignment and image-to-aspect alignment. Yang et al. [234] propose a multi-grained fusion network with self-distillation (MGFN-SD) to analyze aspect-based sentiment polarity, which can effectively integrate multi-grained representation learning with self-distillation to obtain more representative multimodal features.

4) Multimodal Multi-label Emotion Recognition: A few works on multimodal multi-label emotion recognition leverage pre-trained models to improve model performance. To the best of our knowledge, TAILOR [94] is a novel framework of versatile multimodal learning for multi-label emotion recognition, which adversarially depicts commonality and diversity among multiple modalities. TAILOR adversarially extracts private and common modality representations. Then a BERT-like transformer encoder is devised to gradually fuse these representations in a granularity descent way.

C. Enhanced Knowledge

External knowledge in machine learning and AI refers to information from outside the training dataset, including knowledge bases, text corpora, knowledge graphs, pre-trained models, and expert insights. Integrating this knowledge can improve performance, generalization, interpretability, and robustness to noisy or limited data [235], [236]. Fig. 7 shows the common way of incorporating external knowledge into multimodal affective learning tasks.

Fig. 7. Illustration of enhanced knowledge in multimodal affective computing tasks.

1) Multimodal Sentiment Analysis: In the area of study focused on multimodal sentiment analysis, Rahmani et al. [19] construct an adaptive tree by hierarchically dividing users and utilize an attention-based fusion to transfer cognitive-oriented knowledge within the tree. TETFN [165] is a novel method named text enhanced Transformer fusion network, which learns text-oriented pairwise cross-modal mappings for obtaining effective unified multimodal representations. Zhu et al. [166] propose the sentiment knowledge enhanced attention fusion network (SKEAFN), a novel end-to-end fusion network that enhances multimodal fusion by incorporating additional sentiment knowledge representations from an external knowledge base. Chen et al. [167] try to incorporate sentiment word knowledge into the fusion network to guide the learning of a joint representation of multimodal features.

2) Multimodal Emotion Recognition in Conversation: In the discipline of research related to multimodal emotion recognition in conversation, Fu et al. [168] integrate context modeling, knowledge enrichment, and multimodal (text and audio) learning into a GCN-based architecture. Li et al. [169] propose a decoupled multimodal distillation (DMD) approach that facilitates flexible and adaptive crossmodal knowledge distillation, aiming to enhance the discriminative features of each modality. Sun et al. [170] investigate a multimodal fusion transformer network based on rough set theory, which facilitates the interaction of multimodal information and feature guidance through rough set cross-attention. Wang et al. [171] devise a novel label-guided attentive fusion module to fuse the label-aware text and speech representations, which learns label-enhanced text/speech representations for each utterance via label-token and label-frame interactions.
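To make the knowledge-enhancement pattern shared by the methods above more concrete, the following minimal PyTorch sketch shows one way an external sentiment-knowledge vector (e.g., lexicon-derived features) can guide the fusion of unimodal features. It is an illustrative sketch only, not the implementation of SKEAFN or any other specific model; the module name, feature dimensions, and the simple dot-product attention are assumptions chosen for brevity.

    import torch
    import torch.nn as nn

    class KnowledgeEnhancedFusion(nn.Module):
        # Hypothetical module: fuses unimodal features with an external
        # sentiment-knowledge vector (e.g., lexicon hits) via attention.
        def __init__(self, d_text=768, d_audio=74, d_vision=35,
                     d_know=64, d_model=128, n_classes=7):
            super().__init__()
            self.proj = nn.ModuleDict({
                "text": nn.Linear(d_text, d_model),
                "audio": nn.Linear(d_audio, d_model),
                "vision": nn.Linear(d_vision, d_model),
            })
            self.know_proj = nn.Linear(d_know, d_model)
            self.classifier = nn.Linear(2 * d_model, n_classes)

        def forward(self, text, audio, vision, knowledge):
            # Project each modality to a shared space and stack: (B, 3, d)
            h = torch.stack([self.proj["text"](text),
                             self.proj["audio"](audio),
                             self.proj["vision"](vision)], dim=1)
            k = self.know_proj(knowledge)                                # (B, d)
            # Knowledge-guided attention weights over the three modalities
            scores = torch.softmax((h * k.unsqueeze(1)).sum(-1), dim=1)  # (B, 3)
            fused = (scores.unsqueeze(-1) * h).sum(1)                    # (B, d)
            return self.classifier(torch.cat([fused, k], dim=-1))

The same skeleton extends to MERC or MABSA by swapping the classification head and the source of the knowledge vector.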
3) Multimodal Aspect-based Sentiment Analysis: In the field of research on multimodal aspect-based sentiment analysis, Xu et al. [174] introduce external knowledge, including textual syntax and cross-modal relevancy knowledge, into the Transformer layer, which cuts off the irrelevant connections among textual or cross-modal modalities by using a knowledge-induced matrix. Yang et al. [175] distill visual emotional cues and align them with the textual content to selectively match and fuse with the target aspect in the textual modality. CoolNet [176] is a cross-modal fine-grained alignment and fusion network that aims to boost the performance of visual-language models in seamlessly integrating vision and language information; it transforms an image into a textual caption and a graph structure, then dynamically aligns the semantic and syntactic information from both the input sentence and the generated caption, as well as models the object-level visual features. To strengthen the semantic meanings of image representations, Yu et al. [177] propose to detect a set of salient objects in each image based on a pre-trained Faster R-CNN model, and represent each object by concatenating its hidden representation and associated semantic concepts, followed by an Aspect-Guided Attention layer to learn the relevance of each semantic concept with the guidance of given aspects.

4) Multimodal Multi-label Emotion Recognition: In the area of study focused on multimodal multi-label emotion recognition, Zheng et al. [178] propose to represent each emotion category with the valence-arousal (VA) space to capture the correlation between emotion categories and design a unimodal VA-driven contrastive learning algorithm. CARAT [179] presents contrastive feature reconstruction and aggregation for the MMER task. Specifically, CARAT devises a reconstruction-based fusion mechanism to better model fine-grained modality-to-label dependencies by contrastively learning modal-separated and label-specific features. Zhao et al. [180] propose a novel multimodal multi-label TRansformer (M3TR) learning framework, which embeds the high-level semantics, visual structures and label-wise co-occurrences of multiple modalities into one unified encoding. Li et al. [86] propose a novel multimodal attention graph network with dynamic routing-by-agreement (MAGDRA) for MMER. MAGDRA is able to efficiently fuse graph data with various node and edge types as well as properly learn the cross-modal and temporal interactions between multimodal data without pre-aligning. Zhang et al. [181] propose the heterogeneous hierarchical message passing network (HHMPN), which can simultaneously model the feature-to-label, label-to-label and modality-to-label dependencies via graph message passing.

D. Contextual Information

Context refers to the surrounding words, sentences, or paragraphs that give meaning to a particular word or phrase. Understanding context is crucial for tasks like dialogue systems or sentiment analysis. In a conversation, context includes the history of previous utterances, while for news, it refers to the overall description provided by the entire document. Overall, contextual information helps machines make more accurate predictions. Fig. 8 shows the significance of context information to multimodal affective learning tasks.

Fig. 8. Illustration of context information in multimodal affective computing tasks.

1) Multimodal Sentiment Analysis: In the community of multimodal sentiment analysis, Chauhan et al. [182] employ a context-aware attention module to learn intra-modality interaction among participating modalities through an encoder-decoder structure. Multimodal context integrates the unimodal context: Poria et al. [183] propose a recurrent model with multi-level multiple attentions to capture contextual information among utterances, and introduce attention-based networks for improving both context learning and dynamic feature fusion. Huang et al. [184] propose a novel context-based adaptive multimodal fusion network (CAMFNet) for consecutive frame-level sentiment prediction. Li et al. [185] propose a spatial context extraction block to explore the spatial context by calculating the relationships between feature maps and the higher-level semantic representation in images.

2) Multimodal Emotion Recognition in Conversation: In the realm of research concerning multimodal emotion recognition in conversation, Hu et al. [187] make use of multimodal dependencies effectively and leverage speaker information to model inter-speaker and intra-speaker dependency. Zhang et al. [83] propose a cross-modality context fusion and semantic refinement network (CMCF-SRNet) to solve the limitation of insufficient semantic relationship information between utterances. Zhang et al. [189] construct multiple modality-specific graphs to model the heterogeneity of the multimodal context. Chen et al. [190] propose a GNN-based model that explores multivariate relationships and captures the varying importance of emotion discrepancy and commonality by valuing multi-frequency signals. Zhang et al. [227] propose a multimodal, multi-task interactive graph attention network, termed M3GAT, to simultaneously solve conversational context dependency, multimodal interaction, and multi-task correlation in a unified framework. RL-EMO [145] is a novel reinforcement learning framework for the multimodal emotion recognition task, which combines a reinforcement learning (RL) module to model context at both the semantic and emotional levels. Yao et al. [191] propose a speaker-centric multimodal fusion network for emotion recognition in a conversation, to model intra-modal feature fusion and speaker-centric cross-modal feature fusion.
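The context-modeling recipe shared by the MERC systems above, namely encoding each utterance first and then letting a sequence model propagate the dialogue history, can be sketched as follows. The class name, dimensions, and the choice of a bidirectional GRU are illustrative assumptions rather than a description of any particular published architecture.

    import torch
    import torch.nn as nn

    class ContextAwareERC(nn.Module):
        # Sketch: utterance-level features -> GRU over the dialogue -> emotion logits.
        def __init__(self, d_utt=768, d_ctx=256, n_emotions=7):
            super().__init__()
            self.context_rnn = nn.GRU(d_utt, d_ctx, batch_first=True,
                                      bidirectional=True)
            self.head = nn.Linear(2 * d_ctx, n_emotions)

        def forward(self, utterance_feats):
            # utterance_feats: (batch, n_utterances, d_utt), e.g., fused
            # text/audio/vision vectors produced by unimodal encoders.
            ctx, _ = self.context_rnn(utterance_feats)   # (B, N, 2*d_ctx)
            return self.head(ctx)                        # one prediction per utterance

    # Usage: logits = ContextAwareERC()(torch.randn(2, 12, 768))  # 2 dialogues, 12 utterances each

Speaker-aware variants typically add speaker embeddings to utterance_feats or run separate recurrences per speaker before the shared context encoder.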
3) Multimodal Aspect-based Sentiment Analysis: In the study of multimodal aspect-based sentiment analysis, Yu et al. [160] propose an unsupervised approach which minimizes the Wasserstein distance between both modalities, forcing both encoders to produce more appropriate representations for the final extraction. Xu et al. [194] design and construct a multimodal Chinese product review dataset (MCPR) to support the research of MABSA. Anschutz et al. [195] report the results of an empirical study on how semantic computing can provide insights into user-generated content for domain experts. In addition, this work discussed different image-based aspect retrieval and aspect-based sentiment analysis approaches to handle and structure large datasets. Zhao et al. [196] borrow the idea of Curriculum Learning and propose a multi-grained multi-curriculum denoising framework (M2DF) to adjust the order of training data, so as to obtain more contextual information. Zhou et al. [197] propose an aspect-oriented method (AoM) to detect aspect-relevant semantic and sentiment information. Specifically, an aspect-aware attention module is designed to simultaneously select textual tokens and image blocks that are semantically related to the aspects. Zhao et al. [198] propose a fusion with GCN and SE ResNeXt Network (FGSN), which constructs a graph convolution network on the dependency tree of sentences to obtain the context representation and aspect word representation by using syntactic information and word dependency.

4) Multimodal Multi-label Emotion Recognition: MMS2S [199] is a multimodal sequence-to-set approach to effectively model label dependence and modality dependence. MESGN [200] first proposed this task, and simultaneously models the modality-to-label and label-to-label dependencies. Many works consider the dependencies of multiple labels based on the characteristics of co-occurring labels. Zhao et al. [201] propose a general multimodal dialogue-aware interaction framework, named MDI, to model the impacts of dialogue context on emotion recognition.

VII. DATASETS OF MULTIMODAL AFFECTIVE COMPUTING

In this section, we introduce the benchmark datasets of the MSA, MERC, MABSA, and MMER tasks. To facilitate easy navigation and reference, the details of the datasets are shown in Table III with a comprehensive overview of the studies that we cover.

A. Multimodal Sentiment Analysis

• MOSI [237] contains 2,199 utterance video segments, and each segment is manually annotated with a sentiment score ranging from -3 to +3 to indicate the sentiment polarity and relative sentiment strength of the segment.
• MOSEI [238] is an upgraded version of MOSI, annotated with both sentiment and emotion. MOSEI contains 22,856 movie review clips from YouTube. Each sample in MOSEI includes sentiment annotations ranging from -3 to +3 and multi-label emotion annotations.
• CH-SIMS [239] is a Chinese single- and multimodal sentiment analysis dataset, which contains 2,281 refined video segments in the wild with both multimodal and independent unimodal annotations.
• CH-SIMS v2.0 [242] is an extended version of CH-SIMS that includes more data instances, spanning text, audio and visual modalities. Each modality of a sample is annotated with a sentiment polarity, and the sample is then annotated with a concluded sentiment.
• CMU-MOSEAS [240] is the first large-scale multimodal language dataset for Spanish, Portuguese, German and French; it is collected from YouTube and contains 4,000 samples in total.
• ICT-MMMO [241] is collected from online social review videos that encompass a strong diversity in how people express opinions about movies and include a real-world variability in video recording quality (http://multicomp.ict.usc.edu).
• YouTube [46] collects 47 videos from the social media web site YouTube. Each video contains 3-11 utterances, with most videos having 5-6 utterances in the extracted 30 seconds.

B. Multimodal Emotion Recognition in Conversation

• MELD [243] contains 13,707 video clips of multi-party conversations, with labels following Ekman's six universal emotions, including joy, sadness, fear, anger, surprise and disgust.
• IEMOCAP [244] has 7,532 video clips of multi-party conversations, with labels following Ekman's six universal emotions, including joy, sadness, fear, anger, surprise and disgust.
• HED [245] contains happy, sad, disgust, angry and scared emotion-aligned face, body and text samples, and is much larger than existing datasets. Moreover, the emotion labels were attached to those samples by strictly following a standard psychological paradigm.
• RML [246] collects video samples from eight subjects, speaking six different languages: English, Mandarin, Urdu, Punjabi, Persian, and Italian. This dataset contains 500 video samples, each delivered with one of six particular emotions.
• BAUM-1 [247] contains two sets: the BAUM-1a and BAUM-1s databases. The BAUM-1a database contains clips with expressions of five basic emotions (happiness, sadness, anger, disgust, fear) along with expressions of boredom, confusion (unsure) and interest (curiosity). The BAUM-1s database contains clips reflecting six basic emotions and also expressions of boredom, contempt, confusion, thinking, concentrating, bothered, and neutral.
• MAHNOB-HCI [248] includes 527 facial video recordings of 27 participants engaged in various tasks and interactions, while their physiological signals, such as 32-channel electroencephalogram (EEG) and 3-channel electrocardiogram, were recorded.
• Deap [249] contains data from 32 participants, aged between 19 and 37 (50% female), who were recorded watching 40 one-minute music videos. Each participant was asked to evaluate each video by assigning values from 1 to 9 for arousal, valence, dominance, like/dislike, and familiarity.
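As a practical note on the MOSI/MOSEI annotation scheme listed in Section VII-A, the continuous scores in [-3, +3] are usually discretized before the classification metrics reported in the literature are computed. The snippet below sketches the commonly used convention; the exact binning can differ slightly between papers, so it should be read as an assumption rather than an official specification.

    import numpy as np

    def mosi_score_to_labels(score: float):
        # Continuous MOSI/MOSEI annotation in [-3, +3].
        acc7 = int(np.clip(np.round(score), -3, 3)) + 3        # 7 classes: 0..6
        acc2_neg_nonneg = int(score >= 0)                      # non-negative vs. negative
        acc2_pos_neg = None if score == 0 else int(score > 0)  # positive vs. negative (zeros excluded)
        return acc7, acc2_neg_nonneg, acc2_pos_neg

    print(mosi_score_to_labels(1.4))   # (4, 1, 1)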
TABLE III
LIST OF MULTIMODAL AFFECTIVE COMPUTING DATASETS. T, A, V DENOTE THE TEXT, AUDIO AND VISION MODALITIES, RESPECTIVELY. EMOTION DENOTES THAT THE SAMPLES IN A DATASET ARE LABELED WITH EMOTION CATEGORIES, AND SENTIMENT DENOTES THAT THE SAMPLES ARE LABELED WITH SENTIMENT POLARITY.
Task  | Dataset              | Modalities    | Source                             | Emotion | Sentiment | Language                                         | Datasize
MSA   | CMU-MOSI [237]       | T,A,V         | Video blog, YouTube                | ✗       | ✓         | English                                          | 2,199
MSA   | CMU-MOSEI [238]      | T,A,V         | YouTube                            | ✓       | ✓         | English                                          | 22,856
MSA   | CH-SIMS [239]        | T,A,V         | Movies, TVs                        | ✗       | ✓         | Chinese                                          | 2,281
MSA   | CMU-MOSEAS [240]     | T,A,V         | YouTube                            | ✓       | ✓         | Spanish, Portuguese, German, French              | 4,000
MSA   | ICT-MMMO [241]       | T,A,V         | Reviews                            | -       | -         | -                                                | -
MSA   | YouTube [46]         | T,A,V         | YouTube                            | -       | -         | English                                          | 300
MSA   | CH-SIMS v2.0 [242]   | T,A,V         | TV series, Shows, Movies           | ✗       | ✓         | Chinese                                          | 14,402
MERC  | MELD [243]           | T,A,V         | Friends TV                         | ✓       | ✗         | English                                          | 13,707
MERC  | IEMOCAP [244]        | T,A,V         | Act                                | ✓       | ✗         | English                                          | 7,532
MERC  | HED [245]            | T,V           | Movies, TVs                        | ✓       | ✗         | English                                          | 17,441
MERC  | RML [246]            | A,V           | Video                              | ✓       | ✗         | English, Mandarin, Urdu, Punjabi, Persian, Italian | -
MERC  | BAUM-1 [247]         | A,V           | Data collection                    | ✓       | ✗         | Turkish                                          | 1,184
MERC  | MAHNOB-HCI [248]     | V, EEG        | Data collection                    | ✓       | ✗         | -                                                | -
MERC  | Deap [249]           | EEG           | Act, Data collection               | ✓       | ✓         | Physiological signal                             | -
MERC  | MuSe-CaR [250]       | T,A,V         | Car reviews, YouTube               | ✓       | ✓         | English                                          | -
MERC  | CHEAVD [251]         | A,V           | Movies, TVs                        | -       | -         | Mandarin                                         | 7,030
MERC  | MSP-IMPROV [252]     | T,A,V         | Act                                | ✓       | ✗         | English                                          | 8,438
MERC  | MEISD [253]          | T,A,V         | TVs                                | ✓       | ✓         | English                                          | -
MERC  | MESD [254]           | T,A,V         | TVs                                | ✓       | ✓         | English                                          | 9,190
MERC  | Ulm-TSST [255]       | A,V,EEG       | Job interviews                     | ✓       | ✓         | English                                          | -
MERC  | CHERMA [256]         | T,A,V         | TV series, Shows, Movies           | ✓       | ✓         | Chinese                                          | 28,717
MERC  | AMIGOS [257]         | EEG, ECG, GSR | Data collection                    | ✗       | ✓         | -                                                | -
MABSA | Twitter2015 [258]    | T,V           | Twitter                            | ✗       | ✓         | English                                          | -
MABSA | Twitter2017 [258]    | T,V           | Twitter                            | ✗       | ✓         | English                                          | -
MABSA | MCPR [259]           | T,V           | Product reviews                    | ✗       | ✓         | Chinese                                          | 15,000
MABSA | Multi-ZOL [50]       | T,V           | Product reviews                    | ✗       | ✓         | Chinese                                          | 5,288
MABSA | MACSA [260]          | T,V           | Hotel service reviews              | ✗       | ✓         | Chinese                                          | 21,108
MABSA | MASAD [261]          | T,V           | Visual sentiment ontology datasets | ✗       | ✓         | English                                          | 38,532
MABSA | PanoSent [262]       | T,A,V         | Social media                       | ✗       | ✓         | English, Chinese, Spanish                        | 10,000
MMER  | CMU-MOSEI [238]      | T,A,V         | YouTube                            | ✓       | ✓         | English                                          | 22,856
MMER  | M3ED [201]           | T,A,V         | 56 TVs                             | ✓       | ✗         | Mandarin                                         | 24,449
• MuSe-CaR [250] focuses on the tasks of emotion, emotion-target engagement, and trustworthiness recognition by means of comprehensively integrating the audio-visual and language modalities.
• CHEAVD 2.0 [251] is selected from Chinese movies, soap operas and TV shows, and contains noise in the background to mimic real-world conditions.
• MSP-IMPROV [252] is a multimodal emotional database comprised of spontaneous dyadic interactions, designed to study audiovisual perception of expressive behaviors.
• MEISD [253] is a large-scale balanced multimodal multi-label emotion, intensity, and sentiment dialogue dataset collected from different TV series, with textual, audio, and visual features.
• MESD [254] is the first multimodal and multi-task sentiment, emotion, and desire dataset, which contains 9,190 text-image pairs, with English text.
• Ulm-TSST [255] is a multimodal dataset in which participants were recorded in a stressful situation emulating a job interview, following the TSST protocol.
• CHERMA [256] provides uni-modal labels for each individual modality, and multi-modal labels for all modalities jointly observed. It is collected from various sources, including 148 TV series, 7 variety shows, and 2 movies.
• AMIGOS [257] is collected in two experimental settings. In the first setting, 40 participants viewed 16 short emotional videos. In the second setting, participants watched 4 longer videos, some individually and others in groups. During these sessions, the participants' electroencephalogram (EEG), electrocardiogram (ECG), and galvanic skin response (GSR) signals were recorded using wearable sensors.

C. Multimodal Aspect-based Sentiment Analysis

• Twitter2015 and Twitter2017 were originally provided by the work [258] for multimodal named entity recognition and annotated with the sentiment polarity for each aspect by the work [14].
• MCPR [259] has 2,719 text-image pairs and 610 distinct aspects in total, collected from 1.5k product reviews involving the clothing and furniture departments of the e-commercial platform JD.com. It is the first aspect-based multimodal Chinese product review dataset.
• Multi-ZOL [50] consists of reviews of mobile phones collected from ZOL.com. It contains 5,288 sets of multimodal data points that cover various models of mobile phones from multiple brands. These data points are annotated with a sentiment intensity rating from 1 to 10 for six aspects.
• MACSA [260] contains more than 21K text-image pairs, and provides fine-grained annotations for both textual and
visual content, and firstly uses the aspect category as the pivot to align the fine-grained elements between the two modalities.
• MASAD [261] selects 38,532 samples from a partial VSO visual dataset [263] (approximately 120,000 samples) that can clearly express sentiments and categorizes them into seven domains: food, goods, buildings, animal, human, plant, scenery, with a total of 57 predefined aspects.
• PanoSent [262] is annotated both manually and automatically, featuring high quality, large scale (10,000 dialogues), multimodality (text, image, audio and video), multilingualism (English, Chinese and Spanish), multi-scenarios (over 100 domains), and covering both implicit and explicit sentiment elements.

D. Multimodal Multi-label Emotion Recognition

• CMU-MOSEI [238] contains 22,856 movie review clips from YouTube videos. Each video intrinsically contains three modalities: text, audio, and visual, and each movie review clip is annotated with at least one emotion category of the set: angry, disgust, fear, happy, sad, surprise.
• M3ED [201] is a multimodal emotional dialogue dataset in Chinese, which contains a total of 9,082 turns and 24,449 utterances, and each utterance is annotated with the seven emotion categories (happy, surprise, sad, disgust, anger, fear, and neutral).

VIII. EVALUATION METRICS

In this section, we report the mainstream evaluation metrics for each multimodal affective computing task.

a) Multimodal Sentiment Analysis: Previous works adopt mean absolute error (MAE), Pearson correlation (Corr), seven-class classification accuracy (ACC-7), binary classification accuracy (ACC-2), and the F1 score computed for positive/negative and non-negative/negative classification as evaluation metrics.

b) Multimodal Emotion Recognition in Conversation: Accuracy (ACC) and weighted F1 (WF1) are used for evaluation. Additionally, the imbalanced label distribution results in a phenomenon where the trained model performs better on some categories and poorly on others. In order to verify the impact of data distribution on model performance, researchers also provide ACC and F1 on each emotion category.

c) Multimodal Aspect-based Sentiment Analysis: Following previous methods, for the multimodal aspect term extraction (MATE) and joint multimodal aspect sentiment analysis (JMASA) tasks, researchers use precision (P), recall (R) and micro-F1 (F1) as the evaluation metrics. For the multimodal aspect sentiment classification (MASC) task, accuracy (ACC) and macro-F1 are used as evaluation metrics.

d) Multimodal Multi-label Emotion Recognition: According to prior work, multi-label classification works mostly adopt accuracy (ACC), micro-F1, precision (P) and recall (R) as evaluation metrics.
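The metrics above can be computed with standard tooling. The following sketch, based on NumPy and scikit-learn, illustrates one plausible implementation for the MSA regression setting together with the weighted and micro-averaged F1 variants used by MERC and MMER; the array names and the zero-exclusion rule for ACC-2 mirror common practice but are assumptions rather than a fixed standard.

    import numpy as np
    from sklearn.metrics import accuracy_score, f1_score

    def msa_metrics(y_true, y_pred):
        # y_true, y_pred: NumPy arrays of continuous sentiment scores in [-3, +3]
        mae = np.mean(np.abs(y_true - y_pred))
        corr = np.corrcoef(y_true, y_pred)[0, 1]        # Pearson correlation
        acc7 = accuracy_score(np.round(y_true).clip(-3, 3).astype(int),
                              np.round(y_pred).clip(-3, 3).astype(int))
        nz = y_true != 0                                # positive/negative ACC-2 and F1
        acc2 = accuracy_score(y_true[nz] > 0, y_pred[nz] > 0)
        f1 = f1_score(y_true[nz] > 0, y_pred[nz] > 0, average="weighted")
        return {"MAE": mae, "Corr": corr, "ACC-7": acc7, "ACC-2": acc2, "F1": f1}

    # MERC: weighted F1 over emotion classes; MMER: micro-F1 over multi-hot label vectors
    # wf1  = f1_score(emo_true, emo_pred, average="weighted")
    # mif1 = f1_score(multi_hot_true, multi_hot_pred, average="micro")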
IX. DISCUSSION

In this section, we briefly discuss the works of multimodal affective computing based on facial expressions, acoustic signals, physiological signals, and emotion causes. Furthermore, we discuss the technical routes across multiple multimodal affective computing tasks to track their consistency and differences.

A. Other Multimodal Affective Computing

a) Multimodal Affective Computing Based on Facial Expression Recognition: Facial expression recognition has significantly evolved over the years, progressing from static to dynamic methods. Initially, static facial expression recognition (SFER) relied on single-frame images, utilizing traditional image processing techniques such as Local Binary Patterns (LBP) and Gabor filters to extract features for classification. The advent of deep learning brought Convolutional Neural Networks (CNNs), which markedly improved the accuracy of SFER [264]–[267]. However, static methods were limited in capturing the temporal dynamics of facial expressions [268]. Some methods attempt to approach the problem from a local-global feature perspective, extracting more fine-grained visual representations and identifying key informative segments [269]–[274]. These approaches enhance robustness against noisy frames, enabling uncertainty-aware inference. To further enhance accuracy, recent advancements in DFER focus on integrating multimodal data and employing parameter-efficient fine-tuning (PEFT) to adapt large pre-trained models for enhanced performance [275]–[277], while Liu et al. [278] introduce the concept of expression reenactment (i.e., normalization), harnessing generative AI to mitigate noise in in-the-wild datasets. Moreover, the burgeoning field of evidential deep learning (EDL) has shown considerable promise by enabling explicit uncertainty quantification through distributional measurement in latent spaces for improved interpretability, with demonstrated efficacy in zero-shot learning [279], multi-view classification [280]–[282], video understanding [283]–[285] and multi-modal named entity recognition.

b) Multimodal Affective Computing Based on Acoustic Signal: The single-sentence, single-task model is the most common model in speech emotion recognition. For example, Aldeneh et al. [286] use a CNN to perform convolutions in the time direction of handcrafted temporal features (40-dimensional MFSC) to identify emotion-salient regions and use global max pooling to capture important temporal areas. Li et al. [287] apply two different convolution kernels on spectrograms to extract temporal and frequency domain features, concatenate them, and input them into a CNN for learning, followed by attention-mechanism pooling for classification. Trigeorgis et al. [288] use a CNN for end-to-end learning directly on speech signals, avoiding the problem of feature extraction not being robust for all speakers. Mirsamadi et al. [289] combine Bidirectional LSTM (Bi-LSTM) with a novel pooling strategy, utilizing attention mechanisms to enable the network to focus on emotionally prominent parts of sentences. Zhao et al. [290] consider the temporal and spatial characteristics of the spectrum in the attention mechanism to learn time-related features in spectrograms, and use a CNN to learn frequency-related features in spectrograms. Luo et al. [291] propose a dual-channel speech emotion recognition model that uses a CNN and an RNN to learn from spectrograms on one hand, and separately learns HSF features on the other, finally concatenating the obtained features for classification.

c) Multimodal Affective Computing Based on Physiological Signals: In medical measurements and health monitoring, EEG-based emotion recognition (EER) is one of the most promising directions within emotion recognition and has attracted substantial research attention [292]–[294]. Notably, the field of affective computing has seen nearly 1,000 publications related to EER since 2010 [295]. Numerous EEG-based multimodal emotion recognition (EMER) methods have been proposed [296]–[300], leveraging the complementarity and redundancy between EEG and other physiological signals in expressing emotions. For example, Vazquez et al. [301] address the problem of multimodal emotion recognition from multiple physiological signals, and demonstrate that a Transformer-based approach is suitable for emotion recognition based on physiological signals.

d) Multimodal Affective Computing Based on Emotion Cause: Apart from focusing on the emotions themselves, the capacity of a machine to understand the cause that triggers an emotion is essential for comprehending human behaviors, which makes emotion-cause pair extraction (ECPE) crucial. Over the years, text-based ECPE has made significant progress [302], [303]. Based on ECPE, Li et al. [304] propose multimodal emotion-cause pair extraction (MECPE), which aims to extract emotion-cause pairs with multimodal information. Initially, Li et al. [304] construct a joint training architecture, which contains the main task, i.e., multimodal emotion-cause pair extraction, and two subtasks, i.e., multimodal emotion detection and cause detection. To solve MECPE, researchers borrowed the multitask learning framework to train the model using multiple training objectives of sub-tasks, aiming to enhance the knowledge sharing among them. For example, Li et al. [305] propose a novel model that captures holistic interaction and label constraint (HiLo) features for the MECPE task. HiLo enables cross-modality and cross-utterance feature interactions through various attention mechanisms, providing a strong foundation for accurate cause extraction.

B. Consistency among Multimodal Affective Computing

We categorize the multimodal affective computing tasks into several key areas: multimodal alignment and fusion, multi-task learning, pre-trained models, enhanced knowledge, and contextual information. To ensure clarity, we discuss the consistencies across these aspects.

a) Multimodal alignment and fusion: Among the MSA, MERC, MABSA and MMER tasks, each is fundamentally a multimodal task that involves considering and combining at least two modalities to make decisions. This process includes extracting features from each modality and integrating them into a unified representation vector. In multimodal representation learning, modal alignment and fusion are two critical issues that must be addressed to advance the field of multimodal affective computing. For vision-dominated multimodal tasks such as image captioning [8], [306], the impact of vision is more significant than language. In contrast, multimodal affective computing tasks place a greater emphasis on language [38], [307].

b) Pre-trained model: Generally, pre-trained models are used to encode raw modal information into vectors. From this perspective, multimodal affective computing tasks adopt pre-trained models as the backbone and then fine-tune them for downstream tasks. For example, UniMSE [38] uses T5 as the backbone, while GMP [18] utilizes BART. These approaches aim to transfer the general knowledge embedded in pre-trained language models to the field of affective computing.

c) Enhanced knowledge: Commonsense knowledge encompasses facts and judgments about our natural world. In the field of affective computing, this knowledge is crucial for enabling machines to understand human emotions and their underlying causes. Researchers enhance affective computing by integrating external knowledge sources such as sentiment lexicons [308], English knowledge bases [309]–[313], and Chinese knowledge bases [314].

d) Contextual information: Affective computing tasks require an understanding of contextual information. In MERC, contextual information encompasses the entire conversation, including both previous and subsequent utterances relative to the current utterance. For MABSA, contextual information refers to the full sentence containing customer opinions. Researchers integrate contextual information using hierarchical approaches [315], [316], self-attention mechanisms [58], and graph-based dependency modeling [317], [318]. Additionally, affective computing tasks can enhance understanding by incorporating non-verbal cues such as facial expressions and vocal tone, alongside textual information.

C. Difference among Multimodal Affective Computing

We examine the differences among multimodal affective computing tasks by considering the type of downstream task, sentiment granularity, and application contexts to identify the unique characteristics of each task.

For downstream tasks, MSA predicts sentiment strength as a regression task. MERC is a multi-class classification task for identifying emotion categories. MMER performs multi-label emotion recognition, detecting multiple emotions simultaneously. MABSA involves extracting aspects and opinions to determine sentiment polarity, categorizing it as information extraction. In terms of analysis granularity, MERC and MECPE focus on utterances and speakers within a conversation, while MSA and MMER concentrate on sentence-level information within a document. MABSA, on the other hand, focuses on aspects within comments. Some studies infer fine-grained sentiment from coarse-grained sentiment [209], [319] or integrate tasks of different granularities into a unified training framework [307]. Due to these differences in granularity, the contextual information varies as well. For instance, in MABSA, the context includes the comment along with any associated images and short descriptions of aspects, whereas in MERC, the context encompasses the entire conversation and speaker information. In terms of application scenarios, MSA, MMER, and MABSA are used for public opinion analysis and mining user experiences related to products or services. MERC and MECPE help machines understand and mimic human behaviors, generating empathetic responses in dialogue agents. While many tasks are context-specific, there is a growing trend toward unified frameworks for analyzing human emotions across diverse settings (e.g., task type and emotion granularity) [38], [138].
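Before turning to future directions, the backbone-plus-fusion pattern summarized in Section IX-B (a pre-trained language model as the dominant encoder, with projected acoustic and visual features fused on top and the whole model fine-tuned for the downstream task) can be written down compactly. The sketch below is a hypothetical, simplified module rather than UniMSE or GMP themselves; the backbone name, feature dimensions, and the concatenation-based head are assumptions for illustration.

    import torch
    import torch.nn as nn
    from transformers import AutoModel

    class PretrainedBackboneFusion(nn.Module):
        # Sketch: pre-trained text backbone + linear projections for the
        # non-verbal streams + a small fusion head, fine-tuned end to end.
        def __init__(self, lm_name="bert-base-uncased", d_audio=74, d_vision=35, n_out=1):
            super().__init__()
            self.lm = AutoModel.from_pretrained(lm_name)
            d = self.lm.config.hidden_size
            self.audio_proj = nn.Linear(d_audio, d)
            self.vision_proj = nn.Linear(d_vision, d)
            self.head = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU(), nn.Linear(d, n_out))

        def forward(self, input_ids, attention_mask, audio, vision):
            # [CLS]-style sentence vector from the language backbone
            h_text = self.lm(input_ids=input_ids,
                             attention_mask=attention_mask).last_hidden_state[:, 0]
            fused = torch.cat([h_text, self.audio_proj(audio), self.vision_proj(vision)], dim=-1)
            return self.head(fused)

In practice, the same skeleton is reused across tasks by switching the output head between a regression score (MSA), a softmax over emotion classes (MERC), and independent sigmoid outputs for multi-label prediction (MMER).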
X. FUTURE WORK

We outline directions for future work in multimodal affective computing: unification of multimodal affective computing tasks, transfer learning with external knowledge distillation, and affective computing with less-studied modalities.

a) Unification of Multimodal Affective Computing Tasks: Recent advances have made significant strides by unifying related yet distinct tasks into a single framework [26], [212], [215]. For example, T5 [60] integrates various NLP tasks by representing all text-based problems in a text-to-text format, achieving state-of-the-art results across numerous benchmarks. These studies highlight the effectiveness of such unified frameworks in enhancing model performance and generalization [214], [320]. Meanwhile, this progress also suggests the potential for unifying multimodal affective computing tasks across diverse application scenarios. First, unification across different granularities, from fine to coarse, has been employed to train models effectively [209]. Second, an increasing number of pre-trained models now handle language, vision, and audio simultaneously, enabling end-to-end processing across single, dual, and multiple modalities [321]. Third, integrating emotion-cause analysis with emotion recognition in a multimodal setting within a single architecture can enhance their mutual indications and improve overall performance.

b) Transfer Learning with External Knowledge Distillation: In the field of affective computing, incorporating external knowledge such as sentiment lexicons and commonsense knowledge is crucial for a deeper understanding of emotional expressions within the context of social norms and cultural backgrounds [322]. For example, the expression and perception of emotion also vary across cultures, both in text and in face-to-face communication [323]. These differences are critical for cross-cultural sentiment analysis.

c) Affective Computing with Less-studied Modalities: Natural language (spoken and written), visual data (images and videos), and auditory signals (speech, sound, and music) have long been central to multimodal affective computing. Recently, new sensing data types, such as haptic and ECG signals, are gaining attention [324]. Haptic signals, which involve touch and convey sensory and emotional attributes, enhance user experiences in areas such as gaming, virtual reality, and mobile apps [325]. These signals offer immediate feedback and can improve user engagement. As research progresses, less-studied modalities like haptics will likely become crucial, complementing established methods and advancing the field of affective computing.

XI. CONCLUSION

Multimodal affective computing has emerged as a crucial research direction in artificial intelligence, with significant progress in understanding and interpreting emotions. This survey provides a comprehensive overview of the diverse tasks associated with multimodal affective computing, covering their research background, definitions, related work, technical approaches, benchmark datasets, and evaluation metrics. We group multimodal affective computing works across the MSA, MERC, MABSA and MMER tasks into four categories: multi-task learning, pre-trained models, enhanced knowledge and context information. Additionally, we summarize the consistency and differences among the various affective computing tasks. We also report the inherent challenges in multimodal sentiment analysis and explore potential directions for future research and development.

REFERENCES

[1] Z. Zhu, X. Zhuang, Y. Zhang, D. Xu, G. Hu, X. Wu, and Y. Zheng, "Tfcd: Towards multi-modal sarcasm detection via training-free counterfactual debiasing," in Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, K. Larson, Ed. International Joint Conferences on Artificial Intelligence Organization, 2024, pp. 6687–6695.
[2] Z. Zhu, X. Cheng, G. Hu, Y. Li, Z. Huang, and Y. Zou, "Towards multi-modal sarcasm detection via disentangled multi-grained multi-modal distilling," in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy, N. Calzolari, M. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue, Eds. ELRA and ICCL, 2024, pp. 16 581–16 591.
[3] A. Ben-Ze'ev, The subtlety of emotions. MIT Press, 2001.
[4] R. K. Shelly, "Emotions, sentiments, and performance expectations," in Theory and research on human emotions. Emerald Group Publishing Limited, 2004.
[5] R. J. Davidson, K. R. Sherer, and H. H. Goldsmith, Handbook of affective sciences. Oxford University Press, 2009.
[6] G. A. Ramírez, T. Baltrusaitis, and L. Morency, "Modeling latent discriminative dynamic of multi-dimensional affective signals," in Affective Computing and Intelligent Interaction - Fourth International Conference, ACII 2011, Memphis, TN, USA, October 9-12, 2011, Proceedings, Part II, 2011, pp. 396–406.
[7] D. Jiang, R. Wei, H. Liu, J. Wen, G. Tu, L. Zheng, and E. Cambria, "A multitask learning framework for multimodal sentiment analysis," in 2021 International Conference on Data Mining Workshops (ICDMW). IEEE, 2021, pp. 151–157.
[8] J. Li, R. R. Selvaraju, A. Gotmare, S. R. Joty, C. Xiong, and S. C. Hoi, "Align before fuse: Vision and language representation learning with momentum distillation," in Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan, Eds., 2021, pp. 9694–9705.
[9] N. C. Garcia, P. Morerio, and V. Murino, "Modality distillation with multiple stream networks for action recognition," in Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VIII, ser. Lecture Notes in Computer Science, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds., vol. 11212. Springer, 2018, pp. 106–121.
[10] Y. Ji, H. Liu, B. He, X. Xiao, H. Wu, and Y. Yu, "Diversified multiple instance learning for document-level multi-aspect sentiment classification," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 7012–7023.
[11] P. J. Donnelly and A. Prestwich, "Identifying sentiment from crowd audio," in 7th International Conference on Frontiers of Signal Processing, ICFSP 2022, Paris, France, September 7-9, 2022, 2022, pp. 64–69.
[12] Z. Sun, P. K. Sarma, W. A. Sethares, and Y. Liang, “Learning relation- “Learning transferable visual models from natural language supervi-
ships between text, audio, and video via deep canonical correlation for sion,” in Proceedings of the 38th International Conference on Machine
multimodal language analysis,” in The Thirty-Fourth AAAI Conference Learning, ICML 2021, 18-24 July 2021, Virtual Event, 2021, pp. 8748–
on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative 8763.
Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth [29] H. Bao, W. Wang, L. Dong, Q. Liu, O. K. Mohammed, K. Aggarwal,
AAAI Symposium on Educational Advances in Artificial Intelligence, S. Som, S. Piao, and F. Wei, “Vlmo: Unified vision-language pre-
EAAI 2020, New York, NY, USA, February 7-12, 2020, 2020, pp. 8992– training with mixture-of-modality-experts,” in NeurIPS, 2022.
8999. [30] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and
[13] A. Zadeh, C. Mao, K. Shi, Y. Zhang, P. P. Liang, S. Poria, and Y. Wu, “Coca: Contrastive captioners are image-text foundation mod-
L. Morency, “Factorized multimodal transformer for multimodal se- els,” Trans. Mach. Learn. Res., vol. 2022, 2022.
quential learning,” CoRR, vol. abs/1911.09826, 2019. [31] H. Akbari, L. Yuan, R. Qian, W. Chuang, S. Chang, Y. Cui, and
[14] D. Lu, L. Neves, V. Carvalho, N. Zhang, and H. Ji, “Visual attention B. Gong, “VATT: transformers for multimodal self-supervised learning
model for name tagging in multimodal social media,” in Proceedings from raw video, audio and text,” in Advances in Neural Information
of the 56th Annual Meeting of the Association for Computational Processing Systems 34: Annual Conference on Neural Information
Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual,
1: Long Papers, 2018, pp. 1990–1999. 2021, pp. 24 206–24 221.
[15] A. Gandhi, K. Adhvaryu, S. Poria, E. Cambria, and A. Hussain, “Mul- [32] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe,
timodal sentiment analysis: A systematic review of history, datasets, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer
multimodal fusion methods, applications, challenges and future direc- learning for NLP,” in Proceedings of the 36th International Conference
tions,” Inf. Fusion, vol. 91, pp. 424–444, 2023. on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach,
[16] N. Xu, W. Mao, and G. Chen, “Multi-interactive memory network California, USA, 2019, pp. 2790–2799.
for aspect based multimodal sentiment analysis,” in The Thirty-Third [33] X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts
AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty- for generation,” in Proceedings of the 59th Annual Meeting of the
First Innovative Applications of Artificial Intelligence Conference, Association for Computational Linguistics and the 11th International
IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Joint Conference on Natural Language Processing, ACL/IJCNLP 2021,
Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, C. Zong,
- February 1, 2019, 2019, pp. 371–378. F. Xia, W. Li, and R. Navigli, Eds. Association for Computational
[17] F. Wang, Z. Ding, R. Xia, Z. Li, and J. Yu, “Multimodal emotion-cause Linguistics, 2021, pp. 4582–4597.
pair extraction in conversations,” CoRR, vol. abs/2110.08020, 2021. [34] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du,
[18] X. Yang, S. Feng, D. Wang, Q. Sun, W. Wu, Y. Zhang, P. Hong, and A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot
S. Poria, “Few-shot joint multimodal aspect-sentiment analysis based learners,” arXiv preprint arXiv:2109.01652, 2021.
on generative multimodal prompt,” in Findings of the Association for [35] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhari-
Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, wal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal,
2023, 2023, pp. 11 575–11 589. A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh,
[19] S. Rahmani, S. Hosseini, R. Zall, M. R. Kangavari, S. Kamran, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler,
and W. Hua, “Transfer-based adaptive tree for multimodal sentiment M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish,
analysis based on user latent aspects,” Knowl. Based Syst., vol. 261, p. A. Radford, I. Sutskever, and D. Amodei, “Language models are few-
110219, 2023. shot learners,” in Advances in Neural Information Processing Systems
[20] Z. Li, Y. Zhou, Y. Liu, F. Zhu, C. Yang, and S. Hu, “QAP: 33: Annual Conference on Neural Information Processing Systems
A quantum-inspired adaptive-priority-learning model for multimodal 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle,
emotion recognition,” in Findings of the Association for Computational M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., 2020.
Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, 2023, pp.
[36] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun,
12 191–12 204.
J. Xu, and Z. Sui, “A survey on in-context learning,” arXiv preprint
[21] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng,
arXiv:2301.00234, 2022.
“Multimodal deep learning,” in Proceedings of the 28th International
[37] S. Zou, X. Huang, and X. Shen, “Multimodal prompt transformer with
Conference on Machine Learning, ICML 2011, Bellevue, Washington,
hybrid contrastive learning for emotion recognition in conversation,”
USA, June 28 - July 2, 2011, L. Getoor and T. Scheffer, Eds.
CoRR, vol. abs/2310.04456, 2023.
Omnipress, 2011, pp. 689–696.
[22] W. Rahman, M. K. Hasan, S. Lee, A. B. Zadeh, C. Mao, L. Morency, [38] G. Hu, T. Lin, Y. Zhao, G. Lu, Y. Wu, and Y. Li, “Unimse: Towards
and M. E. Hoque, “Integrating multimodal information in large pre- unified multimodal sentiment analysis and emotion recognition,” in
trained transformers,” in Proceedings of the 58th Annual Meeting of Proceedings of the 2022 Conference on Empirical Methods in Nat-
the Association for Computational Linguistics, ACL 2020, Online, July ural Language Processing, EMNLP 2022, Abu Dhabi, United Arab
5-10, 2020, 2020, pp. 2359–2369. Emirates, December 7-11, 2022, 2022, pp. 7837–7851.
[23] Y. Zhang and Q. Yang, “A survey on multi-task learning,” IEEE Trans. [39] Y. Zhang, X. Yang, X. Xu, Z. Gao, Y. Huang, S. Mu, S. Feng, D. Wang,
Knowl. Data Eng., vol. 34, no. 12, pp. 5586–5609, 2022. Y. Zhang, K. Song et al., “Affective computing in the era of large
[24] Y. Xie, K. Yang, C. Sun, B. Liu, and Z. Ji, “Knowledge-interactive language models: A survey from the nlp perspective,” arXiv preprint
network with sentiment polarity intensity-aware multi-task learning arXiv:2408.04638, 2024.
for emotion recognition in conversations,” in Findings of the Asso- [40] B. Pan, K. Hirota, Z. Jia, and Y. Dai, “A review of multimodal emotion
ciation for Computational Linguistics: EMNLP 2021, Virtual Event / recognition from datasets, preprocessing, features, and fusion methods,”
Punta Cana, Dominican Republic, 16-20 November, 2021, M. Moens, Neurocomputing, vol. 561, p. 126866, 2023.
X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computa- [41] K. Ezzameli and H. Mahersia, “Emotion recognition from unimodal to
tional Linguistics, 2021, pp. 2879–2889. multimodal analysis: A review,” Inf. Fusion, vol. 99, p. 101847, 2023.
[25] W. Zheng, J. Yu, R. Xia, and S. Wang, “A facial expression-aware [42] Z. WANG, X. ZHANG, J. CUI, S.-B. HO, and E. CAMBRIA, “A
multimodal multi-task learning framework for emotion recognition in review of chinese sentiment analysis: Subjects, methods, and trends.”
multi-party conversations,” in Proceedings of the 61st Annual Meeting [43] A. Gandhi, K. Adhvaryu, S. Poria, E. Cambria, and A. Hussain, “Mul-
of the Association for Computational Linguistics (Volume 1: Long timodal sentiment analysis: A systematic review of history, datasets,
Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, 2023, pp. multimodal fusion methods, applications, challenges and future direc-
15 445–15 459. tions,” Information Fusion, vol. 91, pp. 424–444, 2023.
[26] Z. Chen, L. Chen, B. Chen, L. Qin, Y. Liu, S. Zhu, J. Lou, and [44] L. Zhu, Z. Zhu, C. Zhang, Y. Xu, and X. Kong, “Multimodal sentiment
K. Yu, “Unidu: Towards A unified generative dialogue understanding analysis based on fusion methods: A survey,” Inf. Fusion, vol. 95, pp.
framework,” CoRR, vol. abs/2204.04637, 2022. 306–325, 2023.
[27] H. Luo, L. Ji, B. Shi, H. Huang, N. Duan, T. Li, X. Chen, and M. Zhou, [45] T. Thongtan and T. Phienthrakul, “Sentiment classification using doc-
“Univilm: A unified video and language pre-training model for mul- ument embeddings trained with cosine similarity,” in Proceedings of
timodal understanding and generation,” CoRR, vol. abs/2002.06353, the 57th Conference of the Association for Computational Linguistics,
2020. ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 2: Student
[28] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, Research Workshop, F. Alva-Manchego, E. Choi, and D. Khashabi, Eds.
G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, Association for Computational Linguistics, 2019, pp. 407–414.
[46] L. Morency, R. Mihalcea, and P. Doshi, “Towards multimodal sentiment E. Grave, and G. Lample, “Llama: Open and efficient foundation
analysis: harvesting opinions from the web,” in Proceedings of the language models,” CoRR, vol. abs/2302.13971, 2023.
13th International Conference on Multimodal Interfaces, ICMI 2011, [62] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei,
Alicante, Spain, November 14-18, 2011, 2011, pp. 169–176. N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher,
[47] A. G. A. and V. Vetriselvi, “Survey on multimodal approaches to C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes,
emotion recognition,” Neurocomputing, vol. 556, p. 126693, 2023. J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn,
[48] Y. Sun, N. Yu, and G. Fu, “A discourse-aware graph neural network S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa,
for emotion recognition in multi-party conversation,” in Findings of I. Kloumann, A. Korenev, P. S. Koura, M. Lachaux, T. Lavril, J. Lee,
the Association for Computational Linguistics: EMNLP 2021, Virtual D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra,
Event / Punta Cana, Dominican Republic, 16-20 November, 2021, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi,
M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan,
Computational Linguistics, 2021, pp. 2949–2958. B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov,
[49] A. Joshi, A. Bhat, A. Jain, A. V. Singh, and A. Modi, “COGMEN: Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic,
contextualized GNN based multimodal emotion recognition,” CoRR, S. Edunov, and T. Scialom, “Llama 2: Open foundation and fine-tuned
vol. abs/2205.02455, 2022. chat models,” CoRR, vol. abs/2307.09288, 2023.
[50] N. Xu, W. Mao, and G. Chen, “Multi-interactive memory network for [63] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with
aspect based multimodal sentiment analysis,” in Proceedings of the selective state spaces,” CoRR, vol. abs/2312.00752, 2023.
AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. [64] Y. Gong, Y. Chung, and J. R. Glass, “AST: audio spectrogram
371–378. transformer,” in 22nd Annual Conference of the International Speech
[51] Z. Chen and T. Qian, “Transfer capsule network for aspect level Communication Association, Interspeech 2021, Brno, Czechia, August
sentiment classification,” in Proceedings of the 57th Conference of 30 - September 3, 2021, H. Hermansky, H. Cernocký, L. Burget,
the Association for Computational Linguistics, ACL 2019, Florence, L. Lamel, O. Scharenborg, and P. Motlı́cek, Eds. ISCA, 2021, pp.
Italy, July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen, 571–575.
D. R. Traum, and L. Màrquez, Eds. Association for Computational [65] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for
Linguistics, 2019, pp. 547–556. convolutional neural networks,” in Proceedings of the 36th Interna-
[52] H. Yan, J. Dai, T. Ji, X. Qiu, and Z. Zhang, “A unified generative tional Conference on Machine Learning, ICML 2019, 9-15 June 2019,
framework for aspect-based sentiment analysis,” in Proceedings of the Long Beach, California, USA, ser. Proceedings of Machine Learning
59th Annual Meeting of the Association for Computational Linguistics Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR,
and the 11th International Joint Conference on Natural Language 2019, pp. 6105–6114.
Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual [66] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli, Eds. T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly,
Association for Computational Linguistics, 2021, pp. 2416–2429. J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words:
[53] C. Li, F. Gao, J. Bu, L. Xu, X. Chen, Y. Gu, Z. Shao, Q. Zheng, Transformers for image recognition at scale,” in 9th International
N. Zhang, Y. Wang, and Z. Yu, “Sentiprompt: Sentiment knowledge Conference on Learning Representations, ICLR 2021, Virtual Event,
enhanced prompt-tuning for aspect-based sentiment analysis,” CoRR, Austria, May 3-7, 2021. OpenReview.net, 2021.
vol. abs/2109.08306, 2021. [67] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal,
[54] P. Yang, X. Sun, W. Li, S. Ma, W. Wu, and H. Wang, “SGM: sequence G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable