Recent Trends of Multimodal Affective Computing: A Survey From NLP Perspective
Recent Trends of Multimodal Affective Computing: A Survey From NLP Perspective
8, AUGUST 2021 1
Abstract—Multimodal affective computing (MAC) has gar- of affective computing predominantly focused on unimodal
nered increasing attention due to its broad applications in tasks, examining text-based, audio-based, and vision-based
analyzing human behaviors and intentions, especially in text- affective computing separately. For instance, D-MILN [9] is a
arXiv:2409.07388v1 [cs.CL] 11 Sep 2024
of multimodal affective computing leverage these parameter- affective computing tasks, the incorporation of external knowl-
efficient transfer learning methods to transfer knowledge edge, and affective computing with less-studied modalities.
from pre-trained models (e.g., unimodal pre-trained model Lastly, Section XI concludes this survey and its contribution
or multimodal pre-trained model) to downstream affective to multimodal affective computing community.
tasks to improve model performance by further fine-tuning
the pre-trained model. For instance, Zou et al. [36] design III. M ULTIMODAL A FFECTIVE C OMPUTING TASKS
a multimodal prompt Transformer (MPT) to perform cross- In this section, we show the definition of each task and
modal information fusion. UniMSE [37] proposes an adapter- discuss their application scenarios. Table I presents basic
based modal fusion method, which injects acoustic and visual information, including task input, output, type, and parent task,
signals into the T5 model to fuse them with multi-level textual for each of the four tasks.
information.
Multimodal affective computing encompasses tasks like A. Multimodal Sentiment Analysis
sentiment analysis, opinion mining, and emotion recognition
Multimodal sentiment analysis (MSA) [41] origins from
using modalities such as text, audio, images, video, physiolog-
sentiment analysis (SA) task [42] and it extends SA with the
ical signals, and haptic feedback. This survey focuses mainly
multimodal input. As a key research topics for computers to
on three key modalities: natural language, visual signals, and
understand human behaviors, the goal of multimodal sentiment
vocal signals. We highlight four main tasks in this survey:
analysis (MSA) is to predict sentiment polarity and sentiment
Multimodal Sentiment Analysis (MSA), Multimodal Emotion
intensity based on multimodal signals [43]. This task belongs
Recognition in Conversation (MERC), Multimodal Aspect-
to binary classification and regression task.
Based Sentiment Analysis (MABSA), and Multimodal Multi-
1) Task Formalization: Given a multimodal signal Ii =
label Emotion Recognition (MMER). A considerable volume
{Iit , Iia , Iiv }, we use Iim , m ∈ {t, a, v} to represent unimodal
of studies exists in the field of multimodal affective computing,
raw sequence drawn from the video fragment i, where {t, a, v}
and several reviews have been published [14], [38]–[40].
denote the three types of modalities—text, acoustic and visual.
However, these reviews primarily focus on specific affective
Multimodal sentiment analysis aims to predict the real number
computing tasks or specific single modality and overlook an
yir ∈ R, where yir ∈ [−3, 3] reflects the sentiment strength. We
overview of multimodal affective computing across multiple
feed Ii as the model input and train a model to predict yir .
tasks, and the consistencies and differences among these tasks.
2) Application Scenarios: We categorize multimodal senti-
The goal of this survey is twofold. First, this survey aims
ment analysis applications into key areas: social media mon-
to provide a comprehensive overview of multimodal affective
itoring, customer feedback, market research, content creation,
computing for beginners exploring deep learning in emotion
healthcare, and product reviews. For example, analyzing senti-
analysis, detailing tasks, inputs, outputs, and relevant datasets.
ment in text, images, and videos on social media helps gauge
Second, it also offers insights for researchers to reflect on past
public opinion and monitor brand perception, while analyzing
developments, explore future trends, and examine technical
multimedia product reviews can improve personalized recom-
approaches, challenges, and research directions in areas such
mendations and user satisfaction.
as multimodal sentiment analysis and emotion recognition.
TABLE I
T HE DETAILS OF MULTIMODAL AFFECTIVE TASKS . T,A,V DENOTE TEXT, AUDIO AND VISUAL MODALITIES RESPECTIVELY.
Gong et al. [61] propose audio spectrogram Transformer in Vision Language Models (VLMs) like GPT-4V [65] and
(AST), which converted waveform into a sequence of 128- Flamingo [67] allows the models to interpret and generate
dimensional log Mel filterbank (fbank) features to encode outputs based on combined visual and textual inputs. In con-
audio modality. trast with prompt, instruction-tuning belongs to the learning
c) Vision Feature Extractor: For image modality, re- paradigm of prompt. Also, models like InstructBLIP [70]
searchers can extract fixed T frames from each segment and FLAN [72] have demonstrated that instruction-tuning not
and use effecientNet [62] pre-trained (supervised) on VG- only improves the model’s adherence to instructions but also
Gface3 and AFEW dataset as vision initial representation. enhances its ability to generalize across tasks. In the commu-
Furthermore, Dosovitskiy et al. [63] propose to use standard nity of multimodal affective computing, researchers can lever-
Transformer directly to images, which split an image into age these parameter-efficient transfer learning methods (e.g.,
patches and provide the sequence of linear embeddings of adapter, prompt and instruction tuning) to transfer knowledge
these patches as an input to a Transformer. CLIP [64] jointly from pre-trained models (e.g., unimodal pre-trained model or
trained image and its caption with the contrastive learning, multimodal pre-trained model) to downstream affective tasks,
thereby extraction vision features that correspond to texts. further tune the pre-trained model with the affective dataset.
d) Multimodal Feature Extractor: The emergence of Considering that multimodal affective computing involves
multimodal pre-trained model (MPM) marks a significant ad- multimodal learning, therefore, we analyze multimodal affec-
vancement in integrating multimodal signals, as demonstrated tive computing works from multimodal fusion and multimodal
by groundbreaking developments like GPT-4 [65] and Gem- alignment, as shown in Fig. 1.
ini [66]. Among the open-source innovations, Flamingo [67]
represents an early effort to integrate visual features with
B. Multimodal Fusion
LLMs using cross-attention layers. BLIP-2 [68] introduces a
trainable adaptor module (Q-Former) that efficiently connects Multimodal signals are heterogeneous and derived from
a pre-trained image encoder with a pre-trained LLM, ensuring various information sources, making integrating multimodal
precise alignment of visual and textual information. Similarly, signals into one representation essential. Tasi et al. [74]
MiniGPT-4 [69] achieves visual and textual alignment through summarize multimodal fusion into early, late or intermediate
a linear projection layer. InstructBLIP [70] advances the field fusion based on the fusion stage. Early fusion combines
by focusing on vision-language instruction tuning, building features from different modalities at the input level before
upon BLIP-2, and requiring a deeper understanding and larger the model processes them. Late fusion processes features
datasets for effective training. LLaVA [71] integrates CLIP’s from different modalities separately through individual sub-
image encoder with LLaMA’s language decoder to enhance networks, and the outputs of these sub-networks are combined
instruction tuning capabilities. Akbari et al. [30] train VATT at a later stage, typically just before making the final decision.
end-to-end from scratch using multimodal contrastive losses Late fusion uses unimodal decision values and combines them
and evaluate its performance by the downstream tasks of using mechanisms such as averaging [121], voting schemes
video action recognition, audio event classification, image [122], weighting based on channel noise [123] and signal
classification, and text-to-video retrieval. Based on multimodal variance [124], or a learned model [6], [125]. The two fusion
pre-trained model, raw modal signals can be used to extract strategies face some problems. For example, early fusion at
modal features. the feature level can underrate intra-modal dynamics after
the fusion operation, while late fusion at the decision level
V. M ULTIMODAL L EARNING ON M ULTIMODAL A FFECTIVE may struggle to capture inter-modal dynamics before the
C OMPUTING fusion operation. Different from the previous two methods by
combining features from different modalities at intermediate
Multimodal learning involves learning representations from layers of the model learner, Intermediate fusion allows for
different modalities. Generally, the multimodal model should more interaction between the modalities at different processing
first align the modalities based on their semantics before fusing stages, potentially leading to richer representations [37], [126],
multimodal signals. After alignment, the model combines [127]. Based on these fusion strategies, we review multimodal
multiple modalities into one representation vector. fusion from three aspects: cross-modality learning, modal
consistency and difference, and multi-stage modal fusion. Fig.
A. Preliminary 2 illustrates the three aspects of modal fusion.
With the scaling of the pre-trained model, parameter- 1) Cross-modality Learning: Cross-modality learning fo-
efficient transfer learning emerges such as adapter [31], cuses on the incorporation of inter-modality dependencies and
prompt [32], instruction-tuning [33] and in-context learn- interactions for better modal fusion in representation learn-
ing [34], [35]. In this paradigm, instead of adapting pre- ing. Early works of multimodal fusion [73] mainly operate
trained LMs to downstream tasks via objective engineering, geometric manipulation in the feature spaces to fuse multiple
downstream tasks are reformulated to look more like those modalities. The recent common way of cross-modality learn-
solved during the original LM training with the help of prompt, ing is to introduce attention-based learning method to model
instruction-tuning and in-context learning. The use of prompts inter-modality and intra-modality interactions. For example,
MuLT [74] proposes multimodal Transformer to learn inter-
3 https://www.robots.ox.ac.uk/ vgg/software/vgg face/. modal interaction. Chen et al. [75] augment the inter-intra
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 5
TFN [73], MuLT [74], TCDN [75], CM-BERT [76], HGraph-CL [77], BAFN [78], TeFNA [79],
Cross-modal Learning
CMCF-SRNet [80], MultiEMO [81], MM-RBN [82], MAGDRA [83], AMuSE [84].
Multimodal Learning on
Multimodal Modal Consistency MMIM [85], MPT [86], MMMIE [87], MISA [88], CoolNet [89], ModalNet [90], MAN [84], TAILOR [91],
Fusion (§V-B) and Difference AMP [92], STCN [93].
Affective Computing
Multi-stage TSCL-FHFN [94], HFFN [95], CLMLF [96], RMFN [97], CTFN [98], MCM [99], FmlMSN [100],
Modal Fusion ScaleVLAD [101], MUG [102], HFCE [103], MTAG [104], CHFusion [105].
MMIN [106], CMAL [107], M2R2 [108], EMMR [109], TFR-Net [110], MRAN [111], VIGAN [112],
Miss Modality
TATE [113], IF-MMIN [114], CTFN [98], MTMSA [115], FGR [116], MMTE+AMMTD [117].
Multimodal
Alignment (§V-C)
Semantic Alignment MuLT [74], ScaleVLAD [101], Robust-MSA [118], HGraph-CL [77], SPIM [119], MA-CMU-SGRNet [120].
Fig. 1. Taxonomy of multimodal affective computing from multimodal fusion and multimodal alignment.
image encoder
vision audio
vision
audio encoder vision
text
audio audio
..felt a bit text encoder
common
frustrated text modal consistency
text cross-modality and difference multi-stage fusion
Fig. 2. Illustration of multimodal fusion from following aspects: 1) cross-modality modal fusion, 2) modal fusion based on modal consistency and difference
and 3) multi-stage modal fusion.
modal features with trimodal collaborative interaction and ponents. Modal consistency helps handle missing modalities,
unifies the characteristics of the three modals (inter-modal). while modal difference leverages complementary informa-
Yang et al. [76] propose the cross-modal BERT (CM-BERT), tion from each modality to improve overall data understand-
aiming to model the interaction of text and audio modality ing. For example, several works [86], [87] have explored
based on pre-trained BERT model. Lin et al. [77] explore the learning modal consistency and difference using contrastive
intricate relations of intra- and inter-modal representations for learning. Han et al. [85] maximized the mutual information
sentiment extraction. More recently, Tang et al. [78] propose between modalities and between each modality to explore the
the multimodal dynamic enhanced block to capture the intra- modal consistency. Another study [86] proposes a hybrid con-
modality sentiment context, which decrease the intra-modality trastive learning framework that performs intra-/inter-modal
redundancy of auxiliary modalities. Huang et al. [79] propose contrastive learning and semi-contrastive learning simultane-
a Text-centered fusion network with cross-modal attention ously, models cross-modal interactions, preserves inter-class
(TeFNA), a multimodal fusion network that uses crossmodal relationships, and reduces the modality gap. Additionally,
attention to model unaligned multimodal timing information. Zheng et al. [87] combined mutual information maximization
In the community of emotion recognition, CMCF-SRNet [80] between modal pairs with mutual information minimization
is a cross-modality context fusion and semantic refinement net- between input data and corresponding features. This method
work, which contains a cross-modal locality-constrained trans- aims to extract modal-invariant and task-related information.
former and a graph-based semantic refinement transformer, Modal consistency can also be viewed as the process of
aiming to explore the multimodal interaction and dependencies projecting multiple modalities into a common latent space
among utterances. Shi et al. [81] propose an attention-based (modality-invariant representation), while modal difference
correlation-aware multimodal fusion framework MultiEMO, refers to projecting modalities into modality-specific repre-
which captures cross-modal mapping relationships across sentation spaces. For example, Hazarika et al. [88] propose
textual, audio and visual modalities based on bidirectional a method that projects each modality into both a modality-
multi-head cross attention layers. In summary, cross-modality invariant and a modality-specific space. They implemented
learning mainly focuses on modeling the relation between a decoder to reconstruct the original modal representation
modalities. using both modality-invariant and modality-specific features.
2) Modal Consistency and Difference: Modal consistency AMuSE [84] proposes a multimodal attention network to
refers to the shared feature space across different modali- capture cross-modal interactions at various levels of spatial
ties for the same sample, while modal difference highlights abstraction by jointly learning its interactive bunch of mode-
the unique information each modality provides. Most multi- specific peripheral and central networks. For the fine-grain
modal fusion approaches separate representations into modal- sentiment analysis, Xiao et al. [89] present CoolNet to boost
invariant (consistency) and modal-specific (difference) com- the performance of visual-language models in seamlessly
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 6
C. Multimodal Alignment
(a)
Multimodal alignment involves synchronizing modal se-
mantics before fusing multimodal data. A key challenge is
handling missing modalities, which can occur due to issues
like a camera being turned off, a user being silent, or device
mask errors affecting both voice and text. Since the assumption of
always having all modalities is often unrealistic, multimodal
alignment must address these gaps. Additionally, it involves
aligning objects across images, text, and audio through se-
mask mantic alignment. Thus, we discuss multimodal alignment in
terms of managing missing modalities and achieving semantic
rustrated ..felt a bit frustrated alignment. Fig. 3 illustrates the multimodal alignment.
1) Alignment for Missing Modality: In real-world scenar-
ios, data collection can sometimes result in the simultaneous
(b)
loss of certain modalities due to unforeseen events. While mul-
Fig. 3. Illustration multimodal alignment:(a) semantic alignment and (b) timodal affective computing typically assumes the availability
alignment with missing modal fragments.
of all modalities, this assumption often fails in practice, which
can cause issues in modal fusion and alignment models when
some modalities are missing. We classify existing methods for
integrating vision and language information. Zhang et al. [90] handling missing modalities into four groups.
propose an aspect-level sentiment classification model by The first group features the data augmentation approach,
exploring modal consistency with fusion discriminant attention which randomly ablates the inputs to mimic missing modality
network. cases. Parthasarathy et al. [107] propose a strategy to randomly
3) Multi-stage Modal Fusion: Multi-stage multimodal fu- ablate visual inputs during training at the clip or frame level
sion [128], [129] refers to combine modal information ex- to mimic real world scenarios. Wang et al. [108] deal with
tracted from multiple stages or multiple scales to fuse modal the utterance-level modalities missing problem by training
representation. Li et al. [94] design a two-stage contrastive emotion recognition model with iterative data augmentation by
learning task, which learns similar features for data with the learned common representation. The second group is based on
same emotion category and learns distinguishable features for generative methods to directly predict the missing modalities
data with different emotion categories. HFFN [95] divides given the available modalities [131]. For example, Zhao et
the procession of multimodal fusion into divide, conquer and al. [106] propose a missing modality imagination network
combine, which learns local interactions at each local chunk (MMIN), which can predict the representation of any missing
and explores global interactions by conveying information modality given available modalities under different missing
across local interactions. Different from the work of HFFN, modality conditions, so as to to deal with the uncertain missing
Li et al. [96] align and fused the token-level features of text modality problem. Zeng et al. [109] propose an ensemble-
and image and designed label based contrastive learning and based missing modality reconstruction (EMMR) network to
data based contrastive learning to capture common features detect and recover semantic features of the key missing
related to sentiment in multimodal data. There are some modality. Yuan et al. [110] propose a transformer-based fea-
work [97] decomposed the fusion procession into multiple ture reconstruction network (TFR-Net), which improves the
stages, each of them focused on a subset of multimodal signals robustness of models for the random missing in non-aligned
for specialized, effective fusion. Also, CTFN [130] presents a modality sequences. Luo et al. [111] propose the multimodal
novel feature fusion strategy that proceeds in a hierarchical reconstruction and align net (MRAN) to tackle the missing
fashion, first fusing the modalities two in two and only then modality problem, especially to relieve the decline caused by
fusing all three modalities. Moreover, the modal fusion at the text modality’s absence.
multiple levels has made progress, such as Li et al. [99] The third group aims to learn the joint multimodal rep-
propose a multimodal sentiment analysis method based on resentations that can contain related information from these
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 7
modalities [132]. For example, Ma et al. [133] propose a sentations for the final extraction to align text and image.
unified deep learning framework to efficiently handle missing Lai et al. [119] propose a deep modal shared information
labels and missing modalities for audio-visual emotion recog- learning module based on the covariance matrix to capture the
nition through correlation analysis. Zeng et al. [113] propose shared information between modalities. Additionally, we use
a tag-assisted Transformer encoder (TATE) network to handle a label generation module based on a self- supervised learning
the problem of missing uncertain modalities, which designs strategy to capture the private information of the modalities.
a tag encoding module to cover both the single modality and Our module is plug-and-play in multimodal tasks, and by
multiple modalities missing cases, so as to guide the network’s changing the parameterization, it can adjust the information
attention to those missing modalities. Zuo et al. [114] propose exchange relationship between the modes and learn the private
to use invariant features for a missing modality imagination or shared information between the specified modes. We also
network (IF-MMIN), which includes an invariant feature learn- employ a multi-task learning strategy to help the model focus
ing strategy and an invariant feature based imagination module its attention on the modal differentiation training data. For
(IF-IM). Through the two strategies, IF-MMIN can alleviate model robustness, Robust-MSA [118] present an interactive
the modality gap during the missing modalities prediction, thus platform that visualizes the impact of modality noise to help
improving the robustness of multimodal joint representation. researchers improve model capacity.
Zhou et al. [116] propose a novel brain tumor segmentation
network in the case of missing one or more modalities. The VI. M ODELS ACROSS M ULTIMODAL A FFECTIVE
proposed network consists of three sub-networks: a feature- C OMPUTING
enhanced generator, a correlation constraint block and a seg- In the community of multimodal affective computing, the
mentation network. The last group is translation-base methods. works appear to significant consistency in term of development
Tang et al. [98] propose the coupled-translation fusion network technical route. For clarity, we group the these works based
(CTFN) to model bi-direction interplay via couple learning, on multitask learning, pre-trained model, enhanced knowledge,
ensuring the robustness in respect to missing modalities. Liu contextual information. Meanwhile, we briefly summarized the
et al. [115] propose a modality translation-based MSA model advancements of MSA, MERC, MABSA and MMER tasks
(MTMSA), which is robust to uncertain missing modalities. In through the above four aspects. Fig. 4 summarizes the typical
summary, the works about alignment for miss modality focus works of multimodal affective computing from these aspects
on miss modality reconstruction and learning based on the and Table II shows the taxonomy of multimodal affective
available modal information. computing.
2) Alignment for Cross-modal Semantics: Semantic align-
ment aims to find the connection between multiple modalities
in one sample, which refers to searching one modal informa- A. Multitask Learning
tion through another modal information and vice versa. In the Multitask learning trains a model on multiple related tasks
filed of MSA, Tsai et al. [74] leverage cross-modality and simultaneously, using shared information to enhance perfor-
multi-scale modal alignment to implement the modal consis- mance. The loss function combines losses from all tasks,
tency in the semantic aspects, respectively. ScaleVLAD [200] with model parameters updated via gradient descent. In mul-
proposes a fusion model to gather multi-Scale representation timodal affective computing, multitask learning helps distin-
from text, video, and audio with shared vectors of locally guish between modal-invariant and modal-specific features and
aggregated descriptors to improve unaligned multimodal senti- integrates emotion-related sub-tasks into a unified framework.
ment analysis. Yang et al. [104] convert unaligned multimodal Fig. 5 shows the learning paradigm of multitask learning in
sequence data into a graph with heterogeneous nodes and multimodal affective learning task.
edges that captures the rich interactions across modalities and 1) Multimodal Sentiment Analysis: In filed of multi-
through time. Lee et al. [201] segment the audio and the under- modal sentiment analysis, Self-MM [134] generates a pseudo-
lying text signals into equal number of steps in an aligned way label [205]–[207] for single modality and then jointly train
so that the same time steps of the sequential signals cover the unimodal and multimodal representations based on the gener-
same time span in the signals. Zong et al. [202] exploit multi- ated and original labels. Furthermore, a translation framework
ple bi-direction translations, leading to double multimodal fus- ARGF between modalities, i.e., translating from one modality
ing embeddings compared with traditional translation methods. to another is used as an auxiliary task to regualize the
Wang et al. [203] propose a multimodal encoding–decoding multimodal representation learning [135]. Akhtar et al. [136]
translation network with a transformer and adopted a joint en- leverage the interdependence of the tasks sentiment and emo-
coding–decoding method with text as the primary information tion to improve the model performance on two tasks. Chen et
and sound and image as the secondary information. Zhang al. [137] propose a video-based cross-modal auxiliary network
et al. [120] propose a novel multi-level alignment to bridge (VCAN), which is comprised of an audio features map module
the gap between acoustic and lexical modalities, which can and a cross-modal selection module to make use of auxiliary
effectively contrast both the instance-level and prototype-level information. Zheng et al. [138] propose a disentanglement
relationships, separating the multimodal features in the latent translation network (DTN) with slack reconstruction to cap-
space. Yu et al. [204] propose an unsupervised approach which ture desirable information properties, obtain a unified feature
minimizes the Wasserstein distance between both modalities, distribution and reduce redundancy. Zheng et al. [87] com-
forcing both encoders to produce more appropriate repre- bine mutual information maximization (MMMIE) between
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 8
MSA (§VI-A1) Self-MM [134], ARGF [135], MultiSE [136], VCAN [137], DTN [138], MMMIE [87], MMIM [85], MISA [88],
FacialMMT [24], MMMIE [87], AuxEmo [139], TDFNet [140], MALN [141], LGCCT [142], MultiEMO [81],
MERC (§VI-A2)
Multitask RLEMO [143],
Learning (§VI-A)
MABSA (§VI-A3) CMMT [144], AbCoRD [145], JML [146], MPT [36], MMRBN [82],
MMER (§VI-A4) AMP [92], MEGLN-LDA [147], MultiSE [136], AMP [92].
MAG-XLNet [21], UniMSE [37], AOBERT [148], SKESL [149], TEASAL [150], TO-BERT [151], SPT [152],
MSA (§VI-B1)
ALMT [153]
Multimodal Affective Computing
MERC (§VI-B2) FacialMMT [24], QAP [19], UniMSE [37], GraphSmile [154],
Pre-trained
Model (§VI-B)
MIMN [24], GMP [17], ERUP [155], VLP-MABSA [156], DR-BERT [157], DTCA [158], MSRA [159],
MABSA (§VI-B3)
AOF-ABSA [160], AD-GCFN [161], MOCOLNet [162],
MSA (§VI-C1) TETFN [163], ITP [18], SKEAFN [164], SAWFN [165], MTAG [104],
Enhanced MERC (§VI-C2) ConSK-GCN [166], DMD [167], MRST [168], SF [169], TGMFN [170], RLEMO [143], DEAN [171],
Knowledge
(§VI-C) MABSA (§VI-C3) KNIT [172], FITE [173], CoolNet [174], HIMT [175],
MMER (§VI-C4) UniVA-RoBERTa [176], CARAT [177], M3TR [178], MAGDRA [83], HHMPN [179],
MuLT [74], CIA [180], CAT-LSTM [181], CAMFNet [182], MTAG [104], CTNet [183], ScaleVLAD [101],
MSA (§VI-D1)
MMML [184], GFML [184], CHFusion [105],
CMCF-SRNet [80], MMGCN [185], MM-DFN [186], SAMGN [187], M3Net [188], M3GAT [185], RL-EMO [143],
Contextual MERC (§VI-D2)
SCMFN [189], EmoCaps [190], GA2MIF [191], MALN [141], COGMEN [46],
Information
(§VI-D)
MABSA (§VI-D3) DTCA [158], MCPR [192], Elbphilharmonie [193], M2DF [194], AoM [195], FGSN [196], MIMN [15],
Fig. 4. Taxonomy of multimodal affective computing works from aspects multitask learning, pre-trained model, enhanced knowledge and contextual information.
TABLE II
TAXONOMY OF MULTIMODAL AFFECTIVE COMPUTING TASK FROM TASKS OF MSA, MERC, MABSA AND MMER.
3) Multimodal Aspect-based Sentiment Analysis: In the video segments in the wild with both multimodal and
stduy of multimodal aspect-based sentiment analysis, Yu et independent unimodal annotations.
al. [158] propose an unsupervised approach which minimizes • CH-SIMS v2.0 [238] is an extended version of CH-
the Wasserstein distance between both modalities, forcing both SIMS that includes more data instances, spanning text,
encoders to produce more appropriate representations for the audio and visual modalities. Each modality of sample
final extraction. Xu et al. [192] design and construct a multi- is annotated with sentiment polarity, and then sample is
modal Chinese product review dataset (MCPR) to support the annotated with a concluded sentiment.
research of MABSA. Anschutz et al. [193] report the results of • CMU-MOSEAS [236] is the first large-scale multimodal
an empirical study on how semantic computing can provide in- language dataset for Spanish, Portuguese, German and
sights into user-generated content for domain experts. In addi- French, and it is collect from YouTube and its samples
tion, this work discussed different image-based aspect retrieval are 4,000 in total.
and aspect-based sentiment analysis approaches to handle and • ICT-MMMO [237] is collected from online social review
structure large datasets. Zhao et al. [194] borrow the idea videos that encompass a strong diversity in how people
of Curriculum Learning and propose a multi-grained multi- express opinions about movies and include a real-world
curriculum denoising Framework (M2DF) to adjust the order variability in video recording quality4 .
of training data, so as to obtain more contextual information. • YouTube [43] collects 47 videos from the social media
Zhou et al. [195] propose an aspect-oriented method (AoM) web site YouTube. Each video contains 3-11 utterances
to detect aspect-relevant semantic and sentiment information. with most videos having 5-6 utterances in the extracted
Specifically, an aspect-aware attention module is designed to 30 seconds.
simultaneously select textual tokens and image blocks that are
semantically related to the aspects. Zhao et al. [196] propose B. Multimodal Emotion Recognition in Conversation
a fusion with GCN and SE ResNeXt Network (FGSN), which • MELD [239] contains 13,707 video clips of multi-party
constructs a graph convolution network on the dependency tree conversations, with labels following Ekman’s six univer-
of sentences to obtain the context representation and aspects sal emotions, including joy, sadness, fear, anger, surprise
words representation by using syntactic information and word and disgust.
dependency. • IEMOCAP [240] has 13,707 video clips of multi-party
4) Multimodal Multi-label Emotion Recognition: conversations, with labels following Ekman’s six univer-
MMS2S [197] is a multimodal sequence-to-set approach to sal emotions, including joy, sadness, fear, anger, surprise
effectively model label dependence and modality dependence. and disgust.
MESGN [198] firstly proposes this task, which simultaneously • HED [241] contains happy, sad, disgust, angry and scared
models the modality-to-label and label-to-label dependencies. emotion-aligned face, body and text samples, which are
Many works consider the dependencies of multi-label based much larger than existing datasets. Moreover, the emotion
on the characteristics of co-occurrence labels. Zhao et labels were correspondingly attached to those samples by
al. [199] propose a general multimodal dialogue-aware strictly following a standard psychological paradigm.
interaction framework, named by MDI, to model the impacts • RML [242] collects video samples from eight subjects,
of dialogue context on emotion recognition. speaking six different languages. The six languages are
English, Mandarin, Urdu, Punjabi, Persian, and Italian.
VII. DATASETS OF M ULTIMODAL A FFECTIVE C OMPUTING This dataset contains 500 video samples, each delivered
with one of the six particular emotions.
In this section, we introduce the benchmark datasets of • BAUM-1 [243] contains two sets: BAUM-1a and BAUM-
MSA, MERC, MABSA, and MMER tasks. To facilitate easy 1s databases. BAUM-1a database contains clips con-
navigation and reference, the details of datasets are shown in taining expressions of five basic emotions (happiness,
Table III with a comprehensive overview of the studies that sadness, anger, disgust, fear) along with expressions
we cover. of boredom, confusion (unsure) and interest (curiosity).
BAUM-1s database contains clips reflecting six basic
A. Multimodal Sentiment Analysis emotions and also expressions of boredom, contempt,
confusion, thinking, concentrating, bothered, and neutral.
• MOSI [233] contains 2,199 utterance video segments, • MAHNOB-HCI [244] includes 527 facial video record-
and each segment is manually annotated with a sentiment ings of 27 participants engaged in various tasks and
score ranging from -3 to +3 to indicate the sentiment interactions, while their physiological signals such as 32-
polarity and relative sentiment strength of the segment. channel electroencephalogram (EEG), 3-channel electro-
• MOSEI [234] is an upgraded version of MOSI, anno- cardiogram.
tated with both sentiment and emotion. MOSEI contains • Deap [245] contains data from 32 participants, aged
22,856 movie review clips from YouTube. Each sample between 19 and 37 (50% female), who were recorded
in MOSEI includes sentiment annotations ranging from watching 40 one-minute music videos. Each participant
-3 to +3 and multi-label emotion annotations. was asked to evaluate each video by assigning values
• CH-SIMS [235] is a Chinese single- and multimodal
sentiment analysis dataset, which contains 2,281 refined 4 http://multicomp.ict.usc.edu
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 14
TABLE III
L IST OF MULTIMODAL AFFECTIVE COMPUTING DATASETS . T,A,V DENOTE THE TEXT, AUDIO AND VISION MODALITIES , RESPECTIVELY. E MOTION
DENOTES THE SAMPLE IN DATASET IS LABELED WITH EMOTION CATEGORY AND SENTIMENT DENOTES THE SAMPLE IN DATESET IS LABELED WITH
SENTIMENT POLARITY.
from 1 to 9 for arousal, valence, dominance, like/dislike, • AMIGOS [253] is collected in two experimental set-
and familiarity. tings. In the first setting, 40 participants viewed 16
• MuSe-CaR [246] focuses on the tasks of emotion, short emotional videos. In the second setting, participants
emotion-target engagement, and trustworthiness recogni- watched 4 longer videos, some individually and others
tion by means of comprehensively integrating the audio- in groups. During these sessions, participants’ phys-
visual and language modalities. iological signals—Electroencephalogram (EEG), Elec-
• CHEAVD 2.0 [247] is selected from Chinese movies, trocardiogram (ECG), and Galvanic Skin Response
soap operas and TV shows, which contains noise in the (GSR)—were recorded using wearable sensors.
background to mimic real-world conditions.
• MSP-IMPROV [248] is a multimodal emotional
database comprised of spontaneous dyadic interactions, C. Multimodal Aspect-based Sentiment Analysis
designed to study audiovisual perception of expressive • Twitter2015 and Twitter2017 are originally provided by
behaviors. the work [254] for multimodal named entity recognition
• MEISD [249] is a large-scale balanced multimodal multi- and annotated with the sentiment polarity for each aspect
label emotion, intensity, and sentiment dialogue dataset by the work [13].
(MEISD) collected from different TV series that has • MCPR [255] has 2,719 text-image pairs and 610 distinct
textual, audio, and visual features. aspects in total, which collects 1.5k product reviews
• MESD [250] is the first multimodal and multi-task senti- involving clothing and furniture departments, from the e-
ment, emotion, and desire dataset, which contains 9,190 commercial platform JD.com. It is the first aspect-based
text-image pairs, with English text. multimodal Chinese product review dataset.
• Ulm-TSST [251] is a multimodal dataset, where partic- • Multi-ZOL [47] consists of reviews of mobile phones
ipants were recorded in a stressful situation emulating a collected from ZOL.com. It contains 5,288 sets of multi-
job interview, following the TSST protocol. modal data points that cover various models of mobile
• CHERMA [252] provides uni-modal labels for each indi- phones from multiple brands. These data points are
vidual modality, and multi-modal labels for all modalities annotated with a sentiment intensity rating from 1 to 10
jointly observed. It is collected from various source, for six aspects.
including 148 TV series, 7 variety shows, and 2 movies. • MACSA [256] contains more than 21K text-image pairs,
and provides fine-grained annotations for both textual and
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 15
visual content and firstly uses the aspect category as the IX. D ISCUSS
pivot to align the fine-grained elements between the two In this section, we briefly discuss the works of multimodal
modalities. affective computing based on facial expression, acoustic sig-
• MASAD [257] selects 38,532 samples from a partial nal, physiological signals, and emotion cause. Furthermore, we
VSO visual dataset [259] (approximately 120,000 sam- discuss the technical routes across multiple multimodal affec-
ples) that can clearly express sentiments and categorized tive computing tasks to track their consistency and difference.
them into seven domains: food, goods, buildings, animal,
human, plant, scenery, with a total of 57 predefined
aspects. A. Other Multimodal Affective Computing
• PanoSent [258] is annotated both manually and au- a) Multimodal Affective Computing Based on Facial
tomatically, featuring high quality, large scale (10,000 Expression Recognition: Facial expression recognition has
dialogues), multimodality (text, image, audio and video), significantly evolved over the years, progressing from static to
multilingualism (English, Chinese and Spanish), multi- dynamic methods. Initially, static facial expression recognition
scenarios (over 100 domains), and covering both im- (SFER) relied on single-frame images, utilizing traditional
plicit&explicit sentiment elements. image processing techniques such as Local Binary Patterns
(LBP) and Gabor filters to extract features for classification.
The advent of deep learning brought Convolutional Neural
D. Multimodal Multi-label Emotion Recognition Networks (CNNs), which markedly improved the accuracy
• CMU-MOSEI [234] contains 22,856 movie review clips of SFER [260]–[263]. However, static methods were lim-
from Youtube videos. Each video intrinsically contains ited in capturing the temporal dynamics of facial expres-
three modalities: text, audio, and visual, and each movie sions [264]. Some methods attempt to approach the problem
review clip is annotated with at least one emotion cate- from a local-global feature perspective, extracting more fine-
gory of the set: angry, disgust, fear, happy, sad, surprise. grained visual representations and identifying key informative
• M3 ED [199] is a multimodal emotional dialogue dataset segments [265]–[270]. These approaches enhance robustness
in Chinese, which contains a total of 9,082 turns and against noisy frames, enabling uncertainty-aware inference. To
24,449 utterances, and each utterance is annotated with further enhance accuracy, recent advancements in DFER focus
the seven emotion categories (happy, surprise, sad, dis- on integrating multimodal data and employing parameter-
gust, anger, fear, and neutral). efficient fine-tuning (PEFT) to adapt large pre-trained models
for enhanced performance. Liu et al. [271] introduces the
concept of expression reenactment (i.e. normalization), har-
VIII. E VALUATION M ETRICS nessing generative AI to mitigate noise in in-the-wild datasets.
Moreover, the burgeoning evidential deep learning (EDL) has
In this section, we report the mainstream evaluation metrics shown considerable promise by enabling explicit uncertainty
for each multimodal affective computing task. quantification through the distributional measurement in la-
a) Multimodal Sentiment Analysis: Previous works adopt tent spaces for improved interpretability, with demonstrated
mean absolute error (MAE), Pearson correlation (Corr), seven- efficacy in zero-shot learning [272], multi-view classifica-
class classification accuracy (ACC-7), binary classification tion [273]–[275], video understanding [276]–[278] and multi-
accuracy (ACC-2) and F1 score computed for positive/negative modal named entity recognition.
and non-negative/negative classification as evaluation metrics. b) Multimodal Affective Computing Based on Acoustic
b) Multimodal Emotion Recognition in a conversation: Signal: The model based on single-sentence single-task is
Accuracy (ACC) and weighted F1 (WF1) are used for evalua- the most common model in speech emotion recognition. For
tion. Additionally, the imbalance label distribution results in a example, Aldeneh et al. [279] use CNN to perform convolu-
phenomenon that the trained model performs better on some tions in the time direction of handcrafted temporal features
categories and perform poorly on others. In order to verify the (40-dimensional MFSC) to identify emotion-salient regions
impacts of data distribution on model performance, researchers and used global max pooling to capture important temporal
also provide ACC and F1 on each emotion category to measure areas. Li et al. [280] apply two different convolution kernels
the model performance. on spectrograms to extract temporal and frequency domain
c) Multimodal Aspect-based Sentiment Analysis: With features, concatenated them, and input them into a CNN
the previous methods, for multimodal aspect term extrac- for learning, followed by attention mechanism pooling for
tion (MATE) and joint multimodal aspect sentiment analysis classification. Trigeorgis et al. [281] use CNN for end-to-end
(JMASA) tasks, researchers use precision (P), recall (R) and learning directly on speech signals, avoiding the problem of
micro-F1 (F1) as the evaluation metrics. For the multimodal feature extraction not being robust for all speakers. Mirsamadi
aspect sentiment classification(MASC) task, accuracy (ACC) et al. [282] combine Bidirectional LSTM (Bi-LSTM) with
and macro-F1 are as evaluation metrics. a novel pooling strategy, utilizing attention mechanisms to
d) Multimodal Multi-label Emotion Recognition: Ac- enable the network to focus on emotionally prominent parts of
cording to the prior work, multi-label classification works sentences. Zhao et al. [283] consider the temporal and spatial
mostly adopt accuracy (ACC), micro-F1, precision (P) and characteristics of the spectrum in the attention mechanism to
recall (R) as evaluation metrics. learn time-related features in spectrograms, and using CNN
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 16
to learn frequency-related features in spectrograms. Luo et tasks such as image captioning [7], [299], the impact of vision
al. [284] propose a dual-channel speech emotion recognition is more significant than language. In contrast, multimodal
model that uses CNN and RNN to learn from spectrograms affective computing tasks place a greater emphasis on lan-
on one hand, and separately learns HSFs features on the other, guage [37], [300].
finally concatenating the obtained features for classification. b) Pre-trained model: Generally, pre-trained models are
c) Multimodal Affective Computing Based on Physiolog- used to encode raw modal information into vectors. From this
ical Signals: In medical measurements and health monitoring, perspective, multimodal affective computing tasks adopt pre-
EEG-based emotion recognition (EER) is one of the most trained models as the backbone and then fine-tune them for
promising directions within emotion recognition and has at- downstream tasks. For example, UniMSE [37] uses T5 as the
tracted substantial research attention [285]–[287]. Notably, the backbone, while GMP [17] utilizes BART. These approaches
field of affective computing has seen nearly 1,000 publica- aim to transfer the general knowledge embedded in pre-trained
tions related to EER since 2010 [288]. Numerous EEG-based language models to the field of affective computing.
multimodal emotion recognition (EMER) methods have been c) Enhanced knowledge: Commonsense knowledge en-
proposed [289]–[293], leveraging the complementarity and re- compasses facts and judgments about our natural world. In
dundancy between EEG and other physiological signals in ex- the field of affective computing, this knowledge is crucial for
pressing emotions. For example, Vazquez et al. [294] address enabling machines to understand human emotions and their
the problem of multimodal emotion recognition from multiple underlying causes. Researchers enhance affective computing
physiological signal, which demonstrates Transformer-based by integrating external knowledge sources such as sentiment
approach is suitable for emotion recognition based on physi- lexicons [301], English knowledge bases [302]–[306], and
ological signal. Chinese knowledge bases [307] as the external knowledge to
d) Multimodal Affective Computing Based on Emotion enhance affective computing.
Cause: Apart from focusing on the emotions themselves, the d) Contextual information: Affective computing tasks
capacity of machine for understanding the cause that triggers require an understanding of contextual information. In MERC,
an emotion is essential for comprehending human behaviors, contextual information encompasses the entire conversation,
which makes emotion-cause pair extraction (ECPE) crucial. including both previous and subsequent utterances relative to
Over the years, text-based ECPE has made significant progress the current utterance. For MABSA, contextual information
[295], [296]. Based on ECPE, Li et al. [297] propose multi- refers to the full sentence containing customer opinions. Re-
modal emotion-cause pair extraction (MECPE), which aims searchers integrate contextual information using hierarchical
to extract emotion-cause pairs with multimodal information. approaches [308], [309], self-attention mechanisms [55], and
Initially, Li et al. [297] construct a joint training architecture, graph-based dependency modeling [310], [311]. Additionally,
which contains the main task, i.e., multimodal emotion-cause affective computing tasks can enhance understanding by incor-
pair extraction and two subtasks, i.e., multimodal emotion porating non-verbal cues such as facial expressions and vocal
detection and cause detection. To solve MECPE, researchers tone, alongside textual information.
borrowed the multitask learning framework to train the model
using multiple training objectives of sub-tasks, aiming to
C. Difference among Multimodal Affective Computing
enhance the knowledge sharing among them. For example,
Li et al. [298] propose a novel model that captures holistic We examine the differences among multimodal affective
interaction and label constraint (HiLo) features for the MECPE computing tasks by considering the type of downstream tasks,
task. HiLo enables cross-modality and cross-utterance feature sentiment granularity, and application contexts to identify the
interactions through various attention mechanisms, providing unique characteristics of each task.
a strong foundation for accurate cause extraction. For downstream tasks, MSA predicts sentiment strength as
a regression task. MERC is a multi-class classification task for
identifying emotion categories. MMER performs multilabel
B. Consistency among Multimodal Affective Computing emotion recognition, detecting multiple emotions simultane-
We categorize the multimodal affective computing tasks ously. MABSA involves extracting aspects and opinions to
into several key areas: multimodal alignment and fusion, determine sentiment polarity, categorizing it as information ex-
multi-task learning, pre-trained models, enhanced knowledge, traction. In terms of analysis granularity, MERC and MECPE
and contextual information. To ensure clarity, we discuss the focus on utterances and speakers within a conversation, while
consistencies across these aspects. MSA and MMER concentrate on sentence-level information
a) Multimodal alignment and fusion: Among MSA, within a document. MABSA, on the other hand, focuses
MERC, MABSA and MMER tasks, each is fundamentally a on aspects within comments. Some studies infer fine-grained
multimodal task that involves considering and combining at sentiment from coarse-grained sentiment [207], [312] or in-
least two modalities to make decisions. This process includes tegrate tasks of different granularities into a unified training
extracting features from each modality and integrating them framework [300]. Due to these differences in granularity,
into a unified representation vector. In multimodal represen- the contextual information varies as well. For instance, in
tation learning, modal alignment and fusion are two critical MABSA, the context includes the comment along with any
issues that must be addressed to advance the field of multi- associated images and short descriptions of aspects, whereas
modal affective computing. For vision-dominated multimodal in MERC, the context encompasses the entire conversation
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 17
EAAI 2020, New York, NY, USA, February 7-12, 2020, 2020, pp. 8992– [28] H. Bao, W. Wang, L. Dong, Q. Liu, O. K. Mohammed, K. Aggarwal,
8999. S. Som, S. Piao, and F. Wei, “Vlmo: Unified vision-language pre-
[12] A. Zadeh, C. Mao, K. Shi, Y. Zhang, P. P. Liang, S. Poria, and training with mixture-of-modality-experts,” in NeurIPS, 2022.
L. Morency, “Factorized multimodal transformer for multimodal se- [29] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and
quential learning,” CoRR, vol. abs/1911.09826, 2019. Y. Wu, “Coca: Contrastive captioners are image-text foundation mod-
[13] D. Lu, L. Neves, V. Carvalho, N. Zhang, and H. Ji, “Visual attention els,” Trans. Mach. Learn. Res., vol. 2022, 2022.
model for name tagging in multimodal social media,” in Proceedings [30] H. Akbari, L. Yuan, R. Qian, W. Chuang, S. Chang, Y. Cui, and
of the 56th Annual Meeting of the Association for Computational B. Gong, “VATT: transformers for multimodal self-supervised learning
Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume from raw video, audio and text,” in Advances in Neural Information
1: Long Papers, 2018, pp. 1990–1999. Processing Systems 34: Annual Conference on Neural Information
[14] A. Gandhi, K. Adhvaryu, S. Poria, E. Cambria, and A. Hussain, “Mul- Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual,
timodal sentiment analysis: A systematic review of history, datasets, 2021, pp. 24 206–24 221.
multimodal fusion methods, applications, challenges and future direc- [31] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe,
tions,” Inf. Fusion, vol. 91, pp. 424–444, 2023. A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer
[15] N. Xu, W. Mao, and G. Chen, “Multi-interactive memory network learning for NLP,” in Proceedings of the 36th International Conference
for aspect based multimodal sentiment analysis,” in The Thirty-Third on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach,
AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty- California, USA, 2019, pp. 2790–2799.
First Innovative Applications of Artificial Intelligence Conference, [32] X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts
IAAI 2019, The Ninth AAAI Symposium on Educational Advances in for generation,” in Proceedings of the 59th Annual Meeting of the
Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 Association for Computational Linguistics and the 11th International
- February 1, 2019, 2019, pp. 371–378. Joint Conference on Natural Language Processing, ACL/IJCNLP 2021,
[16] F. Wang, Z. Ding, R. Xia, Z. Li, and J. Yu, “Multimodal emotion-cause (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, C. Zong,
pair extraction in conversations,” CoRR, vol. abs/2110.08020, 2021. F. Xia, W. Li, and R. Navigli, Eds. Association for Computational
Linguistics, 2021, pp. 4582–4597.
[17] X. Yang, S. Feng, D. Wang, Q. Sun, W. Wu, Y. Zhang, P. Hong, and
[33] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du,
S. Poria, “Few-shot joint multimodal aspect-sentiment analysis based
A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot
on generative multimodal prompt,” in Findings of the Association for
learners,” arXiv preprint arXiv:2109.01652, 2021.
Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14,
2023, 2023, pp. 11 575–11 589. [34] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhari-
wal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal,
[18] S. Rahmani, S. Hosseini, R. Zall, M. R. Kangavari, S. Kamran,
A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh,
and W. Hua, “Transfer-based adaptive tree for multimodal sentiment
D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler,
analysis based on user latent aspects,” Knowl. Based Syst., vol. 261, p.
M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish,
110219, 2023.
A. Radford, I. Sutskever, and D. Amodei, “Language models are few-
[19] Z. Li, Y. Zhou, Y. Liu, F. Zhu, C. Yang, and S. Hu, “QAP: shot learners,” in Advances in Neural Information Processing Systems
A quantum-inspired adaptive-priority-learning model for multimodal 33: Annual Conference on Neural Information Processing Systems
emotion recognition,” in Findings of the Association for Computational 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle,
Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, 2023, pp. M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., 2020.
12 191–12 204. [35] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun,
[20] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, J. Xu, and Z. Sui, “A survey on in-context learning,” arXiv preprint
“Multimodal deep learning,” in Proceedings of the 28th International arXiv:2301.00234, 2022.
Conference on Machine Learning, ICML 2011, Bellevue, Washington, [36] S. Zou, X. Huang, and X. Shen, “Multimodal prompt transformer with
USA, June 28 - July 2, 2011, L. Getoor and T. Scheffer, Eds. hybrid contrastive learning for emotion recognition in conversation,”
Omnipress, 2011, pp. 689–696. CoRR, vol. abs/2310.04456, 2023.
[21] W. Rahman, M. K. Hasan, S. Lee, A. B. Zadeh, C. Mao, L. Morency, [37] G. Hu, T. Lin, Y. Zhao, G. Lu, Y. Wu, and Y. Li, “Unimse: Towards
and M. E. Hoque, “Integrating multimodal information in large pre- unified multimodal sentiment analysis and emotion recognition,” in
trained transformers,” in Proceedings of the 58th Annual Meeting of Proceedings of the 2022 Conference on Empirical Methods in Nat-
the Association for Computational Linguistics, ACL 2020, Online, July ural Language Processing, EMNLP 2022, Abu Dhabi, United Arab
5-10, 2020, 2020, pp. 2359–2369. Emirates, December 7-11, 2022, 2022, pp. 7837–7851.
[22] Y. Zhang and Q. Yang, “A survey on multi-task learning,” IEEE Trans. [38] Y. Zhang, X. Yang, X. Xu, Z. Gao, Y. Huang, S. Mu, S. Feng, D. Wang,
Knowl. Data Eng., vol. 34, no. 12, pp. 5586–5609, 2022. Y. Zhang, K. Song et al., “Affective computing in the era of large
[23] Y. Xie, K. Yang, C. Sun, B. Liu, and Z. Ji, “Knowledge-interactive language models: A survey from the nlp perspective,” arXiv preprint
network with sentiment polarity intensity-aware multi-task learning arXiv:2408.04638, 2024.
for emotion recognition in conversations,” in Findings of the Asso- [39] B. Pan, K. Hirota, Z. Jia, and Y. Dai, “A review of multimodal emotion
ciation for Computational Linguistics: EMNLP 2021, Virtual Event / recognition from datasets, preprocessing, features, and fusion methods,”
Punta Cana, Dominican Republic, 16-20 November, 2021, M. Moens, Neurocomputing, vol. 561, p. 126866, 2023.
X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computa- [40] K. Ezzameli and H. Mahersia, “Emotion recognition from unimodal to
tional Linguistics, 2021, pp. 2879–2889. multimodal analysis: A review,” Inf. Fusion, vol. 99, p. 101847, 2023.
[24] W. Zheng, J. Yu, R. Xia, and S. Wang, “A facial expression-aware [41] L. Zhu, Z. Zhu, C. Zhang, Y. Xu, and X. Kong, “Multimodal sentiment
multimodal multi-task learning framework for emotion recognition in analysis based on fusion methods: A survey,” Inf. Fusion, vol. 95, pp.
multi-party conversations,” in Proceedings of the 61st Annual Meeting 306–325, 2023.
of the Association for Computational Linguistics (Volume 1: Long [42] T. Thongtan and T. Phienthrakul, “Sentiment classification using doc-
Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, 2023, pp. ument embeddings trained with cosine similarity,” in Proceedings of
15 445–15 459. the 57th Conference of the Association for Computational Linguistics,
[25] Z. Chen, L. Chen, B. Chen, L. Qin, Y. Liu, S. Zhu, J. Lou, and ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 2: Student
K. Yu, “Unidu: Towards A unified generative dialogue understanding Research Workshop, F. Alva-Manchego, E. Choi, and D. Khashabi, Eds.
framework,” CoRR, vol. abs/2204.04637, 2022. Association for Computational Linguistics, 2019, pp. 407–414.
[26] H. Luo, L. Ji, B. Shi, H. Huang, N. Duan, T. Li, X. Chen, and M. Zhou, [43] L. Morency, R. Mihalcea, and P. Doshi, “Towards multimodal sentiment
“Univilm: A unified video and language pre-training model for mul- analysis: harvesting opinions from the web,” in Proceedings of the
timodal understanding and generation,” CoRR, vol. abs/2002.06353, 13th International Conference on Multimodal Interfaces, ICMI 2011,
2020. Alicante, Spain, November 14-18, 2011, 2011, pp. 169–176.
[27] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, [44] A. G. A. and V. Vetriselvi, “Survey on multimodal approaches to
G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, emotion recognition,” Neurocomputing, vol. 556, p. 126693, 2023.
“Learning transferable visual models from natural language supervi- [45] Y. Sun, N. Yu, and G. Fu, “A discourse-aware graph neural network
sion,” in Proceedings of the 38th International Conference on Machine for emotion recognition in multi-party conversation,” in Findings of
Learning, ICML 2021, 18-24 July 2021, Virtual Event, 2021, pp. 8748– the Association for Computational Linguistics: EMNLP 2021, Virtual
8763. Event / Punta Cana, Dominican Republic, 16-20 November, 2021,
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 19
M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan,
Computational Linguistics, 2021, pp. 2949–2958. B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov,
[46] A. Joshi, A. Bhat, A. Jain, A. V. Singh, and A. Modi, “COGMEN: Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic,
contextualized GNN based multimodal emotion recognition,” CoRR, S. Edunov, and T. Scialom, “Llama 2: Open foundation and fine-tuned
vol. abs/2205.02455, 2022. chat models,” CoRR, vol. abs/2307.09288, 2023.
[47] N. Xu, W. Mao, and G. Chen, “Multi-interactive memory network for [60] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with
aspect based multimodal sentiment analysis,” in Proceedings of the selective state spaces,” CoRR, vol. abs/2312.00752, 2023.
AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. [61] Y. Gong, Y. Chung, and J. R. Glass, “AST: audio spectrogram
371–378. transformer,” in 22nd Annual Conference of the International Speech
[48] Z. Chen and T. Qian, “Transfer capsule network for aspect level Communication Association, Interspeech 2021, Brno, Czechia, August
sentiment classification,” in Proceedings of the 57th Conference of 30 - September 3, 2021, H. Hermansky, H. Cernocký, L. Burget,
the Association for Computational Linguistics, ACL 2019, Florence, L. Lamel, O. Scharenborg, and P. Motlı́cek, Eds. ISCA, 2021, pp.
Italy, July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen, 571–575.
D. R. Traum, and L. Màrquez, Eds. Association for Computational [62] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for
Linguistics, 2019, pp. 547–556. convolutional neural networks,” in Proceedings of the 36th Interna-
[49] H. Yan, J. Dai, T. Ji, X. Qiu, and Z. Zhang, “A unified generative tional Conference on Machine Learning, ICML 2019, 9-15 June 2019,
framework for aspect-based sentiment analysis,” in Proceedings of the Long Beach, California, USA, ser. Proceedings of Machine Learning
59th Annual Meeting of the Association for Computational Linguistics Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR,
and the 11th International Joint Conference on Natural Language 2019, pp. 6105–6114.
Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual [63] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli, Eds. T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly,
Association for Computational Linguistics, 2021, pp. 2416–2429. J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words:
[50] C. Li, F. Gao, J. Bu, L. Xu, X. Chen, Y. Gu, Z. Shao, Q. Zheng, Transformers for image recognition at scale,” in 9th International
N. Zhang, Y. Wang, and Z. Yu, “Sentiprompt: Sentiment knowledge Conference on Learning Representations, ICLR 2021, Virtual Event,
enhanced prompt-tuning for aspect-based sentiment analysis,” CoRR, Austria, May 3-7, 2021. OpenReview.net, 2021.
vol. abs/2109.08306, 2021. [64] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal,
[51] P. Yang, X. Sun, W. Li, S. Ma, W. Wu, and H. Wang, “SGM: sequence G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable
generation model for multi-label classification,” in Proceedings of the visual models from natural language supervision,” in International
27th International Conference on Computational Linguistics, COLING conference on machine learning, 2021, pp. 8748–8763.
2018, Santa Fe, New Mexico, USA, August 20-26, 2018, 2018, pp. [65] “Gpt-4v(ision) system card,” 2023. [Online]. Available: https:
3915–3926. //api.semanticscholar.org/CorpusID:263218031
[52] Q. Ma, C. Yuan, W. Zhou, and S. Hu, “Label-specific dual graph [66] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut,
neural network for multi-label text classification,” in Proceedings of the J. Schalkwyk, A. M. Dai, A. Hauth et al., “Gemini: a family of highly
59th Annual Meeting of the Association for Computational Linguistics capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.
and the 11th International Joint Conference on Natural Language [67] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson,
Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo:
Event, August 1-6, 2021, 2021, pp. 3855–3864. a visual language model for few-shot learning,” Advances in Neural
[53] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, Information Processing Systems, vol. 35, pp. 23 716–23 736, 2022.
“Distributed representations of words and phrases and their compo- [68] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-
sitionality,” in Advances in Neural Information Processing Systems 26: image pre-training with frozen image encoders and large language
27th Annual Conference on Neural Information Processing Systems models,” arXiv preprint arXiv:2301.12597, 2023.
2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, [69] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: En-
Nevada, United States, 2013, pp. 3111–3119. hancing vision-language understanding with advanced large language
[54] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors models,” arXiv preprint arXiv:2304.10592, 2023.
for word representation,” in Proceedings of the 2014 Conference on [70] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li,
Empirical Methods in Natural Language Processing, EMNLP 2014, P. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision-
October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special language models with instruction tuning,” 2023.
Interest Group of the ACL, 2014, pp. 1532–1543. [71] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,”
[55] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Advances in neural information processing systems, vol. 36, 2024.
Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” [72] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li,
in Advances in Neural Information Processing Systems 30: Annual X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned
Conference on Neural Information Processing Systems 2017, December language models,” Journal of Machine Learning Research, vol. 25,
4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, no. 70, pp. 1–53, 2024.
H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, Eds., [73] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L. Morency, “Tensor
2017, pp. 5998–6008. fusion network for multimodal sentiment analysis,” in Proceedings
[56] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, of the 2017 Conference on Empirical Methods in Natural Language
V. Stoyanov, and L. Zettlemoyer, “BART: denoising sequence-to- Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11,
sequence pre-training for natural language generation, translation, and 2017, M. Palmer, R. Hwa, and S. Riedel, Eds. Association for
comprehension,” in Proceedings of the 58th Annual Meeting of the Computational Linguistics, 2017, pp. 1103–1114.
Association for Computational Linguistics, ACL 2020, Online, July 5- [74] Y. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L. Morency, and
10, 2020, 2020, pp. 7871–7880. R. Salakhutdinov, “Multimodal transformer for unaligned multimodal
[57] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, language sequences,” in Proceedings of the 57th Conference of the
Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning Association for Computational Linguistics, ACL 2019, Florence, Italy,
with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen,
pp. 140:1–140:67, 2020. D. R. Traum, and L. Màrquez, Eds. Association for Computational
[58] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, Linguistics, 2019, pp. 6558–6569.
B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, [75] C. Chen, H. Hong, J. Guo, and B. Song, “Inter-intra modal representa-
E. Grave, and G. Lample, “Llama: Open and efficient foundation tion augmentation with trimodal collaborative disentanglement network
language models,” CoRR, vol. abs/2302.13971, 2023. for multimodal sentiment analysis,” IEEE ACM Trans. Audio Speech
[59] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, Lang. Process., vol. 31, pp. 1476–1488, 2023.
N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, [76] K. Yang, H. Xu, and K. Gao, “CM-BERT: cross-modal BERT for text-
C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, audio sentiment analysis,” in MM ’20: The 28th ACM International
J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, Conference on Multimedia, Virtual Event / Seattle, WA, USA, October
S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, 12-16, 2020, 2020, pp. 521–528.
I. Kloumann, A. Korenev, P. S. Koura, M. Lachaux, T. Lavril, J. Lee, [77] Z. Lin, B. Liang, Y. Long, Y. Dang, M. Yang, M. Zhang, and
D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, R. Xu, “Modeling intra- and inter-modal relations: Hierarchical graph
I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, contrastive learning for multimodal sentiment analysis,” in Proceedings
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 20
of the 29th International Conference on Computational Linguistics, timodal affective computing,” in Proceedings of the 57th Conference
COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, of the Association for Computational Linguistics, ACL 2019, Florence,
2022, pp. 7124–7135. Italy, July 28- August 2, 2019, Volume 1: Long Papers, 2019, pp. 481–
[78] J. Tang, D. Liu, X. Jin, Y. Peng, Q. Zhao, Y. Ding, and W. Kong, 492.
“BAFN: bi-direction attention based fusion network for multimodal [96] Z. Li, B. Xu, C. Zhu, and T. Zhao, “CLMLF: A contrastive learning
sentiment analysis,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, and multi-layer fusion method for multimodal sentiment detection,”
no. 4, pp. 1966–1978, 2023. in Findings of the Association for Computational Linguistics: NAACL
[79] C. Huang, J. Zhang, X. Wu, Y. Wang, M. Li, and X. Huang, “Tefna: 2022, Seattle, WA, United States, July 10-15, 2022, 2022, pp. 2282–
Text-centered fusion network with crossmodal attention for multimodal 2294.
sentiment analysis,” Knowl. Based Syst., vol. 269, p. 110502, 2023. [97] P. P. Liang, Z. Liu, A. Zadeh, and L. Morency, “Multimodal language
[80] X. Zhang and Y. Li, “A cross-modality context fusion and analysis with recurrent multistage fusion,” in Proceedings of the 2018
semantic refinement network for emotion recognition in conversation,” Conference on Empirical Methods in Natural Language Processing,
in Proceedings of the 61st Annual Meeting of the Association for Brussels, Belgium, October 31 - November 4, 2018, 2018, pp. 150–
Computational Linguistics (Volume 1: Long Papers). Toronto, Canada: 161.
Association for Computational Linguistics, Jul. 2023, pp. 13 099– [98] J. Tang, K. Li, X. Jin, A. Cichocki, Q. Zhao, and W. Kong, “CTFN:
13 110. [Online]. Available: https://aclanthology.org/2023.acl-long.732 hierarchical learning for multimodal sentiment analysis using coupled-
[81] T. Shi and S. Huang, “Multiemo: An attention-based correlation-aware translation fusion network,” in Proceedings of the 59th Annual Meeting
multimodal fusion framework for emotion recognition in conversa- of the Association for Computational Linguistics and the 11th Interna-
tions,” in Proceedings of the 61st Annual Meeting of the Association tional Joint Conference on Natural Language Processing, ACL/IJCNLP
for Computational Linguistics (Volume 1: Long Papers), ACL 2023, 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, 2021,
Toronto, Canada, July 9-14, 2023, 2023, pp. 14 752–14 766. pp. 5301–5311.
[82] X. Chen, “Mmrbn: Rule-based network for multimodal emotion recog- [99] Z. Li, Q. Guo, Y. Pan, W. Ding, J. Yu, Y. Zhang, W. Liu, H. Chen,
nition,” in ICASSP 2024-2024 IEEE International Conference on H. Wang, and Y. Xie, “Multi-level correlation mining framework with
Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. self-supervised label generation for multimodal sentiment analysis,” Inf.
8200–8204. Fusion, vol. 99, p. 101891, 2023.
[83] X. Li, J. Liu, Y. Xie, P. Gong, X. Zhang, and H. He, “Magdra: a multi- [100] J. Peng, T. Wu, W. Zhang, F. Cheng, S. Tan, F. Yi, and Y. Huang,
modal attention graph network with dynamic routing-by-agreement for “A fine-grained modal label-based multi-stage network for multimodal
multi-label emotion recognition,” Knowledge-Based Systems, vol. 283, sentiment analysis,” Expert Syst. Appl., vol. 221, p. 119721, 2023.
p. 111126, 2024. [101] H. Luo, L. Ji, Y. Huang, B. Wang, S. Ji, and T. Li, “Scalevlad:
[84] N. K. Devulapally, S. Anand, S. D. Bhattacharjee, J. Yuan, and Improving multimodal sentiment analysis via multi-scale fusion of
Y. Chang, “Amuse: Adaptive multimodal analysis for speaker emotion locally descriptors,” CoRR, vol. abs/2112.01368, 2021.
recognition in group conversations,” CoRR, vol. abs/2401.15164, 2024. [102] S. Mai, Y. Zhao, Y. Zeng, J. Yao, and H. Hu, “Meta-learn unimodal
[85] W. Han, H. Chen, and S. Poria, “Improving multimodal fusion with signals with weak supervision for multimodal sentiment analysis,”
hierarchical mutual information maximization for multimodal senti- arXiv preprint arXiv:2408.16029, 2024.
ment analysis,” in Proceedings of the 2021 Conference on Empirical
[103] S. Minglong, O. Chunping, L. Yongbin, and R. Lin, “Multimodal emo-
Methods in Natural Language Processing, EMNLP 2021, Virtual Event
tion recognition based on hierarchical fusion strategy and contextual
/ Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens,
information embedding,” Beijing Da Xue Xue Bao, vol. 60, no. 3, pp.
X. Huang, L. Specia, and S. W. Yih, Eds., pp. 9180–9192.
393–402, 2024.
[86] S. Mai, Y. Zeng, S. Zheng, and H. Hu, “Hybrid contrastive learning
[104] J. Yang, Y. Wang, R. Yi, Y. Zhu, A. Rehman, A. Zadeh, S. Poria, and
of tri-modal representation for multimodal sentiment analysis,” CoRR,
L. Morency, “MTAG: modal-temporal attention graph for unaligned
vol. abs/2109.01797, 2021.
human multimodal language sequences,” in Proceedings of the 2021
[87] J. Zheng, S. Zhang, X. Wang, and Z. Zeng, “Multimodal representa-
Conference of the North American Chapter of the Association for
tions learning based on mutual information maximization and mini-
Computational Linguistics: Human Language Technologies, NAACL-
mization and identity embedding for multimodal sentiment analysis,”
HLT 2021, Online, June 6-11, 2021, 2021, pp. 1009–1021.
arXiv preprint arXiv:2201.03969, 2022.
[88] D. Hazarika, R. Zimmermann, and S. Poria, “MISA: modality-invariant [105] N. Majumder, D. Hazarika, A. Gelbukh, E. Cambria, and S. Poria,
and -specific representations for multimodal sentiment analysis,” in “Multimodal sentiment analysis using hierarchical fusion with context
MM ’20: The 28th ACM International Conference on Multimedia, modeling,” Knowledge-based systems, vol. 161, pp. 124–133, 2018.
Virtual Event / Seattle, WA, USA, October 12-16, 2020, C. W. Chen, [106] J. Zhao, R. Li, and Q. Jin, “Missing modality imagination network for
R. Cucchiara, X. Hua, G. Qi, E. Ricci, Z. Zhang, and R. Zimmermann, emotion recognition with uncertain missing modalities,” in Proceedings
Eds. ACM, 2020, pp. 1122–1131. of the 59th Annual Meeting of the Association for Computational
[89] L. Xiao, X. Wu, S. Yang, J. Xu, J. Zhou, and L. He, “Cross-modal Linguistics and the 11th International Joint Conference on Natural
fine-grained alignment and fusion network for multimodal aspect-based Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers),
sentiment analysis,” Inf. Process. Manag., vol. 60, no. 6, p. 103508, Virtual Event, August 1-6, 2021, 2021, pp. 2608–2618.
2023. [107] S. Parthasarathy and S. Sundaram, “Training strategies to handle miss-
[90] Z. Zhang, Z. Wang, X. Li, N. Liu, B. Guo, and Z. Yu, “Modalnet: an ing modalities for audio-visual expression recognition,” in Compan-
aspect-level sentiment classification model by exploring multimodal ion Publication of the 2020 International Conference on Multimodal
data with fusion discriminant attentional network,” World Wide Web, Interaction, ICMI Companion 2020, Virtual Event, The Netherlands,
vol. 24, no. 6, pp. 1957–1974, 2021. October, 2020, 2020, pp. 400–404.
[91] Y. Zhang, M. Chen, J. Shen, and C. Wang, “Tailor versatile multi- [108] N. Wang, H. Cao, J. Zhao, R. Chen, D. Yan, and J. Zhang, “M2R2:
modal learning for multi-label emotion recognition,” in Proceedings of missing-modality robust emotion recognition framework with iterative
the AAAI Conference on Artificial Intelligence, vol. 36, no. 8, 2022, data augmentation,” IEEE Trans. Artif. Intell., vol. 4, no. 5, pp. 1305–
pp. 9100–9108. 1316, 2023.
[92] S. Ge, Z. Jiang, Z. Cheng, C. Wang, Y. Yin, and Q. Gu, “Learning [109] J. Zeng, J. Zhou, and T. Liu, “Mitigating inconsistencies in multimodal
robust multi-modal representation for multi-label emotion recognition sentiment analysis under uncertain missing modalities,” in Proceedings
via adversarial masking and perturbation,” in Proceedings of the ACM of the 2022 Conference on Empirical Methods in Natural Language
Web Conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023 - Processing, 2022, pp. 2924–2934.
4 May 2023, 2023, pp. 1510–1518. [110] Z. Yuan, W. Li, H. Xu, and W. Yu, “Transformer-based feature
[93] M. Sharafi, M. Yazdchi, R. Rasti, and F. Nasimi, “A novel spatio- reconstruction network for robust multimodal sentiment analysis,” in
temporal convolutional neural framework for multimodal emotion Proceedings of the 29th ACM International Conference on Multimedia,
recognition,” Biomed. Signal Process. Control., vol. 78, p. 103970, 2021, pp. 4400–4407.
2022. [111] W. Luo, M. Xu, and H. Lai, “Multimodal reconstruct and align net for
[94] Y. Li, W. Weng, and C. Liu, “Tscl-fhfn: two-stage contrastive learning missing modality problem in sentiment analysis,” in MultiMedia Mod-
and feature hierarchical fusion network for multimodal sentiment eling - 29th International Conference, MMM 2023, Bergen, Norway,
analysis,” Neural Computing and Applications, pp. 1–15, 2024. January 9-12, 2023, Proceedings, Part II, 2023, pp. 411–422.
[95] S. Mai, H. Hu, and S. Xing, “Divide, conquer and combine: Hierarchi- [112] C. Shang, A. Palmer, J. Sun, K. Chen, J. Lu, and J. Bi, “VIGAN:
cal feature fusion network with local and global perspectives for mul- missing view imputation with generative adversarial networks,” in 2017
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 21
IEEE International Conference on Big Data (IEEE BigData 2017), [131] C. Du, C. Du, H. Wang, J. Li, W.-L. Zheng, B.-L. Lu, and H. He,
Boston, MA, USA, December 11-14, 2017, 2017, pp. 766–775. “Semi-supervised deep generative modelling of incomplete multi-
[113] J. Zeng, T. Liu, and J. Zhou, “Tag-assisted multimodal sentiment modality emotional data,” in Proceedings of the 26th ACM interna-
analysis under uncertain missing modalities,” in SIGIR ’22: The 45th tional conference on Multimedia, 2018, pp. 108–116.
International ACM SIGIR Conference on Research and Development [132] Z. Wang, Z. Wan, and X. Wan, “Transmodality: An end2end fusion
in Information Retrieval, Madrid, Spain, July 11 - 15, 2022, 2022, pp. method with transformer for multimodal sentiment analysis,” in WWW
1545–1554. ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020,
[114] H. Zuo, R. Liu, J. Zhao, G. Gao, and H. Li, “Exploiting modality- 2020, pp. 2514–2520.
invariant feature for robust multimodal emotion recognition with miss- [133] F. Ma, S. Huang, and L. Zhang, “An efficient approach for audio-visual
ing modalities,” in IEEE International Conference on Acoustics, Speech emotion recognition with missing labels and missing modalities,” in
and Signal Processing ICASSP 2023, Rhodes Island, Greece, June 4- 2021 IEEE International Conference on Multimedia and Expo, ICME
10, 2023, 2023, pp. 1–5. 2021, Shenzhen, China, July 5-9, 2021, 2021, pp. 1–6.
[115] Z. Liu, B. Zhou, D. Chu, Y. Sun, and L. Meng, “Modality translation- [134] W. Yu, H. Xu, Z. Yuan, and J. Wu, “Learning modality-specific
based multimodal sentiment analysis under uncertain missing modali- representations with self-supervised multi-task learning for multimodal
ties,” Inf. Fusion, vol. 101, p. 101973, 2024. sentiment analysis,” in Thirty-Fifth AAAI Conference on Artificial
[116] T. Zhou, S. Canu, P. Vera, and S. Ruan, “Feature-enhanced generation Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Appli-
and multi-modality fusion based deep neural network for brain tumor cations of Artificial Intelligence, IAAI 2021, The Eleventh Symposium
segmentation with missing MR modalities,” Neurocomputing, vol. 466, on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual
pp. 102–112, 2021. Event, February 2-9, 2021, pp. 10 790–10 797.
[117] J. Vazquez-Rodriguez, G. Lefebvre, J. Cumin, and J. L. Crowley, [135] S. Mai, H. Hu, and S. Xing, “Modality to modality translation:
“Accommodating missing modalities in time-continuous multimodal An adversarial representation learning and graph fusion network for
emotion recognition,” in 11th International Conference on Affective multimodal fusion,” in The Thirty-Fourth AAAI Conference on Artificial
Computing and Intelligent Interaction, ACII 2023, Cambridge, MA, Intelligence, AAAI 2020, The Thirty-Second Innovative Applications
USA, September 10-13, 2023. IEEE, 2023, pp. 1–8. of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI
[118] H. Mao, B. Zhang, H. Xu, Z. Yuan, and Y. Liu, “Robust-msa: Symposium on Educational Advances in Artificial Intelligence, EAAI
Understanding the impact of modality noise on multimodal sentiment 2020, New York, NY, USA, February 7-12, 2020. AAAI Press, 2020,
analysis,” in Thirty-Seventh AAAI Conference on Artificial Intelligence, pp. 164–172.
AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Ar- [136] M. S. Akhtar, D. S. Chauhan, D. Ghosal, S. Poria, A. Ekbal, and
tificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational P. Bhattacharyya, “Multi-task learning for multi-modal emotion recog-
Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, nition and sentiment analysis,” in Proceedings of the 2019 Conference
February 7-14, 2023, 2023, pp. 16 458–16 460. of the North American Chapter of the Association for Computa-
[119] S. Lai, X. Hu, Y. Li, Z. Ren, Z. Liu, and D. Miao, “Shared and tional Linguistics: Human Language Technologies, NAACL-HLT 2019,
private information learning in multimodal sentiment analysis with Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short
deep modal alignment and self-supervised multi-task learning,” arXiv Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for
preprint arXiv:2305.08473, 2023. Computational Linguistics, 2019, pp. 370–379.
[120] X. Zhang, W. Cui, B. Hu, and Y. Li, “A multi-level alignment and cross- [137] R. Chen, W. Zhou, Y. Li, and H. Zhou, “Video-based cross-modal
modal unified semantic graph refinement network for conversational auxiliary network for multimodal sentiment analysis,” IEEE Trans.
emotion recognition,” IEEE Transactions on Affective Computing, Circuits Syst. Video Technol., vol. 32, no. 12, pp. 8703–8716, 2022.
2024. [138] Y. Zeng, W. Yan, S. Mai, and H. Hu, “Disentanglement translation
[121] E. Shutova, D. Kiela, and J. Maillard, “Black holes and white rabbits: network for multimodal sentiment analysis,” Inf. Fusion, vol. 102, p.
Metaphor identification with visual features,” in NAACL HLT 2016, The 102031, 2024.
2016 Conference of the North American Chapter of the Association for
[139] D. Sun, Y. He, and J. Han, “Using auxiliary tasks in multimodal fusion
Computational Linguistics: Human Language Technologies, San Diego
of wav2vec 2.0 and BERT for multimodal emotion recognition,” CoRR,
California, USA, June 12-17, 2016, 2016, pp. 160–170.
vol. abs/2302.13661, 2023.
[122] S. Moon, S. Kim, and H. Wang, “Multimodal transfer deep learning
for audio visual recognition,” CoRR, vol. abs/1412.3121, 2014. [140] Z. Zhao, Y. Wang, G. Shen, Y. Xu, and J. Zhang, “Tdfnet: Transformer-
[123] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, based deep-scale fusion network for multimodal emotion recognition,”
“Recent advances in the automatic recognition of audiovisual speech,” IEEE ACM Trans. Audio Speech Lang. Process., vol. 31, pp. 3771–
Proceedings of the IEEE, vol. 91, no. 9, pp. 1306–1326, 2003. 3782, 2023.
[124] G. Evangelopoulos, A. Zlatintsi, A. Potamianos, P. Maragos, K. Ra- [141] M. Ren, X. Huang, J. Liu, M. Liu, X. Li, and A. Liu, “MALN:
pantzikos, G. Skoumas, and Y. Avrithis, “Multimodal saliency and multimodal adversarial learning network for conversational emotion
fusion for movie summarization based on aural, visual, and textual recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 11,
attention,” IEEE Trans. Multim., vol. 15, no. 7, pp. 1553–1568, 2013. pp. 6965–6980, 2023.
[125] M. Glodek, S. Tschechne, G. Layher, M. Schels, T. Brosch, S. Scherer, [142] F. Liu, S. Shen, Z. Fu, H. Wang, A. Zhou, and J. Qi, “LGCCT: A light
M. Kächele, M. Schmidt, H. Neumann, G. Palm, and F. Schwenker, gated and crossed complementation transformer for multimodal speech
“Multiple classifier systems for the classification of audio-visual emo- emotion recognition,” Entropy, vol. 24, no. 7, p. 1010, 2022.
tional states,” in Affective Computing and Intelligent Interaction - [143] C. Zhang, Y. Zhang, and B. Cheng, “Rl-emo: A reinforcement learning
Fourth International Conference, ACII 2011, Memphis, TN, USA, framework for multimodal emotion recognition,” in ICASSP 2024 -
October 9-12, 2011, Proceedings, Part II, 2011, pp. 359–368. 2024 IEEE International Conference on Acoustics, Speech and Signal
[126] S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, and L.- Processing (ICASSP), 2024, pp. 10 246–10 250.
P. Morency, “Context-dependent sentiment analysis in user-generated [144] L. Yang, J. Na, and J. Yu, “Cross-modal multitask transformer for
videos,” in Proceedings of the 55th annual meeting of the association end-to-end multimodal aspect-based sentiment analysis,” Inf. Process.
for computational linguistics (volume 1: Long papers), 2017, pp. 873– Manag., vol. 59, no. 5, p. 103038, 2022.
883. [145] R. Jain, A. Singh, V. K. Gangwar, and S. Saha, “Abcord: Exploit-
[127] A. I. Middya, B. Nag, and S. Roy, “Deep learning based multimodal ing multimodal generative approach for aspect-based complaint and
emotion recognition using model-level fusion of audio-visual modali- rationale detection,” in Proceedings of the 31st ACM International
ties,” Knowl. Based Syst., vol. 244, p. 108580, 2022. Conference on Multimedia, MM 2023, Ottawa, ON, Canada, 29
[128] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine October 2023- 3 November 2023, 2023, pp. 8571–8579.
learning: A survey and taxonomy,” IEEE transactions on pattern [146] X. Ju, D. Zhang, R. Xiao, J. Li, S. Li, M. Zhang, and G. Zhou,
analysis and machine intelligence, vol. 41, no. 2, pp. 423–443, 2018. “Joint multi-modal aspect-sentiment analysis with auxiliary cross-
[129] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A survey of affect modal relation detection,” in Proceedings of the 2021 Conference on
recognition methods: Audio, visual, and spontaneous expressions,” Empirical Methods in Natural Language Processing, EMNLP 2021,
IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 1, pp. 39–58, Virtual Event / Punta Cana, Dominican Republic, 7-11 November,
2009. 2021, 2021, pp. 4395–4405.
[130] N. Majumder, D. Hazarika, A. F. Gelbukh, E. Cambria, and S. Poria, [147] H. Lian, C. Lu, S. Li, Y. Zhao, C. Tang, Y. Zong, and W. Zheng,
“Multimodal sentiment analysis using hierarchical fusion with context “Label distribution adaptation for multimodal emotion recognition with
modeling,” Knowl. Based Syst., vol. 161, pp. 124–133, 2018. multi-label learning,” in Proceedings of the 1st International Workshop
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 22
on Multimodal and Responsible Affective Computing, MRAC 2023, [165] M. Chen and X. Li, “Swafn: Sentimental words aware fusion network
Ottawa, ON, Canada, 29 October 2023, 2023, pp. 51–58. for multimodal sentiment analysis,” in Proceedings of the 28th interna-
[148] “Aobert: All-modalities-in-one bert for multimodal sentiment analysis,” tional conference on computational linguistics, 2020, pp. 1067–1077.
Information Fusion, vol. 92, pp. 37–45, 2023. [166] Y. Fu, S. Okada, L. Wang, L. Guo, Y. Song, J. Liu, and J. Dang,
[149] F. Qian, J. Han, Y. He, T. Zheng, and G. Zheng, “Sentiment knowledge “Context- and knowledge-aware graph convolutional network for mul-
enhanced self-supervised learning for multimodal sentiment analysis,” timodal emotion recognition,” IEEE Multim., vol. 29, no. 3, pp. 91–100,
in Findings of the Association for Computational Linguistics: ACL 2022.
2023, 2023, pp. 12 966–12 978. [167] Y. Li, Y. Wang, and Z. Cui, “Decoupled multimodal distilling for
[150] M. Arjmand, M. J. Dousti, and H. Moradi, “TEASEL: A transformer- emotion recognition,” in IEEE/CVF Conference on Computer Vision
based speech-prefixed language model,” CoRR, vol. abs/2109.05522, and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June
2021. 17-24, 2023, 2023, pp. 6631–6640.
[168] X. Sun, H. He, H. Tang, K. Zeng, and T. Shen, “Multimodal rough
[151] J. Yu and J. Jiang, “Adapting BERT for target-oriented multimodal
set transformer for sentiment analysis and emotion recognition,” in 9th
sentiment classification,” in Proceedings of the Twenty-Eighth Interna-
IEEE International Conference on Cloud Computing and Intelligent
tional Joint Conference on Artificial Intelligence, IJCAI 2019, Macao,
Systems, CCIS 2023, Dali, China, August 12-13, 2023, 2023, pp. 250–
China, August 10-16, 2019, 2019, pp. 5408–5414.
259.
[152] J. Cheng, I. Fostiropoulos, B. W. Boehm, and M. Soleymani, “Mul- [169] P. Wang, S. Zeng, J. Chen, L. Fan, M. Chen, Y. Wu, and X. He,
timodal phased transformer for sentiment analysis,” in Proceedings “Leveraging label information for multimodal emotion recognition,”
of the 2021 Conference on Empirical Methods in Natural Language CoRR, vol. abs/2309.02106, 2023.
Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican [170] P. Yuan, G. Cai, M. Chen, and X. Tang, “Topics guided multimodal fu-
Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and sion network for conversational emotion recognition,” in International
S. W. Yih, Eds. Association for Computational Linguistics, 2021, pp. Conference on Intelligent Computing. Springer, 2024, pp. 250–262.
2447–2458. [171] F. Zhang, X. Li, C. P. Lim, Q. Hua, C. Dong, and J. Zhai, “Deep
[153] H. Zhang, Y. Wang, G. Yin, K. Liu, Y. Liu, and T. Yu, “Learn- emotional arousal network for multimodal sentiment analysis and
ing language-guided adaptive hyper-modality representation for mul- emotion recognition,” Inf. Fusion, vol. 88, pp. 296–304, 2022.
timodal sentiment analysis,” CoRR, vol. abs/2310.05804, 2023. [172] Z. Xu, Q. Su, and J. Xiao, “Multimodal aspect-based sentiment clas-
[154] J. Li, X. Wang, and Z. Zeng, “Tracing intricate cues in dialogue: sification with knowledge-injected transformer,” in IEEE International
Joint graph structure and sentiment dynamics for multimodal emotion Conference on Multimedia and Expo, ICME 2023, Brisbane, Australia,
recognition,” arXiv preprint arXiv:2407.21536, 2024. July 10-14, 2023, 2023, pp. 1379–1384.
[155] K. Liu, J. Wang, and X. Zhang, “Entity-related unsupervised pretraining [173] H. Yang, Y. Zhao, and B. Qin, “Face-sensitive image-to-emotional-
with visual prompts for multimodal aspect-based sentiment analysis,” in text cross-modal translation for multimodal aspect-based sentiment
Natural Language Processing and Chinese Computing - 12th National analysis,” in Proceedings of the 2022 Conference on Empirical Methods
CCF Conference, NLPCC 2023, Foshan, China, October 12-15, 2023, in Natural Language Processing, EMNLP 2022, Abu Dhabi, United
Proceedings, Part II, 2023, pp. 481–493. Arab Emirates, December 7-11, 2022, 2022, pp. 3324–3335.
[156] Y. Ling, J. Yu, and R. Xia, “Vision-language pre-training for multi- [174] L. Xiao, X. Wu, S. Yang, J. Xu, J. Zhou, and L. He, “Cross-modal
modal aspect-based sentiment analysis,” in Proceedings of the 60th fine-grained alignment and fusion network for multimodal aspect-based
Annual Meeting of the Association for Computational Linguistics sentiment analysis,” Information Processing & Management, vol. 60,
(Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, no. 6, p. 103508, 2023.
2022, pp. 2149–2159. [175] J. Yu, K. Chen, and R. Xia, “Hierarchical interactive multimodal
[157] K. Zhang, K. Zhang, M. Zhang, H. Zhao, Q. Liu, W. Wu, and E. Chen, transformer for aspect-based multimodal sentiment analysis,” IEEE
“Incorporating dynamic semantics into pre-trained language model for Transactions on Affective Computing, 2022.
aspect-based sentiment analysis,” in Findings of the Association for [176] W. Zheng, J. Yu, and R. Xia, “A unimodal valence-arousal driven
Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, contrastive learning framework for multimodal multi-label emotion
2022, 2022, pp. 3599–3610. recognition,” in ACM Multimedia 2024.
[158] Z. Yu, J. Wang, L. Yu, and X. Zhang, “Dual-encoder transformers [177] C. Peng, K. Chen, L. Shou, and G. Chen, “Carat: Contrastive feature
with cross-modal alignment for multimodal aspect-based sentiment reconstruction and aggregation for multi-modal multi-label emotion
analysis,” in Proceedings of the 2nd Conference of the Asia-Pacific recognition,” in Proceedings of the AAAI Conference on Artificial
Chapter of the Association for Computational Linguistics and the Intelligence, vol. 38, no. 13, 2024, pp. 14 581–14 589.
12th International Joint Conference on Natural Language Processing, [178] J. Zhao, Y. Zhao, and J. Li, “M3tr: Multi-modal multi-label recognition
AACL/IJCNLP 2022 - Volume 1: Long Papers, Online Only, November with transformer,” in Proceedings of the 29th ACM international
20-23, 2022, 2022, pp. 414–423. conference on multimedia, 2021, pp. 469–477.
[159] H. Jin, J. Tan, L. Liu, L. Qiu, S. Yao, X. Chen, and X. Zeng, “MSRA: [179] D. Zhang, X. Ju, W. Zhang, J. Li, S. Li, Q. Zhu, and G. Zhou, “Multi-
A multi-aspect semantic relevance approach for e-commerce via mul- modal multi-label emotion recognition with heterogeneous hierarchical
timodal pre-training,” in Proceedings of the 32nd ACM International message passing,” in Proceedings of the AAAI Conference on Artificial
Conference on Information and Knowledge Management, CIKM 2023, Intelligence, vol. 35, no. 16, 2021, pp. 14 338–14 346.
Birmingham, United Kingdom, October 21-25, 2023, 2023, pp. 3988– [180] D. S. Chauhan, M. S. Akhtar, A. Ekbal, and P. Bhattacharyya, “Context-
3992. aware interactive attention for multi-modal sentiment and emotion
analysis,” in Proceedings of the 2019 Conference on Empirical Meth-
[160] Q. Wang, H. Xu, Z. Wen, B. Liang, M. Yang, B. Qin, and R. Xu,
ods in Natural Language Processing and the 9th International Joint
“Image-to-text conversion and aspect-oriented filtration for multimodal
Conference on Natural Language Processing, EMNLP-IJCNLP 2019,
aspect-based sentiment analysis,” IEEE Transactions on Affective Com-
Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and
puting, no. 01, pp. 1–15, 2023.
X. Wan, Eds. Association for Computational Linguistics, 2019, pp.
[161] C. Wang, Y. Luo, C. Meng, and F. Yuan, “An adaptive dual graph 5646–5656.
convolution fusion network for aspect-based sentiment analysis,” ACM [181] S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, and
Transactions on Asian and Low-Resource Language Information Pro- L. Morency, “Multi-level multiple attentions for contextual multimodal
cessing, 2024. sentiment analysis,” in 2017 IEEE International Conference on Data
[162] J. Mu, F. Nie, W. Wang, J. Xu, J. Zhang, and H. Liu, “Mocolnet: Mining, ICDM 2017, New Orleans, LA, USA, November 18-21, 2017,
A momentum contrastive learning network for multimodal aspect- 2017, pp. 1033–1038.
level sentiment analysis,” IEEE Transactions on Knowledge and Data [182] M. Huang, C. Qing, J. Tan, and X. Xu, “Context-based adaptive multi-
Engineering, 2023. modal fusion network for continuous frame-level sentiment prediction,”
[163] D. Wang, X. Guo, Y. Tian, J. Liu, L. He, and X. Luo, “TETFN: A IEEE ACM Trans. Audio Speech Lang. Process., vol. 31, pp. 3468–
text enhanced transformer fusion network for multimodal sentiment 3477, 2023.
analysis,” Pattern Recognit., vol. 136, p. 109259, 2023. [183] Z. Li, Y. Sun, L. Zhang, and J. Tang, “Ctnet: Context-based tandem
[164] C. Zhu, M. Chen, S. Zhang, C. Sun, H. Liang, Y. Liu, and J. Chen, network for semantic segmentation,” IEEE Trans. Pattern Anal. Mach.
“SKEAFN: sentiment knowledge enhanced attention fusion network Intell., vol. 44, no. 12, pp. 9904–9917, 2022.
for multimodal sentiment analysis,” Inf. Fusion, vol. 100, p. 101958, [184] X. Sun, X. Ren, and X. Xie, “A novel multimodal sentiment analysis
2023. model based on gated fusion and multi-task learning,” in ICASSP 2024-
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 23
2024 IEEE International Conference on Acoustics, Speech and Signal Event, August 1-6, 2021. Association for Computational Linguistics,
Processing (ICASSP). IEEE, 2024, pp. 8336–8340. 2021.
[185] J. Hu, Y. Liu, J. Zhao, and Q. Jin, “MMGCN: multimodal fusion via [203] F. Wang, S. Tian, L. Yu, J. Liu, J. Wang, K. Li, and Y. Wang,
deep graph convolution network for emotion recognition in conversa- “TEDT: transformer-based encoding-decoding translation network for
tion,” in Proceedings of the 59th Annual Meeting of the Association for multimodal sentiment analysis,” Cogn. Comput., vol. 15, no. 1, pp.
Computational Linguistics and the 11th International Joint Conference 289–303, 2023.
on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long [204] Z. Yu, J. Wang, L.-C. Yu, and X. Zhang, “Dual-encoder transformers
Papers), Virtual Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and with cross-modal alignment for multimodal aspect-based sentiment
R. Navigli, Eds. Association for Computational Linguistics, 2021, pp. analysis,” in Proceedings of the 2nd Conference of the Asia-Pacific
5666–5675. Chapter of the Association for Computational Linguistics and the
[186] D. Hu, X. Hou, L. Wei, L. Jiang, and Y. Mo, “MM-DFN: multimodal 12th International Joint Conference on Natural Language Processing
dynamic fusion network for emotion recognition in conversations,” (Volume 1: Long Papers), 2022, pp. 414–423.
in IEEE International Conference on Acoustics, Speech and Signal [205] Y. Ge, D. Chen, and H. Li, “Mutual mean-teaching: Pseudo label re-
Processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022, finery for unsupervised domain adaptation on person re-identification,”
2022, pp. 7037–7041. arXiv preprint arXiv:2001.01526, 2020.
[187] D. Zhang, F. Chen, J. Chang, X. Chen, and Q. Tian, “Structure
[206] H. Pham, Z. Dai, Q. Xie, and Q. V. Le, “Meta pseudo labels,” in IEEE
aware multi-graph network for multi-modal emotion recognition in
Conference on Computer Vision and Pattern Recognition, CVPR 2021,
conversations,” IEEE Trans. Multim., vol. 26, pp. 3987–3997, 2024.
virtual, June 19-25, 2021. Computer Vision Foundation / IEEE, 2021,
[188] F. Chen, J. Shao, S. Zhu, and H. T. Shen, “Multivariate, multi-frequency
pp. 11 557–11 568.
and multimodal: Rethinking graph neural networks for emotion recog-
nition in conversation,” in IEEE/CVF Conference on Computer Vision [207] Y. Zhang, M. Zhang, S. Wu, and J. Zhao, “Towards unifying the label
and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June space for aspect- and sentence-based sentiment analysis,” in Findings
17-24, 2023, 2023, pp. 10 761–10 770. of the Association for Computational Linguistics: ACL 2022, Dublin,
[189] B. Yao and W. Shi, “Speaker-centric multimodal fusion networks for Ireland, May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio,
emotion recognition in conversations,” in ICASSP 2024 - 2024 IEEE Eds. Association for Computational Linguistics, 2022, pp. 20–30.
International Conference on Acoustics, Speech and Signal Processing [208] Y. Zhang, J. Wang, Y. Liu, L. Rong, Q. Zheng, D. Song, P. Tiwari, and
(ICASSP), 2024, pp. 8441–8445. J. Qin, “A multitask learning model for multimodal sarcasm, sentiment
[190] Z. Li, F. Tang, M. Zhao, and Y. Zhu, “Emocaps: Emotion capsule and emotion recognition in conversations,” Inf. Fusion, vol. 93, pp.
based model for conversational emotion recognition,” arXiv preprint 282–301, 2023.
arXiv:2203.13504, 2022. [209] M. S. Akhtar, D. S. Chauhan, D. Ghosal, S. Poria, A. Ekbal, and
[191] J. Li, X. Wang, G. Lv, and Z. Zeng, “GA2MIF: graph and attention P. Bhattacharyya, “Multi-task learning for multi-modal emotion recog-
based two-stage multi-source information fusion for conversational nition and sentiment analysis,” arXiv preprint arXiv:1905.05812, 2019.
emotion detection,” IEEE Trans. Affect. Comput., vol. 15, no. 1, pp. [210] T. Xie, C. H. Wu, P. Shi, R. Zhong, T. Scholak, M. Yasunaga, C. Wu,
130–143, 2024. M. Zhong, P. Yin, S. I. Wang, V. Zhong, B. Wang, C. Li, C. Boyle,
[192] C. Xu, X. Luo, and D. Wang, “MCPR: A chinese product review A. Ni, Z. Yao, D. R. Radev, C. Xiong, L. Kong, R. Zhang, N. A. Smith,
dataset for multimodal aspect-based sentiment analysis,” in Cognitive L. Zettlemoyer, and T. Yu, “Unifiedskg: Unifying and multi-tasking
Computing - ICCC 2022 - 6th International Conference, Held as Part structured knowledge grounding with text-to-text language models,”
of the Services Conference Federation, SCF 2022, Honolulu, HI, USA, CoRR, vol. abs/2201.05966, 2022.
December 10-14, 2022, Proceedings, 2022, pp. 83–90. [211] J. Lu, D. Batra, D. Parikh, and S. Lee, “Vilbert: Pretraining task-
[193] M. Anschütz, T. Eder, and G. Groh, “Retrieving users’ opinions on agnostic visiolinguistic representations for vision-and-language tasks,”
social media with multimodal aspect-based sentiment analysis,” in 17th in Advances in Neural Information Processing Systems 32: Annual
IEEE International Conference on Semantic Computing, ICSC 2023, Conference on Neural Information Processing Systems 2019, NeurIPS
Laguna Hills, CA, USA, February 1-3, 2023, 2023, pp. 1–8. 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach,
[194] F. Zhao, C. Li, Z. Wu, Y. Ouyang, J. Zhang, and X. Dai, “M2DF: H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Gar-
multi-grained multi-curriculum denoising framework for multimodal nett, Eds., 2019, pp. 13–23.
aspect-based sentiment analysis,” CoRR, vol. abs/2310.14605, 2023. [212] W. Wang, H. Bao, L. Dong, and F. Wei, “Vlmo: Unified vision-
[195] R. Zhou, W. Guo, X. Liu, S. Yu, Y. Zhang, and X. Yuan, “Aom: language pre-training with mixture-of-modality-experts,” CoRR, vol.
Detecting aspect-oriented information for multimodal aspect-based abs/2111.02358, 2021.
sentiment analysis,” in Findings of the Association for Computational [213] Z. Zhang, X. Meng, Y. Wang, X. Jiang, Q. Liu, and Z. Yang, “Unims:
Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, 2023, pp. A unified framework for multimodal summarization with knowledge
8184–8196. distillation,” in Thirty-Sixth AAAI Conference on Artificial Intelligence,
[196] J. Zhao and F. Yang, “Fusion with gcn and se-resnext network for AAAI 2022, Thirty-Fourth Conference on Innovative Applications of
aspect based multimodal sentiment analysis,” in 2023 IEEE 6th In- Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Edu-
formation Technology, Networking, Electronic and Automation Control cational Advances in Artificial Intelligence, EAAI 2022 Virtual Event,
Conference (ITNEC), vol. 6, 2023, pp. 336–340. February 22 - March 1, 2022, 2022, pp. 11 757–11 764.
[197] D. Zhang, X. Ju, J. Li, S. Li, Q. Zhu, and G. Zhou, “Multi-modal [214] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis,
multi-label emotion detection with modality and label dependence,” in L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized BERT
Proceedings of the 2020 Conference on Empirical Methods in Natural pretraining approach,” CoRR, vol. abs/1907.11692, 2019.
Language Processing, EMNLP 2020, Online, November 16-20, 2020,
[215] S. Qiu, N. Sekhar, and P. Singhal, “Topic and style-aware transformer
2020, pp. 3584–3593.
for multimodal emotion recognition,” in Findings of the Association
[198] X. Ju, D. Zhang, J. Li, and G. Zhou, “Transformer-based label set
for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-
generation for multi-modal multi-label emotion detection,” in MM ’20:
14, 2023, 2023, pp. 2074–2082.
The 28th ACM International Conference on Multimedia, Virtual Event
/ Seattle, WA, USA, October 12-16, 2020, 2020, pp. 512–520. [216] C. Xi, G. Lu, and J. Yan, “Multimodal sentiment analysis based on
[199] J. Zhao, T. Zhang, J. Hu, Y. Liu, Q. Jin, X. Wang, and H. Li, “M3ED: multi-head attention mechanism,” in Proceedings of the 4th interna-
multi-modal multi-scene multi-label emotional dialogue database,” in tional conference on machine learning and soft computing, 2020, pp.
ACL 2022, 2022, pp. 5699–5710. 34–39.
[200] H. Luo, L. Ji, Y. Huang, B. Wang, S. Ji, and T. Li, “Scalevlad: [217] Y. Zhang, D. Song, P. Zhang, P. Wang, J. Li, X. Li, and B. Wang, “A
Improving multimodal sentiment analysis via multi-scale fusion of quantum-inspired multimodal sentiment analysis framework,” Theoret-
locally descriptors,” arXiv preprint arXiv:2112.01368, 2021. ical Computer Science, vol. 752, pp. 21–40, 2018.
[201] Y. Lee, S. Yoon, and K. Jung, “Multimodal speech emotion recog- [218] A. Metallinou, M. Wollmer, A. Katsamanis, F. Eyben, B. Schuller, and
nition using cross attention with aligned audio and text,” CoRR, vol. S. Narayanan, “Context-sensitive learning for enhanced audiovisual
abs/2207.12895, 2022. emotion classification,” IEEE Transactions on Affective Computing,
[202] C. Zong, F. Xia, W. Li, and R. Navigli, Eds., Proceedings of the vol. 3, no. 2, pp. 184–198, 2012.
59th Annual Meeting of the Association for Computational Linguistics [219] Y. Li, K. Zhang, J. Wang, and X. Gao, “A cognitive brain model for
and the 11th International Joint Conference on Natural Language multimodal sentiment analysis based on attention neural networks,”
Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Neurocomputing, vol. 430, pp. 159–173, 2021.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 24
[220] I. Chaturvedi, R. Satapathy, S. Cavallari, and E. Cambria, “Fuzzy [237] M. Wöllmer, F. Weninger, T. Knaup, B. W. Schuller, C. Sun, K. Sagae,
commonsense reasoning for multimodal sentiment analysis,” Pattern and L. Morency, “Youtube movie reviews: Sentiment analysis in an
Recognition Letters, vol. 125, pp. 264–270, 2019. audio-visual context,” IEEE Intell. Syst., vol. 28, no. 3, pp. 46–53,
[221] W. Wu, Y. Wang, S. Xu, and K. Yan, “Sfnn: semantic features 2013.
fusion neural network for multimodal sentiment analysis,” in 2020 [238] Y. Liu, Z. Yuan, H. Mao, Z. Liang, W. Yang, Y. Qiu, T. Cheng, X. Li,
5th International Conference on Automation, Control and Robotics H. Xu, and K. Gao, “Make acoustic and visual cues matter: Ch-sims
Engineering (CACRE), 2020, pp. 661–665. v2. 0 dataset and av-mixup consistent module,” in Proceedings of the
[222] W. Zheng, J. Yu, R. Xia, and S. Wang, “A facial expression-aware 2022 International Conference on Multimodal Interaction, 2022, pp.
multimodal multi-task learning framework for emotion recognition in 247–258.
multi-party conversations,” in Proceedings of the 61st Annual Meeting [239] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and
of the Association for Computational Linguistics (Volume 1: Long R. Mihalcea, “MELD: A multimodal multi-party dataset for emotion
Papers), 2023, pp. 15 445–15 459. recognition in conversations,” in Proceedings of the 57th Conference
[223] H. Ma, J. Wang, H. Lin, B. Zhang, Y. Zhang, and B. Xu, “A of the Association for Computational Linguistics, ACL 2019, Florence,
transformer-based model with self-distillation for multimodal emotion Italy, July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen,
recognition in conversations,” CoRR, vol. abs/2310.20494, 2023. D. R. Traum, and L. Màrquez, Eds., pp. 527–536.
[224] Y. Wang, Y. Li, P. Bell, and C. Lai, “Cross-attention is not enough: [240] C. Busso, M. Bulut, C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N.
Incongruity-aware multimodal sentiment analysis and emotion recog- Chang, S. Lee, and S. S. Narayanan, “IEMOCAP: interactive emotional
nition,” CoRR, vol. abs/2305.13583, 2023. dyadic motion capture database,” Lang. Resour. Evaluation, vol. 42,
[225] Y. Zhang, A. Jia, B. Wang, P. Zhang, D. Zhao, P. Li, Y. Hou, X. Jin, no. 4, pp. 335–359, 2008.
D. Song, and J. Qin, “M3GAT: A multi-modal, multi-task interactive [241] Z. Fang, A. He, Q. Yu, B. Gao, W. Ding, T. Zhang, and L. Ma, “FAF: A
graph attention network for conversational sentiment analysis and novel multimodal emotion recognition approach integrating face, body
emotion recognition,” ACM Trans. Inf. Syst., vol. 42, no. 1, pp. 13:1– and text,” CoRR, vol. abs/2211.15425, 2022.
13:32, 2024. [242] Y. Wang and L. Guan, “Recognizing human emotional state from
[226] Y.-P. Ruan, S. Han, T. Li, and Y. Wu, “Fusing modality-specific audiovisual signals,” IEEE Trans. Multim., vol. 10, no. 4, pp. 659–
representations and decisions for multimodal emotion recognition,” 668, 2008.
in ICASSP 2024-2024 IEEE International Conference on Acoustics, [243] S. Zhalehpour, O. Onder, Z. Akhtar, and C. E. Erdem, “BAUM-1: A
Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 7925–7929. spontaneous audio-visual face database of affective and mental states,”
[227] F. Zhao, C. Li, Z. Wu, Y. Ouyang, J. Zhang, and X. Dai, “M2DF: multi- IEEE Trans. Affect. Comput., vol. 8, no. 3, pp. 300–313, 2017.
grained multi-curriculum denoising framework for multimodal aspect- [244] M. Soleymani, J. Lichtenauer, T. Pun, and M. Pantic, “A multimodal
based sentiment analysis,” in Proceedings of the 2023 Conference on database for affect recognition and implicit tagging,” IEEE Trans.
Empirical Methods in Natural Language Processing, EMNLP 2023, Affect. Comput., vol. 3, no. 1, pp. 42–55, 2012.
Singapore, December 6-10, 2023, 2023, pp. 9057–9070. [245] S. Koelstra, C. Muhl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi,
[228] Y. Zhang, M. Chen, J. Shen, and C. Wang, “Tailor versatile multi- T. Pun, A. Nijholt, and I. Patras, “Deap: A database for emotion
modal learning for multi-label emotion recognition,” in Thirty-Sixth analysis; using physiological signals,” IEEE transactions on affective
AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth computing, vol. 3, no. 1, pp. 18–31, 2011.
Conference on Innovative Applications of Artificial Intelligence, IAAI [246] L. Stappen, A. Baird, L. Schumann, and B. W. Schuller, “The multi-
2022, The Twelveth Symposium on Educational Advances in Artificial modal sentiment analysis in car reviews (muse-car) dataset: Collection,
Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, insights and improvements,” IEEE Trans. Affect. Comput., vol. 14,
2022, pp. 9100–9108. no. 2, pp. 1334–1350, 2023.
[229] D. Zhang, X. Ju, W. Zhang, J. Li, S. Li, Q. Zhu, and G. Zhou, “Multi- [247] Y. Li, J. Tao, B. Schuller, S. Shan, D. Jiang, and J. Jia, “Mec
modal multi-label emotion recognition with heterogeneous hierarchical 2017: Multimodal emotion recognition challenge,” in 2018 First Asian
message passing,” in Thirty-Fifth AAAI Conference on Artificial Intelli- Conference on Affective Computing and Intelligent Interaction (ACII
gence, AAAI 2021, Thirty-Third Conference on Innovative Applications Asia), 2018, pp. 1–5.
of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on [248] C. Busso, S. Parthasarathy, A. Burmania, M. AbdelWahab,
Educational Advances in Artificial Intelligence, EAAI 2021, Virtual N. Sadoughi, and E. M. Provost, “Msp-improv: An acted corpus of
Event, February 2-9, 2021, 2021, pp. 14 338–14 346. dyadic interactions to study emotion perception,” IEEE Transactions
[230] J. Zhao, Y. Zhao, and J. Li, “M3TR: multi-modal multi-label recog- on Affective Computing, vol. 8, no. 1, pp. 67–80, 2017.
nition with transformer,” in MM ’21: ACM Multimedia Conference, [249] M. Firdaus, H. Chauhan, A. Ekbal, and P. Bhattacharyya, “Meisd:
Virtual Event, China, October 20 - 24, 2021, 2021, pp. 469–477. A multimodal multi-label emotion, intensity and sentiment dialogue
[231] Z. Zhang, Z. Wang, X. Li, N. Liu, B. Guo, and Z. Yu, “Modalnet: an dataset for emotion recognition and sentiment analysis in conversa-
aspect-level sentiment classification model by exploring multimodal tions,” in Proceedings of the 28th international conference on compu-
data with fusion discriminant attentional network,” World Wide Web, tational linguistics, 2020, pp. 4441–4453.
vol. 24, pp. 1957–1974, 2021. [250] A. Jia, Y. He, Y. Zhang, S. Uprety, D. Song, and C. Lioma, “Beyond
[232] J. Yang, Y. Xiao, and X. Du, “Multi-grained fusion network emotion: A multi-modal dataset for human desire understanding,” in
with self-distillation for aspect-based multimodal sentiment analysis,” Proceedings of the 2022 Conference of the North American Chapter
Knowledge-Based Systems, vol. 293, p. 111724, 2024. of the Association for Computational Linguistics: Human Language
[233] A. Zadeh, R. Zellers, E. Pincus, and L. Morency, “Multimodal senti- Technologies, NAACL 2022, Seattle, WA, United States, July 10-15,
ment intensity analysis in videos: Facial gestures and verbal messages,” 2022, 2022, pp. 1512–1522.
IEEE Intell. Syst., vol. 31, no. 6, pp. 82–88, 2016. [251] L. Stappen, A. Baird, L. Christ, L. Schumann, B. Sertolli, E. Meßner,
[234] A. Zadeh, P. P. Liang, S. Poria, E. Cambria, and L. Morency, E. Cambria, G. Zhao, and B. W. Schuller, “The muse 2021 multi-
“Multimodal language analysis in the wild: CMU-MOSEI dataset modal sentiment analysis challenge: Sentiment, emotion, physiological-
and interpretable dynamic fusion graph,” in Proceedings of the 56th emotion, and stress,” in MuSe ’21: Proceedings of the 2nd on Multi-
Annual Meeting of the Association for Computational Linguistics, ACL modal Sentiment Analysis Challenge, Virtual Event, China, 24 October
2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, 2021, 2021, pp. 5–14.
I. Gurevych and Y. Miyao, Eds. Association for Computational [252] J. Sun, S. Han, Y.-P. Ruan, X. Zhang, S.-K. Zheng, Y. Liu, Y. Huang,
Linguistics, 2018, pp. 2236–2246. and T. Li, “Layer-wise fusion with modality independence modeling for
[235] W. Yu, H. Xu, F. Meng, Y. Zhu, Y. Ma, J. Wu, J. Zou, and K. Yang, multi-modal emotion recognition,” in Proceedings of the 61st Annual
“Ch-sims: A chinese multimodal sentiment analysis dataset with fine- Meeting of the Association for Computational Linguistics (Volume 1:
grained annotation of modality,” in Proceedings of the 58th annual Long Papers), 2023, pp. 658–670.
meeting of the association for computational linguistics, 2020, pp. [253] J. A. M. Correa, M. K. Abadi, N. Sebe, and I. Patras, “AMIGOS: A
3718–3727. dataset for affect, personality and mood research on individuals and
[236] A. Zadeh, Y. S. Cao, S. Hessner, P. P. Liang, S. Poria, and L.-P. groups,” IEEE Trans. Affect. Comput., vol. 12, no. 2, pp. 479–493,
Morency, “Cmu-moseas: A multimodal language dataset for spanish, 2021.
portuguese, german and french,” in Proceedings of the Conference on [254] Q. Zhang, J. Fu, X. Liu, and X. Huang, “Adaptive co-attention network
Empirical Methods in Natural Language Processing. Conference on for named entity recognition in tweets,” in Proceedings of the Thirty-
Empirical Methods in Natural Language Processing, vol. 2020, 2020, Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th
p. 1801. innovative Applications of Artificial Intelligence (IAAI-18), and the 8th
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 25
AAAI Symposium on Educational Advances in Artificial Intelligence [277] K. Ma, H. Huang, J. Chen, H. Chen, P. Ji, X. Zang, H. Fang,
(EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, 2018, C. Ban, H. Sun, M. Chen et al., “Beyond uncertainty: Evidential
pp. 5674–5681. deep learning for robust video temporal grounding,” arXiv preprint
[255] C. Xu, X. Luo, and D. Wang, “Mcpr: A chinese product review dataset arXiv:2408.16272, 2024.
for multimodal aspect-based sentiment analysis,” in International Con- [278] J. Gao, M. Chen, and C. Xu, “Vectorized evidential learning for
ference on Cognitive Computing. Springer, 2022, pp. 83–90. weakly-supervised temporal action localization,” IEEE Transactions on
[256] H. Yang, Y. Zhao, J. Liu, Y. Wu, and B. Qin, “MACSA: A multi- Pattern Analysis and Machine Intelligence, vol. 45, no. 12, pp. 15 949–
modal aspect-category sentiment analysis dataset with multimodal fine- 15 963, 2023.
grained aligned annotations,” CoRR, vol. abs/2206.13969, 2022. [279] Z. Aldeneh and E. M. Provost, “Using regional saliency for speech
[257] J. Zhou, J. Zhao, J. X. Huang, Q. V. Hu, and L. He, “Masad: A emotion recognition,” in 2017 IEEE International Conference on
large-scale dataset for multimodal aspect-based sentiment analysis,” Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans,
Neurocomputing, vol. 455, pp. 47–58, 2021. LA, USA, March 5-9, 2017. IEEE, 2017, pp. 2741–2745.
[258] M. Luo, H. Fei, B. Li, S. Wu, Q. Liu, S. Poria, E. Cambria, M.-L. [280] P. Li, Y. Song, I. McLoughlin, W. Guo, and L. Dai, “An attention
Lee, and W. Hsu, “Panosent: A panoptic sextuple extraction benchmark pooling based representation learning method for speech emotion
for multimodal conversational aspect-based sentiment analysis,” arXiv recognition,” in 19th Annual Conference of the International Speech
preprint arXiv:2408.09481, 2024. Communication Association, Interspeech 2018, Hyderabad, India,
[259] D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang, “Large-scale September 2-6, 2018, B. Yegnanarayana, Ed. ISCA, 2018, pp. 3087–
visual sentiment ontology and detectors using adjective noun pairs,” in 3091.
Proceedings of the 21st ACM international conference on Multimedia,
[281] G. Trigeorgis, F. Ringeval, R. Brueckner, E. Marchi, M. A. Nicolaou,
2013, pp. 223–232.
B. W. Schuller, and S. Zafeiriou, “Adieu features? end-to-end speech
[260] S. Li and W. Deng, “Deep facial expression recognition: A survey,”
emotion recognition using a deep convolutional recurrent network,”
IEEE transactions on affective computing, vol. 13, no. 3, pp. 1195–
in 2016 IEEE International Conference on Acoustics, Speech and
1215, 2020.
Signal Processing, ICASSP 2016, Shanghai, China, March 20-25, 2016.
[261] Y. Zhang, C. Wang, and W. Deng, “Relative uncertainty learning
IEEE, 2016, pp. 5200–5204.
for facial expression recognition,” Advances in Neural Information
Processing Systems, vol. 34, pp. 17 616–17 627, 2021. [282] S. Mirsamadi, E. Barsoum, and C. Zhang, “Automatic speech emotion
[262] J. Li, K. Jin, D. Zhou, N. Kubota, and Z. Ju, “Attention mechanism- recognition using recurrent neural networks with local attention,” in
based cnn for facial expression recognition,” Neurocomputing, vol. 411, 2017 IEEE International Conference on Acoustics, Speech and Signal
pp. 340–350, 2020. Processing, ICASSP 2017, New Orleans, LA, USA, March 5-9, 2017.
[263] A. H. Farzaneh and X. Qi, “Facial expression recognition in the wild IEEE, 2017, pp. 2227–2231.
via deep attentive center loss,” in Proceedings of the IEEE/CVF winter [283] Z. Zhao, Y. Zheng, Z. Zhang, H. Wang, Y. Zhao, and C. Li, “Ex-
conference on applications of computer vision, 2021, pp. 2402–2411. ploring spatio-temporal representations by integrating attention-based
[264] H. Wang, B. Li, S. Wu, S. Shen, F. Liu, S. Ding, and A. Zhou, bidirectional-lstm-rnns and fcns for speech emotion recognition,” in
“Rethinking the learning paradigm for dynamic facial expression 19th Annual Conference of the International Speech Communication
recognition,” in Proceedings of the IEEE/CVF conference on computer Association, Interspeech 2018, Hyderabad, India, September 2-6, 2018,
vision and pattern recognition, 2023, pp. 17 958–17 968. B. Yegnanarayana, Ed. ISCA, 2018, pp. 272–276.
[265] Z. Zhao and Q. Liu, “Former-dfer: Dynamic facial expression recog- [284] D. Luo, Y. Zou, and D. Huang, “Investigation on joint representation
nition transformer,” in Proceedings of the 29th ACM International learning for robust feature extraction in speech emotion recognition,”
Conference on Multimedia, 2021, pp. 1553–1561. in 19th Annual Conference of the International Speech Communication
[266] H. Li, M. Sui, Z. Zhu et al., “Nr-dfernet: Noise-robust network for dy- Association, Interspeech 2018, Hyderabad, India, September 2-6, 2018,
namic facial expression recognition,” arXiv preprint arXiv:2206.04975, B. Yegnanarayana, Ed. ISCA, 2018, pp. 152–156.
2022. [285] M. Jiménez-Guarneros and G. F. Pineda, “Cross-subject eeg-based
[267] Y. Wang, Y. Sun, W. Song, S. Gao, Y. Huang, Z. Chen, W. Ge, and emotion recognition via semisupervised multisource joint distribution
W. Zhang, “Dpcnet: Dual path multi-excitation collaborative network adaptation,” IEEE Trans. Instrum. Meas., vol. 72, pp. 1–11, 2023.
for facial expression representation learning in videos,” in Proceedings [286] Y. Peng, Y. Zhang, W. Kong, F. Nie, B. Lu, and A. Cichocki,
of the 30th ACM International Conference on Multimedia, 2022, pp. “S3 lrr: A unified model for joint discriminative subspace identification
101–110. and semisupervised EEG emotion recognition,” IEEE Trans. Instrum.
[268] Y. Liu, W. Wang, C. Feng, H. Zhang, Z. Chen, and Y. Zhan, “Ex- Meas., vol. 71, pp. 1–13, 2022.
pression snippet transformer for robust video-based facial expression [287] Y. Peng, W. Kong, F. Qin, F. Nie, J. Fang, B. Lu, and A. Cichocki,
recognition,” 2021. “Self-weighted semi-supervised classification for joint eeg-based emo-
[269] F. Ma, B. Sun, and S. Li, “Logo-former: Local-global spatio-temporal tion recognition and affective activation patterns mining,” IEEE Trans.
transformer for dynamic facial expression recognition,” in ICASSP Instrum. Meas., vol. 70, pp. 1–11, 2021.
2023-2023 IEEE International Conference on Acoustics, Speech and [288] X. Quan, Z. Zeng, J. Jiang, Y. Zhang, B. Lu, and D. Wu, “Physio-
Signal Processing (ICASSP). IEEE, 2023, pp. 1–5. logical signals based affective computing: A systematic review,” Acta
[270] H. Li, H. Niu, Z. Zhu, and F. Zhao, “Intensity-aware loss for dynamic Automatica Sinica, vol. 47, no. 8, pp. 1769–1784, 2021.
facial expression recognition in the wild,” 2022.
[289] V. Chaparro, A. Gomez, A. Salgado, O. L. Quintero, N. López, and
[271] H. Liu, R. An, Z. Zhang, B. Ma, W. Zhang, Y. Song, Y. Hu, W. Chen,
L. F. Villa, “Emotion recognition from EEG and facial expressions:
and Y. Ding, “Norface: Improving facial expression analysis by identity
a multimodal approach,” in 40th Annual International Conference of
normalization,” arXiv preprint arXiv:2407.15617, 2024.
the IEEE Engineering in Medicine and Biology Society, EMBC 2018,
[272] H. Huang, X. Qiao, Z. Chen, H. Chen, B. Li, Z. Sun, M. Chen, and
Honolulu, HI, USA, July 18-21, 2018. IEEE, 2018, pp. 530–533.
X. Li, “Crest: Cross-modal resonance through evidential deep learning
for enhanced zero-shot learning,” arXiv preprint arXiv:2404.09640, [290] Y. Huang, J. Yang, P. Liao, and J. Pan, “Fusion of facial expressions and
2024. EEG for multimodal emotion recognition,” Comput. Intell. Neurosci.,
[273] Z. Han, C. Zhang, H. Fu, and J. T. Zhou, “Trusted multi-view vol. 2017, pp. 2 107 451:1–2 107 451:8, 2017.
classification with dynamic evidential fusion,” IEEE transactions on [291] Q. Zhu, G. Lu, and J. Yan, “Valence-arousal model based emotion
pattern analysis and machine intelligence, vol. 45, no. 2, pp. 2551– recognition using eeg, peripheral physiological signals and facial
2566, 2022. expression,” in ICMLSC 2020: The 4th International Conference on
[274] H. Huang, Z. Liu, S. Letchmunan, M. Lin, M. Deveci, W. Pedrycz, Machine Learning and Soft Computing, Haiphong City, Viet Nam,
and P. Siarry, “Evidential deep partial multi-view classification with January 17-19, 2020. ACM, 2020, pp. 81–85.
discount fusion,” arXiv preprint arXiv:2408.13123, 2024. [292] H. Tang, W. Liu, W. Zheng, and B. Lu, “Multimodal emotion recog-
[275] H. Huang, C. Qin, Z. Liu, K. Ma, J. Chen, H. Fang, C. Ban, H. Sun, and nition using deep neural networks,” in Neural Information Processing
Z. He, “Trusted unified feature-neighborhood dynamics for multi-view - 24th International Conference, ICONIP 2017, Guangzhou, China,
classification,” arXiv preprint arXiv:2409.00755, 2024. November 14-18, 2017, Proceedings, Part IV, ser. Lecture Notes in
[276] J. Gao, M. Chen, and C. Xu, “Vectorized evidential learning for Computer Science, D. Liu, S. Xie, Y. Li, D. Zhao, and E. M. El-Alfy,
weakly-supervised temporal action localization,” IEEE Transactions on Eds., vol. 10637. Springer, 2017, pp. 811–819.
Pattern Analysis and Machine Intelligence, vol. 45, no. 12, pp. 15 949– [293] Y. Tan, Z. Sun, F. Duan, J. Solé-Casals, and C. F. Caiafa, “A mul-
15 963, 2023. timodal emotion recognition method based on facial expressions and
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 26
electroencephalography,” Biomed. Signal Process. Control., vol. 70, p. [309] J. Zhou, C. Ma, D. Long, G. Xu, N. Ding, H. Zhang, P. Xie, and
103029, 2021. G. Liu, “Hierarchy-aware global model for hierarchical text classifica-
[294] J. Vazquez-Rodriguez, G. Lefebvre, J. Cumin, and J. L. Crowley, tion,” in Proceedings of the 58th Annual Meeting of the Association
“Emotion recognition with pre-trained transformers using multimodal for Computational Linguistics, ACL 2020, Online, July 5-10, 2020,
signals,” in 10th International Conference on Affective Computing and D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault, Eds. Association
Intelligent Interaction, ACII 2022, Nara, Japan, October 18-21, 2022, for Computational Linguistics, 2020, pp. 1106–1117.
2022, pp. 1–8. [310] G. Hu, G. Lu, and Y. Zhao, “FSS-GCN: A graph convolutional
[295] G. Hu, G. Lu, and Y. Zhao, “Bidirectional hierarchical attention networks with fusion of semantic and structure for emotion cause
networks based on document-level context for emotion cause extrac- analysis,” Knowl. Based Syst., vol. 212, p. 106584, 2021.
tion,” in Findings of the Association for Computational Linguistics: [311] D. Ghosal, N. Majumder, S. Poria, N. Chhaya, and A. F. Gelbukh,
EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16- “Dialoguegcn: A graph convolutional neural network for emotion
20 November, 2021, 2021, pp. 558–568. recognition in conversation,” in Proceedings of the 2019 Conference on
[296] M. Li, H. Zhao, T. Gu, and D. Ying, “Experiencer-driven and Empirical Methods in Natural Language Processing and the 9th Inter-
knowledge-aware graph model for emotion-cause pair extraction,” national Joint Conference on Natural Language Processing, EMNLP-
Knowl. Based Syst., vol. 278, p. 110703, 2023. IJCNLP 2019, Hong Kong, China, November 3-7, 2019, K. Inui,
[297] W. Li, Y. Li, V. Pandelea, M. Ge, L. Zhu, and E. Cambria, “Ecpec: J. Jiang, V. Ng, and X. Wan, Eds. Association for Computational
Emotion-cause pair extraction in conversations,” IEEE Transactions on Linguistics, 2019, pp. 154–164.
Affective Computing, pp. 1–12, 2022. [312] M. Munezero, C. S. Montero, E. Sutinen, and J. Pajunen, “Are they
[298] B. Li, H. Fei, F. Li, T.-s. Chua, and D. Ji, “Multimodal emotion-cause different? affect, feeling, emotion, sentiment, and opinion detection in
pair extraction with holistic interaction and label constraint,” ACM text,” IEEE Trans. Affect. Comput., vol. 5, no. 2, pp. 101–111, 2014.
Trans. Multimedia Comput. Commun. Appl., aug 2024, just Accepted. [313] K. Cheng, Z. Yang, M. Zhang, and Y. Sun, “Uniker: A unified
[Online]. Available: https://doi.org/10.1145/3689646 framework for combining embedding and definite horn rule reasoning
[299] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image for knowledge graph inference,” in Proceedings of the 2021 Conference
pre-training for unified vision-language understanding and generation,” on Empirical Methods in Natural Language Processing, EMNLP 2021,
in International Conference on Machine Learning. PMLR, 2022, pp. Virtual Event / Punta Cana, Dominican Republic, 7-11 November,
12 888–12 900. 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds., pp. 9753–
9771.
[300] Y. Zeng, S. Mai, and H. Hu, “Which is making the contribution: Mod-
[314] W. Wang, H. Bao, L. Dong, J. Bjorck, Z. Peng, Q. Liu, K. Aggarwal,
ulating unimodal and cross-modal dynamics for multimodal sentiment
O. K. Mohammed, S. Singhal, S. Som, and F. Wei, “Image as a foreign
analysis,” in Findings of the Association for Computational Linguistics:
language: Beit pretraining for all vision and vision-language tasks,”
EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20
CoRR, vol. abs/2208.10442, 2022.
November, 2021, 2021, pp. 1262–1274.
[315] D. Hershcovich, S. Frank, H. Lent, M. de Lhoneux, M. Abdou,
[301] C. Fan, H. Yan, J. Du, L. Gui, L. Bing, M. Yang, R. Xu, and
S. Brandl, E. Bugliarello, L. Cabello Piqueras, I. Chalkidis, R. Cui,
R. Mao, “A knowledge regularized hierarchical approach for emotion
C. Fierro, K. Margatina, P. Rust, and A. Søgaard, “Challenges and
cause analysis,” in Proceedings of the 2019 Conference on Empirical
strategies in cross-cultural NLP,” in Proceedings of the 60th Annual
Methods in Natural Language Processing and the 9th International
Meeting of the Association for Computational Linguistics (Volume 1:
Joint Conference on Natural Language Processing, EMNLP-IJCNLP
Long Papers), S. Muresan, P. Nakov, and A. Villavicencio, Eds., May
2019, Hong Kong, China, November 3-7, 2019, 2019, pp. 5613–5623.
2022.
[302] R. Speer, J. Chin, and C. Havasi, “Conceptnet 5.5: An open multilin- [316] S. Hareli, K. Kafetsios, and U. Hess, “A cross-cultural study on emotion
gual graph of general knowledge,” in Proceedings of the Thirty-First expression and the learning of social norms,” Frontiers in psychology,
AAAI Conference on Artificial Intelligence, February 4-9, 2017, San vol. 6, p. 1501, 2015.
Francisco, California, USA, 2017, pp. 4444–4451. [317] M. Obrist, S. A. Seah, and S. Subramanian, “Talking about tactile
[303] E. Cambria, Q. Liu, S. Decherchi, F. Xing, and K. Kwok, “Senticnet 7: experiences,” in Proceedings of Human Factors in Computing Systems,
A commonsense-based neurosymbolic AI framework for explainable 2013, pp. 1659–1668.
sentiment analysis,” in Proceedings of the Thirteenth Language Re-
sources and Evaluation Conference, LREC 2022, Marseille, France, 20-
25 June 2022, N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri,
T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo,
J. Odijk, and S. Piperidis, Eds. European Language Resources
Association, 2022, pp. 3829–3839.
[304] H. Zhang, D. Khashabi, Y. Song, and D. Roth, “Transomcs: From
linguistic graphs to commonsense knowledge,” in Proceedings of the
Twenty-Ninth International Joint Conference on Artificial Intelligence,
IJCAI 2020, 2020, pp. 4004–4010.
[305] A. Bosselut, H. Rashkin, M. Sap, C. Malaviya, A. Celikyilmaz, and
Y. Choi, “COMET: commonsense transformers for automatic knowl-
edge graph construction,” in Proceedings of the 57th Conference of the
Association for Computational Linguistics, ACL 2019, Florence, Italy,
July 28- August 2, 2019, Volume 1: Long Papers, 2019, pp. 4762–4779.
[306] M. Sap, R. L. Bras, E. Allaway, C. Bhagavatula, N. Lourie, H. Rashkin,
B. Roof, N. A. Smith, and Y. Choi, “ATOMIC: an atlas of machine
commonsense for if-then reasoning,” in The Thirty-Third AAAI Confer-
ence on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative
Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth
AAAI Symposium on Educational Advances in Artificial Intelligence,
EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019,
2019, pp. 3027–3035.
[307] D. Li, Y. Li, J. Zhang, K. Li, C. Wei, J. Cui, and B. Wang, “C3KG:
A chinese commonsense conversation knowledge graph,” CoRR, vol.
abs/2204.02549, 2022.
[308] Z. Yang, D. Yang, C. Dyer, X. He, A. J. Smola, and E. H. Hovy, “Hi-
erarchical attention networks for document classification,” in NAACL
HLT 2016, The 2016 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language
Technologies, San Diego California, USA, June 12-17, 2016, K. Knight,
A. Nenkova, and O. Rambow, Eds. The Association for Computational
Linguistics, 2016, pp. 1480–1489.