Abstract—Multimodal affective computing has garnered increasing attention due to its broad applications in analyzing human behaviors and intentions, especially in text-dominated

seeks to develop models capable of interpreting and reasoning about sentiment or emotional states over multiple modalities. In its early stages, researchers of affective computing pre-
such as adapters [32], prompts [33], instruction-tuning [34] and in-context learning [35], [36]. More and more works in multimodal affective computing leverage these parameter-efficient transfer learning methods to transfer knowledge from pre-trained models (e.g., a unimodal or a multimodal pre-trained model) to downstream affective tasks and improve model performance by further fine-tuning the pre-trained model. For instance, Zou et al. [37] design a multimodal prompt Transformer (MPT) to perform cross-modal information fusion. UniMSE [38] proposes an adapter-based modal fusion method, which injects acoustic and visual signals into the T5 model to fuse them with multi-level textual information.

Multimodal affective computing encompasses tasks like sentiment analysis, opinion mining, and emotion recognition using modalities such as text, audio, images, video, physiological signals, and haptic feedback. This survey focuses mainly on three key modalities: natural language, visual signals, and vocal signals. We highlight four main tasks in this survey: Multimodal Sentiment Analysis (MSA), Multimodal Emotion Recognition in Conversation (MERC), Multimodal Aspect-Based Sentiment Analysis (MABSA), and Multimodal Multi-label Emotion Recognition (MMER). A considerable volume of studies exists in the field of multimodal affective computing, and several reviews have been published [15], [39]–[43]. However, these reviews primarily focus on specific affective computing tasks or a specific single modality, and overlook an overview of multimodal affective computing across multiple tasks as well as the consistencies and differences among these tasks.

The goal of this survey is twofold. First, it aims to provide a comprehensive overview of multimodal affective computing for beginners exploring deep learning in emotion analysis, detailing tasks, inputs, outputs, and relevant datasets. Second, it offers insights for researchers to reflect on past developments, explore future trends, and examine technical approaches, challenges, and research directions in areas such as multimodal sentiment analysis and emotion recognition.

tasks from an NLP perspective. Section X looks ahead to future work from three aspects: the unification of multimodal affective computing tasks, the incorporation of external knowledge, and affective computing with less-studied modalities. Lastly, Section XI concludes this survey and its contribution to the multimodal affective computing community.

III. MULTIMODAL AFFECTIVE COMPUTING TASKS

In this section, we show the definition of each task and discuss their application scenarios. Table I presents basic information, including task input, output, type, and parent task, for each of the four tasks.

A. Multimodal Sentiment Analysis

Multimodal sentiment analysis (MSA) [44] originates from the sentiment analysis (SA) task [45] and extends SA with multimodal input. As a key research topic for computers to understand human behaviors, the goal of MSA is to predict sentiment polarity and sentiment intensity based on multimodal signals [46]. The task can be framed as both a binary classification and a regression task.

1) Task Formalization: Given a multimodal signal $I_i = \{I_i^t, I_i^a, I_i^v\}$, we use $I_i^m$, $m \in \{t, a, v\}$, to represent the unimodal raw sequence drawn from the video fragment $i$, where $\{t, a, v\}$ denote the three types of modalities: text, acoustic and visual. Multimodal sentiment analysis aims to predict a real number $y_i^r \in \mathbb{R}$, where $y_i^r \in [-3, 3]$ reflects the sentiment strength. We feed $I_i$ as the model input and train a model to predict $y_i^r$.

2) Application Scenarios: We categorize multimodal sentiment analysis applications into key areas: social media monitoring, customer feedback, market research, content creation, healthcare, and product reviews. For example, analyzing sentiment in text, images, and videos on social media helps gauge public opinion and monitor brand perception, while analyzing multimedia product reviews can improve personalized recommendations and user satisfaction.
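To make the formalization above concrete, the sketch below is a minimal late-fusion baseline (not taken from any cited system): each modality sequence $I_i^m$ is encoded by its own recurrent encoder, the summaries are concatenated, and a regression head predicts a score in $[-3, 3]$. The feature dimensions and the use of GRU encoders are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LateFusionMSA(nn.Module):
    """Minimal MSA regressor: encode text/audio/vision sequences separately,
    concatenate the summaries, and predict a sentiment score in [-3, 3]."""

    def __init__(self, dims, hidden=128):
        super().__init__()
        # One GRU per modality; the final hidden state summarizes the sequence.
        self.encoders = nn.ModuleDict(
            {m: nn.GRU(d, hidden, batch_first=True) for m, d in dims.items()}
        )
        self.head = nn.Sequential(
            nn.Linear(len(dims) * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, inputs):
        # inputs[m]: (batch, seq_len_m, dims[m]) for m in {"t", "a", "v"}
        summaries = []
        for m, enc in self.encoders.items():
            _, h_n = enc(inputs[m])              # h_n: (1, batch, hidden)
            summaries.append(h_n.squeeze(0))
        fused = torch.cat(summaries, dim=-1)      # (batch, 3 * hidden)
        return 3.0 * torch.tanh(self.head(fused)).squeeze(-1)  # bounded to [-3, 3]

# Toy usage: feature sizes (768/74/35) are placeholder values for text/audio/vision.
model = LateFusionMSA({"t": 768, "a": 74, "v": 35})
batch = {"t": torch.randn(4, 20, 768), "a": torch.randn(4, 50, 74), "v": torch.randn(4, 30, 35)}
loss = nn.L1Loss()(model(batch), torch.tensor([1.2, -0.4, 2.7, 0.0]))  # MAE objective
```

In practice, the input sequences would come from the feature extractors described in Section IV, and MAE (L1) is the usual training objective for this regression formulation.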
TABLE I
THE DETAILS OF MULTIMODAL AFFECTIVE TASKS. T, A, V DENOTE TEXT, AUDIO AND VISUAL MODALITIES, RESPECTIVELY.
Here, $y_i$ indicates the $i$th utterance's emotion category, which is predefined.

2) Application Scenarios: Multimodal Emotion Recognition in Conversation (MERC) has broad applications across key areas: human-computer interaction, virtual assistants, healthcare, and customer service. (i) In Human-Computer Interaction, MERC enhances user experience by enabling systems to recognize and respond to emotional states, leading to more personalized interactions. (ii) Virtual Assistants and Chatbots benefit from improved emotional understanding, making conversations more natural and engaging. (iii) In Customer Service, MERC helps agents better respond to customer emotions, enhancing satisfaction. Additionally, bio-sensing systems measuring physiological signals like ECG, PPG, EEG, and GSR expand MERC applications in robotics, healthcare, and virtual reality.

C. Multimodal Aspect-based Sentiment Analysis

Xu et al. [50] are among the first to put forward the new task of aspect-based multimodal sentiment analysis. Multimodal aspect-based sentiment analysis (MABSA) is constructed based on aspect-based sentiment analysis in texts [51], [52]. In contrast with MSA and ERC, multimodal aspect-based sentiment analysis operates on fine-grained multimodal signals. MABSA receives the text and vision (image) modalities as inputs and outputs tuples consisting of an aspect and its sentiment polarity. The task can be viewed as a classification, tuple extraction, or triple extraction task. Recently, MABSA has attracted increasing attention. Given an image and the corresponding text, MABSA is defined as jointly extracting all aspect terms from image-text pairs and predicting their sentiment polarities, i.e., positive, negative and neutral.

1) Task Formalization: Suppose the multimodal inputs include a textual content $T = \{w_1, w_2, \ldots, w_L\}$ and an image set $I = \{I_1, I_2, \cdots, I_K\}$; the goal of MABSA is to predict the sentiment polarities with a given aspect phrase $A = \{a_1, a_2, \cdots, a_N\}$, where $a_i$ denotes the $i$th aspect (e.g., food), $L$ is the length of the textual context, $K$ is the number of images, and $N$ is the length of the aspect phrase.

2) Application Scenarios: Multimodal aspect-based sentiment analysis (MABSA) focuses on improving products and services by analyzing reviews across text, images, and videos to identify customer opinions on specific aspects. For example, MABSA can assess dining experiences, like food quality or service, to enhance restaurant operations. It also applies to social media, where analyzing mixed content provides deeper insights into public opinion, aiding better decision-making and marketing strategies.

D. Multimodal Multi-label Emotion Recognition

Multimodal signals may express more than one emotion label, which has given rise to a new task: multimodal multi-label emotion recognition (MMER). MMER inherits the characteristics of multimodal emotion recognition and multi-label classification [54], [55]. MMER is developed from multi-label emotion recognition, which predicts two or more basic emotion categories for the given multimodal information, making it a multi-label multi-class classification task.

1) Task Formalization: Given a multimodal signal $I_i = \{I_i^t, I_i^a, I_i^v\}$, $I_i$ contains three types of modalities: text, audio and visual. Formally, we use $I_i^m \in \mathbb{R}^{d_m \times l_m}$, $m \in \{t, a, v\}$, to represent the raw sequence of the text, audio, and visual modalities from sample $i$, where $d_m$ and $l_m$ denote the feature dimension and sequence length of modality $m$. The goal of MMER is to recognize one or more emotion categories from the predefined label space $Y = \{y_1, y_2, \cdots, y_{|L|}\}$ of size $|L|$ according to the multimodal signal $I_i$.

2) Application Scenarios: Multimodal multi-label emotion recognition seeks to create AI systems that can understand and categorize emotions expressed through various modalities simultaneously. This task is challenging due to the complexity and variability of human emotions, differences in emotional expression across individuals and cultures, and the need for effective integration of diverse modalities.

IV. MODAL FEATURE EXTRACTOR

For multimodal affective computing tasks, the model input typically includes at least two modalities. In this section, we introduce the common feature extractors that transform raw sequences into feature vectors.

a) Text Feature Extractor: For the text modality, researchers adopt static word embedding methods like Word2Vec [56] and GloVe [57] to initialize word representations. Text can also be encoded into feature vectors through pre-trained language models like BERT [58], BART [59], and T5 [60]. More recently, a collection of foundation language models like LLaMA [61], [62] and Mamba [63] have emerged and are used for encoding the text modality.

b) Audio Feature Extractor: For the audio modality, the raw acoustic input needs to be processed into numerical sequential vectors. The common way is to use librosa (https://github.com/librosa/librosa) to extract the Mel-spectrogram as audio features. It is the short-term power spectrum of sound and is widely used in modern audio processing.
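As a concrete illustration of the librosa-based pipeline mentioned above, the snippet below converts a waveform into a log-Mel-spectrogram that can serve as the acoustic sequence $I_i^a$; the 16 kHz sampling rate, 128 Mel bands, and hop length are illustrative choices rather than values prescribed by the surveyed works, and the file name is hypothetical.

```python
import librosa
import numpy as np

def log_mel_features(path, sr=16000, n_mels=128, hop_length=160):
    """Load an audio file and return a (n_mels, n_frames) log-Mel-spectrogram."""
    y, sr = librosa.load(path, sr=sr)                       # resample to 16 kHz
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)
    log_mel = librosa.power_to_db(mel, ref=np.max)          # power spectrum to dB
    return log_mel.astype(np.float32)

# The resulting frames become the acoustic token sequence fed to the model:
# feats = log_mel_features("utterance_001.wav")  # hypothetical file name
```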
Transformer structures have achieved tremendous success in the fields of NLP and computer vision. Gong et al. [64] propose the audio spectrogram Transformer (AST), which converts the waveform into a sequence of 128-dimensional log Mel filterbank (fbank) features to encode the audio modality.

c) Vision Feature Extractor: For the image modality, researchers can extract a fixed number of T frames from each segment and use EfficientNet [65] pre-trained (supervised) on VGGFace (https://www.robots.ox.ac.uk/~vgg/software/vgg_face/) and the AFEW dataset as the initial vision representation. Furthermore, Dosovitskiy et al. [66] propose applying a standard Transformer directly to images, splitting an image into patches and providing the sequence of linear embeddings of these patches as input to a Transformer. CLIP [67] jointly trains on images and their captions with contrastive learning, thereby extracting vision features that correspond to texts.

d) Multimodal Feature Extractor: The emergence of multimodal pre-trained models (MPMs) marks a significant advancement in integrating multimodal signals, as demonstrated by groundbreaking developments like GPT-4 [68] and Gemini [69]. Among the open-source innovations, Flamingo [70] represents an early effort to integrate visual features with LLMs using cross-attention layers. BLIP-2 [71] introduces a trainable adaptor module (Q-Former) that efficiently connects a pre-trained image encoder with a pre-trained LLM, ensuring precise alignment of visual and textual information. Similarly, MiniGPT-4 [72] achieves visual and textual alignment through a linear projection layer. InstructBLIP [73] advances the field by focusing on vision-language instruction tuning, building upon BLIP-2, and requiring a deeper understanding and larger datasets for effective training. LLaVA [74] integrates CLIP's image encoder with LLaMA's language decoder to enhance instruction tuning capabilities. Akbari et al. [31] train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance on the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval. Based on multimodal pre-trained models, raw modal signals can be used to extract modal features.

V. MULTIMODAL LEARNING ON MULTIMODAL AFFECTIVE COMPUTING

Multimodal learning involves learning representations from different modalities. Generally, the multimodal model should first align the modalities based on their semantics before fusing the multimodal signals. After alignment, the model combines the multiple modalities into one representation vector.

A. Preliminary

With the scaling of pre-trained models, parameter-efficient transfer learning methods have emerged, such as adapters [32], prompts [33], instruction-tuning [34] and in-context learning [35], [36]. In this paradigm, instead of adapting pre-trained LMs to downstream tasks via objective engineering, downstream tasks are reformulated to look more like those solved during the original LM training with the help of prompts, instruction-tuning and in-context learning. The use of prompts in Vision Language Models (VLMs) like GPT-4V [68] and Flamingo [70] allows the models to interpret and generate outputs based on combined visual and textual inputs. Instruction-tuning, in turn, builds on the prompting paradigm. Models like InstructBLIP [73] and FLAN [75] have demonstrated that instruction-tuning not only improves the model's adherence to instructions but also enhances its ability to generalize across tasks. In the community of multimodal affective computing, researchers can leverage these parameter-efficient transfer learning methods (e.g., adapters, prompts and instruction tuning) to transfer knowledge from pre-trained models (e.g., unimodal or multimodal pre-trained models) to downstream affective tasks, and further tune the pre-trained model with the affective dataset. Considering that multimodal affective computing involves multimodal learning, we analyze multimodal affective computing works from the perspectives of multimodal fusion and multimodal alignment, as shown in Fig. 1.

B. Multimodal Fusion

Multimodal signals are heterogeneous and derived from various information sources, making integrating multimodal signals into one representation essential. Tsai et al. [77] summarize multimodal fusion into early, late or intermediate fusion based on the fusion stage. Early fusion combines features from different modalities at the input level before the model processes them. Late fusion processes features from different modalities separately through individual sub-networks, and the outputs of these sub-networks are combined at a later stage, typically just before making the final decision. Late fusion uses unimodal decision values and combines them using mechanisms such as averaging [124], voting schemes [125], weighting based on channel noise [126] and signal variance [127], or a learned model [6], [128]. The two fusion strategies face some problems: early fusion at the feature level can underrate intra-modal dynamics after the fusion operation, while late fusion at the decision level may struggle to capture inter-modal dynamics before the fusion operation. Different from the previous two methods, by combining features from different modalities at intermediate layers of the model, intermediate fusion allows for more interaction between the modalities at different processing stages, potentially leading to richer representations [38], [129], [130]. Based on these fusion strategies, we review multimodal fusion from three aspects: cross-modality learning, modal consistency and difference, and multi-stage modal fusion. Fig. 2 illustrates the three aspects of modal fusion.

1) Cross-modality Learning: Cross-modality learning focuses on the incorporation of inter-modality dependencies and interactions for better modal fusion in representation learning. Early works on multimodal fusion [76] mainly operate geometric manipulations in the feature spaces to fuse multiple modalities. The recent common way of cross-modality learning is to introduce attention-based learning methods to model inter-modality and intra-modality interactions.
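The sketch below illustrates the basic cross-modal attention operation that such attention-based methods build on: one modality (here text) queries another (here audio) with standard multi-head attention so that every text position gathers audio context. It is a generic illustration of the mechanism, not the architecture of MuLT [77] or any other specific model listed in Fig. 1, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Text-queries-audio cross-attention followed by a feed-forward layer."""

    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, text, audio):
        # text: (batch, L_t, d_model), audio: (batch, L_a, d_model); lengths may differ.
        ctx, _ = self.attn(query=text, key=audio, value=audio)  # (batch, L_t, d_model)
        x = self.norm1(text + ctx)          # residual: text enriched with audio context
        return self.norm2(x + self.ff(x))

# Toy usage with unaligned sequence lengths (20 text tokens, 50 audio frames).
block = CrossModalBlock()
out = block(torch.randn(2, 20, 128), torch.randn(2, 50, 128))  # -> (2, 20, 128)
```

Stacking such blocks in both directions (text-to-audio, audio-to-text, and so on) is the typical way the works below model pairwise inter-modality interactions.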
Multimodal Learning on Multimodal Affective Computing
  Multimodal Fusion (§V-B):
    Cross-modal Learning: TFN [76], MuLT [77], TCDN [78], CM-BERT [79], HGraph-CL [80], BAFN [81], TeFNA [82], CMCF-SRNet [83], MultiEMO [84], MM-RBN [85], MAGDRA [86], AMuSE [87].
    Modal Consistency and Difference: MMIM [88], MPT [89], MMMIE [90], MISA [91], CoolNet [92], ModalNet [93], MAN [87], TAILOR [94], AMP [95], STCN [96].
    Multi-stage Modal Fusion: TSCL-FHFN [97], HFFN [98], CLMLF [99], RMFN [100], CTFN [101], MCM [102], FmlMSN [103], ScaleVLAD [104], MUG [105], HFCE [106], MTAG [107], CHFusion [108].
  Multimodal Alignment (§V-C):
    Miss Modality: MMIN [109], CMAL [110], M2R2 [111], EMMR [112], TFR-Net [113], MRAN [114], VIGAN [115], TATE [116], IF-MMIN [117], CTFN [101], MTMSA [118], FGR [119], MMTE+AMMTD [120].
    Semantic Alignment: MuLT [77], ScaleVLAD [104], Robust-MSA [121], HGraph-CL [80], SPIM [122], MA-CMU-SGRNet [123].
Fig. 1. Taxonomy of multimodal affective computing from multimodal fusion and multimodal alignment.
Fig. 2. Illustration of multimodal fusion from the following aspects: 1) cross-modality modal fusion, 2) modal fusion based on modal consistency and difference, and 3) multi-stage modal fusion.
For example, MuLT [77] proposes a multimodal Transformer to learn inter-modal interaction. Chen et al. [78] augment the inter-intra modal features and unify the characteristics of the three modals (inter-modal). Yang et al. [79] propose the cross-modal BERT (CM-BERT), aiming to model the interaction of the text and audio modalities based on a pre-trained BERT model. Lin et al. [80] explore the intricate relations of intra- and inter-modal representations for sentiment extraction. More recently, Tang et al. [81] propose a multimodal dynamic enhanced block to capture the intra-modality sentiment context, which decreases the intra-modality redundancy of auxiliary modalities. Huang et al. [82] propose a text-centered fusion network with cross-modal attention (TeFNA), a multimodal fusion network that uses cross-modal attention to model unaligned multimodal timing information. In the community of emotion recognition, CMCF-SRNet [83] is a cross-modality context fusion and semantic refinement network, which contains a cross-modal locality-constrained transformer and a graph-based semantic refinement transformer, aiming to explore the multimodal interactions and dependencies among utterances. Shi et al. [84] propose an attention-based correlation-aware multimodal fusion framework, MultiEMO, which captures cross-modal mapping relationships across the textual, audio and visual modalities based on bidirectional multi-head cross attention layers. In summary, cross-modality learning mainly focuses on modeling the relations between modalities.

2) Modal Consistency and Difference: Modal consistency refers to the shared feature space across different modalities for the same sample, while modal difference highlights the unique information each modality provides. Most multimodal fusion approaches separate representations into modal-invariant (consistency) and modal-specific (difference) components. Modal consistency helps handle missing modalities, while modal difference leverages complementary information from each modality to improve overall data understanding. For example, several works [89], [90] have explored learning modal consistency and difference using contrastive learning. Han et al. [88] maximize the mutual information between modality pairs, and between each modality and the fused representation, to explore modal consistency. Another study [89] proposes a hybrid contrastive learning framework that performs intra-/inter-modal contrastive learning and semi-contrastive learning simultaneously, models cross-modal interactions, preserves inter-class relationships, and reduces the modality gap. Additionally, Zheng et al. [90] combine mutual information maximization between modal pairs with mutual information minimization between the input data and the corresponding features. This method aims to extract modal-invariant and task-related information. Modal consistency can also be viewed as the process of projecting multiple modalities into a common latent space (modality-invariant representation), while modal difference refers to projecting modalities into modality-specific representation spaces. For example, Hazarika et al. [91] propose a method that projects each modality into both a modality-invariant and a modality-specific space. They implement a decoder to reconstruct the original modal representation using both modality-invariant and modality-specific features. AMuSE [87] proposes a multimodal attention network to capture cross-modal interactions at various levels of spatial abstraction by jointly learning mode-specific peripheral networks and a central network. For fine-grained sentiment analysis, Xiao et al. [92] present CoolNet to boost the performance of visual-language models in seamlessly integrating vision and language information. Zhang et al. [93]
unified deep learning framework to efficiently handle missing labels and missing modalities for audio-visual emotion recognition through correlation analysis. Zeng et al. [116] propose a tag-assisted Transformer encoder (TATE) network to handle the problem of missing uncertain modalities, which designs a tag encoding module to cover both the single-modality and multiple-modality missing cases, so as to guide the network's attention to those missing modalities. Zuo et al. [117] propose to use invariant features for a missing modality imagination network (IF-MMIN), which includes an invariant feature learning strategy and an invariant-feature-based imagination module (IF-IM). Through the two strategies, IF-MMIN can alleviate the modality gap during missing-modality prediction, thus improving the robustness of the multimodal joint representation. Zhou et al. [119] propose a novel brain tumor segmentation network for the case of one or more missing modalities; the proposed network consists of three sub-networks: a feature-enhanced generator, a correlation constraint block and a segmentation network. The last group is translation-based methods. Tang et al. [101] propose the coupled-translation fusion network (CTFN) to model bi-directional interplay via couple learning, ensuring robustness with respect to missing modalities. Liu et al. [118] propose a modality translation-based MSA model (MTMSA), which is robust to uncertain missing modalities. In summary, the works on alignment for missing modalities focus on missing-modality reconstruction and learning based on the available modal information.

2) Alignment for Cross-modal Semantics: Semantic alignment aims to find the connection between multiple modalities in one sample, which refers to searching for one modality's information through another modality's information and vice versa. In the field of MSA, Tsai et al. [77] leverage cross-modality and multi-scale modal alignment to implement modal consistency in the semantic aspects. ScaleVLAD [202] proposes a fusion model to gather multi-scale representations from text, video, and audio with shared vectors of locally aggregated descriptors to improve unaligned multimodal sentiment analysis. Yang et al. [107] convert unaligned multimodal sequence data into a graph with heterogeneous nodes and edges that captures the rich interactions across modalities and through time. Lee et al. [203] segment the audio and the underlying text signals into an equal number of steps in an aligned way, so that the same time steps of the sequential signals cover the same time span in the signals. Zong et al. [204] exploit multiple bi-directional translations, leading to double multimodal fusion embeddings compared with traditional translation methods. Wang et al. [205] propose a multimodal encoding–decoding translation network with a transformer and adopt a joint encoding–decoding method with text as the primary information and sound and image as the secondary information. Zhang et al. [123] propose a novel multi-level alignment to bridge the gap between the acoustic and lexical modalities, which can effectively contrast both the instance-level and prototype-level relationships, separating the multimodal features in the latent space. Yu et al. [206] propose an unsupervised approach which minimizes the Wasserstein distance between both modalities, forcing both encoders to produce more appropriate representations for the final extraction to align text and image. Lai et al. [122] propose a deep modal shared information learning module based on the covariance matrix to capture the shared information between modalities. Additionally, they use a label generation module based on a self-supervised learning strategy to capture the private information of the modalities. Their module is plug-and-play in multimodal tasks, and by changing the parameterization, it can adjust the information exchange relationship between the modalities and learn the private or shared information between the specified modalities. They also employ a multi-task learning strategy to help the model focus its attention on the modal differentiation training data. For model robustness, Robust-MSA [121] presents an interactive platform that visualizes the impact of modality noise to help researchers improve model capacity.

VI. MODELS ACROSS MULTIMODAL AFFECTIVE COMPUTING

In the community of multimodal affective computing, the works show significant consistency in terms of their technical development routes. For clarity, we group these works based on multitask learning, pre-trained models, enhanced knowledge, and contextual information. Meanwhile, we briefly summarize the advancements on the MSA, MERC, MABSA and MMER tasks through the above four aspects. Fig. 4 summarizes the typical works of multimodal affective computing from these aspects and Table II shows the taxonomy of multimodal affective computing.

A. Multitask Learning

Multitask learning trains a model on multiple related tasks simultaneously, using shared information to enhance performance. The loss function combines losses from all tasks, with model parameters updated via gradient descent. In multimodal affective computing, multitask learning helps distinguish between modal-invariant and modal-specific features and integrates emotion-related sub-tasks into a unified framework. Fig. 5 shows the learning paradigm of multitask learning in multimodal affective learning tasks.

1) Multimodal Sentiment Analysis: In the field of multimodal sentiment analysis, Self-MM [136] generates a pseudo-label [207]–[209] for each single modality and then jointly trains unimodal and multimodal representations based on the generated and original labels. Furthermore, ARGF, a translation framework between modalities (i.e., translating from one modality to another), is used as an auxiliary task to regularize the multimodal representation learning [137]. Akhtar et al. [138] leverage the interdependence of the sentiment and emotion tasks to improve model performance on both tasks. Chen et al. [139] propose a video-based cross-modal auxiliary network (VCAN), which is comprised of an audio feature map module and a cross-modal selection module to make use of auxiliary information. Zheng et al. [140] propose a disentanglement translation network (DTN) with slack reconstruction to capture desirable information properties, obtain a unified feature distribution and reduce redundancy. Zheng et al. [90] combine mutual information maximization (MMMIE) between modal pairs with mutual information minimization between
Multimodal Affective Computing
  Multitask Learning (§VI-A):
    MSA (§VI-A1): Self-MM [136], ARGF [137], MultiSE [138], VCAN [139], DTN [140], MMMIE [90], MMIM [88], MISA [91].
    MERC (§VI-A2): FacialMMT [25], MMMIE [90], AuxEmo [141], TDFNet [142], MALN [143], LGCCT [144], MultiEMO [84], RLEMO [145].
    MABSA (§VI-A3): CMMT [146], AbCoRD [147], JML [148], MPT [37], MMRBN [85].
    MMER (§VI-A4): AMP [95], MEGLN-LDA [149], MultiSE [138], AMP [95].
  Pre-trained Model (§VI-B):
    MSA (§VI-B1): MAG-XLNet [22], UniMSE [38], AOBERT [150], SKESL [151], TEASAL [152], TO-BERT [153], SPT [154], ALMT [155].
    MERC (§VI-B2): FacialMMT [25], QAP [20], UniMSE [38], GraphSmile [156].
    MABSA (§VI-B3): MIMN [25], GMP [18], ERUP [157], VLP-MABSA [158], DR-BERT [159], DTCA [160], MSRA [161], AOF-ABSA [162], AD-GCFN [163], MOCOLNet [164].
  Enhanced Knowledge (§VI-C):
    MSA (§VI-C1): TETFN [165], ITP [19], SKEAFN [166], SAWFN [167], MTAG [107].
    MERC (§VI-C2): ConSK-GCN [168], DMD [169], MRST [170], SF [171], TGMFN [172], RLEMO [145], DEAN [173].
    MABSA (§VI-C3): KNIT [174], FITE [175], CoolNet [176], HIMT [177].
    MMER (§VI-C4): UniVA-RoBERTa [178], CARAT [179], M3TR [180], MAGDRA [86], HHMPN [181].
  Contextual Information (§VI-D):
    MSA (§VI-D1): MuLT [77], CIA [182], CAT-LSTM [183], CAMFNet [184], MTAG [107], CTNet [185], ScaleVLAD [104], MMML [186], GFML [186], CHFusion [108].
    MERC (§VI-D2): CMCF-SRNet [83], MMGCN [187], MM-DFN [188], SAMGN [189], M3Net [190], M3GAT [187], RL-EMO [145], SCMFN [191], EmoCaps [192], GA2MIF [193], MALN [143], COGMEN [49].
    MABSA (§VI-D3): DTCA [160], MCPR [194], Elbphilharmonie [195], M2DF [196], AoM [197], FGSN [198], MIMN [16].
Fig. 4. Taxonomy of multimodal affective computing works from the aspects of multitask learning, pre-trained model, enhanced knowledge and contextual information.
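As a generic illustration of the multitask setups grouped under §VI-A (and not the exact procedure of Self-MM [136] or any other model in Fig. 4), the sketch below shares one encoder between a sentiment-regression head and an emotion-classification head and optimizes a weighted sum of the two task losses; the loss weights and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class MultitaskAffectModel(nn.Module):
    """Shared encoder with a sentiment-regression head and an emotion-classification head."""

    def __init__(self, in_dim=384, hidden=128, n_emotions=6):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.sent_head = nn.Linear(hidden, 1)          # sentiment strength (regression)
        self.emo_head = nn.Linear(hidden, n_emotions)  # emotion logits (classification)

    def forward(self, fused):
        h = self.shared(fused)
        return self.sent_head(h).squeeze(-1), self.emo_head(h)

model = MultitaskAffectModel()
fused = torch.randn(8, 384)                    # pre-fused multimodal features (placeholder)
sent_gold = torch.randn(8)                     # sentiment scores in [-3, 3]
emo_gold = torch.randint(0, 6, (8,))           # emotion category ids
sent_pred, emo_logits = model(fused)
# Weighted sum of task losses; gradients update the shared parameters jointly.
loss = 1.0 * nn.L1Loss()(sent_pred, sent_gold) + 0.5 * nn.CrossEntropyLoss()(emo_logits, emo_gold)
loss.backward()
```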
TABLE II
TAXONOMY OF MULTIMODAL AFFECTIVE COMPUTING WORKS ACROSS THE MSA, MERC, MABSA AND MMER TASKS.
3) Multimodal Aspect-based Sentiment Analysis: In the study of multimodal aspect-based sentiment analysis, Yu et al. [160] propose an unsupervised approach which minimizes the Wasserstein distance between both modalities, forcing both encoders to produce more appropriate representations for the final extraction. Xu et al. [194] design and construct a multimodal Chinese product review dataset (MCPR) to support the research of MABSA. Anschutz et al. [195] report the results of an empirical study on how semantic computing can provide insights into user-generated content for domain experts; in addition, this work discusses different image-based aspect retrieval and aspect-based sentiment analysis approaches to handle and structure large datasets. Zhao et al. [196] borrow the idea of curriculum learning and propose a multi-grained multi-curriculum denoising framework (M2DF) to adjust the order of the training data, so as to obtain more contextual information. Zhou et al. [197] propose an aspect-oriented method (AoM) to detect aspect-relevant semantic and sentiment information; specifically, an aspect-aware attention module is designed to simultaneously select textual tokens and image blocks that are semantically related to the aspects. Zhao et al. [198] propose a fusion with GCN and SE-ResNeXt network (FGSN), which constructs a graph convolution network on the dependency tree of sentences to obtain the context representation and aspect word representations by using syntactic information and word dependencies.

4) Multimodal Multi-label Emotion Recognition: MMS2S [199] is a multimodal sequence-to-set approach to effectively model label dependence and modality dependence. MESGN [200] first proposed this task, simultaneously modeling the modality-to-label and label-to-label dependencies. Many works consider the dependencies of multiple labels based on the characteristics of co-occurring labels. Zhao et al. [201] propose a general multimodal dialogue-aware interaction framework, named MDI, to model the impact of dialogue context on emotion recognition.

VII. DATASETS OF MULTIMODAL AFFECTIVE COMPUTING

In this section, we introduce the benchmark datasets of the MSA, MERC, MABSA, and MMER tasks. To facilitate easy navigation and reference, the details of the datasets are shown in Table III with a comprehensive overview of the studies that we cover.

A. Multimodal Sentiment Analysis

• MOSI [237] contains 2,199 utterance video segments, and each segment is manually annotated with a sentiment score ranging from -3 to +3 to indicate the sentiment polarity and relative sentiment strength of the segment.
• MOSEI [238] is an upgraded version of MOSI, annotated with both sentiment and emotion. MOSEI contains 22,856 movie review clips from YouTube. Each sample in MOSEI includes sentiment annotations ranging from -3 to +3 and multi-label emotion annotations.
• CH-SIMS [239] is a Chinese single- and multimodal sentiment analysis dataset, which contains 2,281 refined video segments in the wild with both multimodal and independent unimodal annotations.
• CH-SIMS v2.0 [242] is an extended version of CH-SIMS that includes more data instances, spanning the text, audio and visual modalities. Each modality of a sample is annotated with a sentiment polarity, and the sample is then annotated with an overall sentiment.
• CMU-MOSEAS [240] is the first large-scale multimodal language dataset for Spanish, Portuguese, German and French; it is collected from YouTube and its samples are 4,000 in total.
• ICT-MMMO [241] is collected from online social review videos that encompass a strong diversity in how people express opinions about movies and include real-world variability in video recording quality (http://multicomp.ict.usc.edu).
• YouTube [46] collects 47 videos from the social media web site YouTube. Each video contains 3-11 utterances, with most videos having 5-6 utterances in the extracted 30 seconds.

B. Multimodal Emotion Recognition in Conversation

• MELD [243] contains 13,707 video clips of multi-party conversations, with labels following Ekman's six universal emotions, including joy, sadness, fear, anger, surprise and disgust.
• IEMOCAP [244] has 7,532 video clips of multi-party conversations, with labels following Ekman's six universal emotions, including joy, sadness, fear, anger, surprise and disgust.
• HED [245] contains happy, sad, disgust, angry and scared emotion-aligned face, body and text samples, which are much larger than existing datasets. Moreover, the emotion labels were attached to those samples by strictly following a standard psychological paradigm.
• RML [246] collects video samples from eight subjects speaking six different languages: English, Mandarin, Urdu, Punjabi, Persian, and Italian. This dataset contains 500 video samples, each delivered with one of six particular emotions.
• BAUM-1 [247] contains two sets: the BAUM-1a and BAUM-1s databases. The BAUM-1a database contains clips with expressions of five basic emotions (happiness, sadness, anger, disgust, fear) along with expressions of boredom, confusion (unsure) and interest (curiosity). The BAUM-1s database contains clips reflecting six basic emotions and also expressions of boredom, contempt, confusion, thinking, concentrating, bothered, and neutral.
• MAHNOB-HCI [248] includes 527 facial video recordings of 27 participants engaged in various tasks and interactions, together with physiological signals such as 32-channel electroencephalogram (EEG) and 3-channel electrocardiogram.
TABLE III
LIST OF MULTIMODAL AFFECTIVE COMPUTING DATASETS. T, A, V DENOTE THE TEXT, AUDIO AND VISION MODALITIES, RESPECTIVELY. EMOTION DENOTES THAT THE SAMPLES IN THE DATASET ARE LABELED WITH EMOTION CATEGORIES, AND SENTIMENT DENOTES THAT THE SAMPLES IN THE DATASET ARE LABELED WITH SENTIMENT POLARITY.
• Deap [249] contains data from 32 participants, aged between 19 and 37 (50% female), who were recorded watching 40 one-minute music videos. Each participant was asked to evaluate each video by assigning values from 1 to 9 for arousal, valence, dominance, like/dislike, and familiarity.
• MuSe-CaR [250] focuses on the tasks of emotion, emotion-target engagement, and trustworthiness recognition by means of comprehensively integrating the audio-visual and language modalities.
• CHEAVD 2.0 [251] is selected from Chinese movies, soap operas and TV shows, and contains noise in the background to mimic real-world conditions.
• MSP-IMPROV [252] is a multimodal emotional database comprised of spontaneous dyadic interactions, designed to study the audiovisual perception of expressive behaviors.
• MEISD [253] is a large-scale balanced multimodal multi-label emotion, intensity, and sentiment dialogue dataset collected from different TV series, with textual, audio, and visual features.
• MESD [254] is the first multimodal and multi-task sentiment, emotion, and desire dataset, which contains 9,190 text-image pairs with English text.
• Ulm-TSST [255] is a multimodal dataset where participants were recorded in a stressful situation emulating a job interview, following the TSST protocol.
• CHERMA [256] provides uni-modal labels for each individual modality, and multi-modal labels for all modalities jointly observed. It is collected from various sources, including 148 TV series, 7 variety shows, and 2 movies.
• AMIGOS [257] is collected in two experimental settings. In the first setting, 40 participants viewed 16 short emotional videos. In the second setting, participants watched 4 longer videos, some individually and others in groups. During these sessions, participants' physiological signals, including electroencephalogram (EEG), electrocardiogram (ECG), and galvanic skin response (GSR), were recorded using wearable sensors.

C. Multimodal Aspect-based Sentiment Analysis

• Twitter2015 and Twitter2017 were originally provided by the work [258] for multimodal named entity recognition and annotated with a sentiment polarity for each aspect by the work [14].
• MCPR [259] has 2,719 text-image pairs and 610 distinct aspects in total, collected from 1.5k product reviews involving the clothing and furniture departments of the e-commerce platform JD.com. It is the first aspect-based multimodal Chinese product review dataset.
• Multi-ZOL [50] consists of reviews of mobile phones collected from ZOL.com. It contains 5,288 sets of multimodal data points that cover various models of mobile phones from multiple brands. These data points are annotated with a sentiment intensity rating from 1 to 10 for six aspects.
• MACSA [260] contains more than 21K text-image pairs, provides fine-grained annotations for both the textual and visual content, and is the first to use the aspect category as the pivot to align the fine-grained elements between the two modalities.
• MASAD [261] selects 38,532 samples from a partial VSO visual dataset [263] (approximately 120,000 samples) that can clearly express sentiments, and categorizes them into seven domains: food, goods, buildings, animal, human, plant, and scenery, with a total of 57 predefined aspects.
• PanoSent [262] is annotated both manually and automatically, featuring high quality, large scale (10,000 dialogues), multimodality (text, image, audio and video), multilingualism (English, Chinese and Spanish), multi-scenarios (over 100 domains), and coverage of both implicit and explicit sentiment elements.

D. Multimodal Multi-label Emotion Recognition

• CMU-MOSEI [238] contains 22,856 movie review clips from YouTube videos. Each video intrinsically contains three modalities: text, audio, and visual, and each movie review clip is annotated with at least one emotion category of the set: angry, disgust, fear, happy, sad, surprise.
• M3ED [201] is a multimodal emotional dialogue dataset in Chinese, which contains a total of 9,082 turns and 24,449 utterances, and each utterance is annotated with the seven emotion categories (happy, surprise, sad, disgust, anger, fear, and neutral).

VIII. EVALUATION METRICS

In this section, we report the mainstream evaluation metrics for each multimodal affective computing task.

a) Multimodal Sentiment Analysis: Previous works adopt mean absolute error (MAE), Pearson correlation (Corr), seven-class classification accuracy (ACC-7), binary classification accuracy (ACC-2) and the F1 score computed for positive/negative and non-negative/negative classification as evaluation metrics.

b) Multimodal Emotion Recognition in Conversation: Accuracy (ACC) and weighted F1 (WF1) are used for evaluation. Additionally, the imbalanced label distribution results in a phenomenon where the trained model performs better on some categories and poorly on others. In order to verify the impact of the data distribution on model performance, researchers also provide ACC and F1 on each emotion category.

c) Multimodal Aspect-based Sentiment Analysis: Following previous methods, for the multimodal aspect term extraction (MATE) and joint multimodal aspect sentiment analysis (JMASA) tasks, researchers use precision (P), recall (R) and micro-F1 (F1) as the evaluation metrics. For the multimodal aspect sentiment classification (MASC) task, accuracy (ACC) and macro-F1 are used as evaluation metrics.

d) Multimodal Multi-label Emotion Recognition: Following prior work, multi-label classification works mostly adopt accuracy (ACC), micro-F1, precision (P) and recall (R) as evaluation metrics.

IX. DISCUSSION

In this section, we briefly discuss the works of multimodal affective computing based on facial expressions, acoustic signals, physiological signals, and emotion cause. Furthermore, we discuss the technical routes across multiple multimodal affective computing tasks to track their consistencies and differences.

A. Other Multimodal Affective Computing

a) Multimodal Affective Computing Based on Facial Expression Recognition: Facial expression recognition has significantly evolved over the years, progressing from static to dynamic methods. Initially, static facial expression recognition (SFER) relied on single-frame images, utilizing traditional image processing techniques such as Local Binary Patterns (LBP) and Gabor filters to extract features for classification. The advent of deep learning brought Convolutional Neural Networks (CNNs), which markedly improved the accuracy of SFER [264]–[267]. However, static methods were limited in capturing the temporal dynamics of facial expressions [268]. Some methods approach the problem from a local-global feature perspective, extracting more fine-grained visual representations and identifying key informative segments [269]–[274]. These approaches enhance robustness against noisy frames, enabling uncertainty-aware inference. To further enhance accuracy, recent advancements in DFER focus on integrating multimodal data and employing parameter-efficient fine-tuning (PEFT) to adapt large pre-trained models for enhanced performance [275]–[277], while Liu et al. [278] introduce the concept of expression reenactment (i.e., normalization), harnessing generative AI to mitigate noise in in-the-wild datasets. Moreover, the burgeoning field of evidential deep learning (EDL) has shown considerable promise by enabling explicit uncertainty quantification through distributional measurement in latent spaces for improved interpretability, with demonstrated efficacy in zero-shot learning [279], multi-view classification [280]–[282], video understanding [283]–[285] and multi-modal named entity recognition.

b) Multimodal Affective Computing Based on Acoustic Signals: The single-sentence, single-task model is the most common model in speech emotion recognition. For example, Aldeneh et al. [286] use a CNN to perform convolutions in the time direction over handcrafted temporal features (40-dimensional MFSC) to identify emotion-salient regions, and use global max pooling to capture important temporal areas. Li et al. [287] apply two different convolution kernels on spectrograms to extract temporal and frequency domain features, concatenate them, and input them into a CNN for learning, followed by attention-mechanism pooling for classification. Trigeorgis et al. [288] use a CNN for end-to-end learning directly on speech signals, avoiding the problem of feature extraction not being robust across all speakers. Mirsamadi et al. [289] combine a Bidirectional LSTM (Bi-LSTM) with a novel pooling strategy, utilizing attention mechanisms to enable the network to focus on emotionally prominent parts of sentences.
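A minimal sketch of the attention-pooling idea described above: frame-level Bi-LSTM states are combined with learned attention weights before classification. This is a generic reconstruction of the mechanism rather than the exact model of Mirsamadi et al. [289], and the feature and label sizes are placeholders.

```python
import torch
import torch.nn as nn

class AttentivePoolingSER(nn.Module):
    """Bi-LSTM over acoustic frames with attention pooling for emotion classification."""

    def __init__(self, n_feats=40, hidden=64, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)        # one score per frame
        self.cls = nn.Linear(2 * hidden, n_classes)

    def forward(self, frames):
        # frames: (batch, n_frames, n_feats), e.g. MFSC or log-Mel features
        states, _ = self.lstm(frames)                        # (batch, n_frames, 2*hidden)
        weights = torch.softmax(self.attn(states), dim=1)    # (batch, n_frames, 1)
        utterance = (weights * states).sum(dim=1)            # attention-weighted summary
        return self.cls(utterance)                           # emotion logits

logits = AttentivePoolingSER()(torch.randn(2, 300, 40))  # -> (2, 4)
```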
Zhao et al. [290] consider the temporal and spatial characteristics of the spectrum in the attention mechanism to learn time-related features in spectrograms, using a CNN to learn frequency-related features in spectrograms. Luo et al. [291] propose a dual-channel speech emotion recognition model that uses a CNN and an RNN to learn from spectrograms on one hand, and separately learns HSF features on the other, finally concatenating the obtained features for classification.

c) Multimodal Affective Computing Based on Physiological Signals: In medical measurements and health monitoring, EEG-based emotion recognition (EER) is one of the most promising directions within emotion recognition and has attracted substantial research attention [292]–[294]. Notably, the field of affective computing has seen nearly 1,000 publications related to EER since 2010 [295]. Numerous EEG-based multimodal emotion recognition (EMER) methods have been proposed [296]–[300], leveraging the complementarity and redundancy between EEG and other physiological signals in expressing emotions. For example, Vazquez et al. [301] address the problem of multimodal emotion recognition from multiple physiological signals, demonstrating that a Transformer-based approach is suitable for emotion recognition based on physiological signals.

d) Multimodal Affective Computing Based on Emotion Cause: Apart from focusing on the emotions themselves, the capacity of machines to understand the cause that triggers an emotion is essential for comprehending human behaviors, which makes emotion-cause pair extraction (ECPE) crucial. Over the years, text-based ECPE has made significant progress [302], [303]. Based on ECPE, Li et al. [304] propose multimodal emotion-cause pair extraction (MECPE), which aims to extract emotion-cause pairs with multimodal information. Initially, Li et al. [304] construct a joint training architecture, which contains the main task, i.e., multimodal emotion-cause pair extraction, and two subtasks, i.e., multimodal emotion detection and cause detection. To solve MECPE, researchers borrow the multitask learning framework to train the model using multiple training objectives of sub-tasks, aiming to enhance knowledge sharing among them. For example, Li et al. [305] propose a novel model that captures holistic interaction and label constraint (HiLo) features for the MECPE task. HiLo enables cross-modality and cross-utterance feature interactions through various attention mechanisms, providing a strong foundation for accurate cause extraction.

B. Consistency among Multimodal Affective Computing

We categorize the multimodal affective computing tasks into several key areas: multimodal alignment and fusion, multi-task learning, pre-trained models, enhanced knowledge, and contextual information. To ensure clarity, we discuss the consistencies across these aspects.

a) Multimodal alignment and fusion: Among the MSA, MERC, MABSA and MMER tasks, each is fundamentally a multimodal task that involves considering and combining at least two modalities to make decisions. This process includes extracting features from each modality and integrating them into a unified representation vector. In multimodal representation learning, modal alignment and fusion are two critical issues that must be addressed to advance the field of multimodal affective computing. For vision-dominated multimodal tasks such as image captioning [8], [306], the impact of vision is more significant than that of language. In contrast, multimodal affective computing tasks place a greater emphasis on language [38], [307].

b) Pre-trained model: Generally, pre-trained models are used to encode raw modal information into vectors. From this perspective, multimodal affective computing tasks adopt pre-trained models as the backbone and then fine-tune them for downstream tasks. For example, UniMSE [38] uses T5 as the backbone, while GMP [18] utilizes BART. These approaches aim to transfer the general knowledge embedded in pre-trained language models to the field of affective computing.

c) Enhanced knowledge: Commonsense knowledge encompasses facts and judgments about our natural world. In the field of affective computing, this knowledge is crucial for enabling machines to understand human emotions and their underlying causes. Researchers enhance affective computing by integrating external knowledge sources such as sentiment lexicons [308], English knowledge bases [309]–[313], and Chinese knowledge bases [314].

d) Contextual information: Affective computing tasks require an understanding of contextual information. In MERC, contextual information encompasses the entire conversation, including both previous and subsequent utterances relative to the current utterance. For MABSA, contextual information refers to the full sentence containing customer opinions. Researchers integrate contextual information using hierarchical approaches [315], [316], self-attention mechanisms [58], and graph-based dependency modeling [317], [318]. Additionally, affective computing tasks can enhance understanding by incorporating non-verbal cues such as facial expressions and vocal tone, alongside textual information.

C. Difference among Multimodal Affective Computing

We examine the differences among multimodal affective computing tasks by considering the type of downstream task, sentiment granularity, and application contexts to identify the unique characteristics of each task.

For downstream tasks, MSA predicts sentiment strength as a regression task. MERC is a multi-class classification task for identifying emotion categories. MMER performs multi-label emotion recognition, detecting multiple emotions simultaneously. MABSA involves extracting aspects and opinions to determine sentiment polarity, categorizing it as information extraction. In terms of analysis granularity, MERC and MECPE focus on utterances and speakers within a conversation, while MSA and MMER concentrate on sentence-level information within a document. MABSA, on the other hand, focuses on aspects within comments. Some studies infer fine-grained sentiment from coarse-grained sentiment [209], [319] or integrate tasks of different granularities into a unified training framework [307]. Due to these differences in granularity, the contextual information varies as well. For instance, in MABSA, the context includes the comment along with any associated images and short descriptions of aspects, whereas in MERC, the context encompasses the entire conversation
[12] Z. Sun, P. K. Sarma, W. A. Sethares, and Y. Liang, "Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis," in The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, 2020, pp. 8992–8999.
[13] A. Zadeh, C. Mao, K. Shi, Y. Zhang, P. P. Liang, S. Poria, and L. Morency, "Factorized multimodal transformer for multimodal sequential learning," CoRR, vol. abs/1911.09826, 2019.
[14] D. Lu, L. Neves, V. Carvalho, N. Zhang, and H. Ji, "Visual attention model for name tagging in multimodal social media," in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, 2018, pp. 1990–1999.
[15] A. Gandhi, K. Adhvaryu, S. Poria, E. Cambria, and A. Hussain, "Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions," Inf. Fusion, vol. 91, pp. 424–444, 2023.
[16] N. Xu, W. Mao, and G. Chen, "Multi-interactive memory network for aspect based multimodal sentiment analysis," in The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, 2019, pp. 371–378.
[17] F. Wang, Z. Ding, R. Xia, Z. Li, and J. Yu, "Multimodal emotion-cause pair extraction in conversations," CoRR, vol. abs/2110.08020, 2021.
[18] X. Yang, S. Feng, D. Wang, Q. Sun, W. Wu, Y. Zhang, P. Hong, and S. Poria, "Few-shot joint multimodal aspect-sentiment analysis based on generative multimodal prompt," in Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, 2023, pp. 11575–11589.
[19] S. Rahmani, S. Hosseini, R. Zall, M. R. Kangavari, S. Kamran, and W. Hua, "Transfer-based adaptive tree for multimodal sentiment analysis based on user latent aspects," Knowl. Based Syst., vol. 261, p. 110219, 2023.
[20] Z. Li, Y. Zhou, Y. Liu, F. Zhu, C. Yang, and S. Hu, "QAP: A quantum-inspired adaptive-priority-learning model for multimodal emotion recognition," in Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, 2023, pp. 12191–12204.
[21] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, "Multimodal deep learning," in Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, L. Getoor and T. Scheffer, Eds. Omnipress, 2011, pp. 689–696.
[22] W. Rahman, M. K. Hasan, S. Lee, A. B. Zadeh, C. Mao, L. Morency, and M. E. Hoque, "Integrating multimodal information in large pre-trained transformers," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, 2020, pp. 2359–2369.
[23] Y. Zhang and Q. Yang, "A survey on multi-task learning," IEEE Trans. Knowl. Data Eng., vol. 34, no. 12, pp. 5586–5609, 2022.
[24] Y. Xie, K. Yang, C. Sun, B. Liu, and Z. Ji, "Knowledge-interactive network with sentiment polarity intensity-aware multi-task learning for emotion recognition in conversations," in Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational Linguistics, 2021, pp. 2879–2889.
[25] W. Zheng, J. Yu, R. Xia, and S. Wang, "A facial expression-aware multimodal multi-task learning framework for emotion recognition in multi-party conversations," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, 2023, pp. 15445–15459.
[26] Z. Chen, L. Chen, B. Chen, L. Qin, Y. Liu, S. Zhu, J. Lou, and K. Yu, "Unidu: Towards a unified generative dialogue understanding framework," CoRR, vol. abs/2204.04637, 2022.
[27] H. Luo, L. Ji, B. Shi, H. Huang, N. Duan, T. Li, X. Chen, and M. Zhou, "Univilm: A unified video and language pre-training model for multimodal understanding and generation," CoRR, vol. abs/2002.06353, 2020.
[28] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning transferable visual models from natural language supervision," in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, 2021, pp. 8748–8763.
[29] H. Bao, W. Wang, L. Dong, Q. Liu, O. K. Mohammed, K. Aggarwal, S. Som, S. Piao, and F. Wei, "Vlmo: Unified vision-language pre-training with mixture-of-modality-experts," in NeurIPS, 2022.
[30] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, "Coca: Contrastive captioners are image-text foundation models," Trans. Mach. Learn. Res., vol. 2022, 2022.
[31] H. Akbari, L. Yuan, R. Qian, W. Chuang, S. Chang, Y. Cui, and B. Gong, "VATT: transformers for multimodal self-supervised learning from raw video, audio and text," in Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, 2021, pp. 24206–24221.
[32] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, "Parameter-efficient transfer learning for NLP," in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, 2019, pp. 2790–2799.
[33] X. L. Li and P. Liang, "Prefix-tuning: Optimizing continuous prompts for generation," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli, Eds. Association for Computational Linguistics, 2021, pp. 4582–4597.
[34] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, "Finetuned language models are zero-shot learners," arXiv preprint arXiv:2109.01652, 2021.
[35] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, "Language models are few-shot learners," in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., 2020.
[36] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, and Z. Sui, "A survey on in-context learning," arXiv preprint arXiv:2301.00234, 2022.
[37] S. Zou, X. Huang, and X. Shen, "Multimodal prompt transformer with hybrid contrastive learning for emotion recognition in conversation," CoRR, vol. abs/2310.04456, 2023.
[38] G. Hu, T. Lin, Y. Zhao, G. Lu, Y. Wu, and Y. Li, "Unimse: Towards unified multimodal sentiment analysis and emotion recognition," in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, 2022, pp. 7837–7851.
[39] Y. Zhang, X. Yang, X. Xu, Z. Gao, Y. Huang, S. Mu, S. Feng, D. Wang, Y. Zhang, K. Song et al., "Affective computing in the era of large language models: A survey from the nlp perspective," arXiv preprint arXiv:2408.04638, 2024.
[40] B. Pan, K. Hirota, Z. Jia, and Y. Dai, "A review of multimodal emotion recognition from datasets, preprocessing, features, and fusion methods," Neurocomputing, vol. 561, p. 126866, 2023.
[41] K. Ezzameli and H. Mahersia, "Emotion recognition from unimodal to multimodal analysis: A review," Inf. Fusion, vol. 99, p. 101847, 2023.
[42] Z. Wang, X. Zhang, J. Cui, S.-B. Ho, and E. Cambria, "A review of chinese sentiment analysis: Subjects, methods, and trends."
[43] A. Gandhi, K. Adhvaryu, S. Poria, E. Cambria, and A. Hussain, "Multimodal sentiment analysis: A systematic review of history, datasets, multimodal fusion methods, applications, challenges and future directions," Information Fusion, vol. 91, pp. 424–444, 2023.
[44] L. Zhu, Z. Zhu, C. Zhang, Y. Xu, and X. Kong, "Multimodal sentiment analysis based on fusion methods: A survey," Inf. Fusion, vol. 95, pp. 306–325, 2023.
[45] T. Thongtan and T. Phienthrakul, "Sentiment classification using document embeddings trained with cosine similarity," in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 2: Student Research Workshop, F. Alva-Manchego, E. Choi, and D. Khashabi, Eds. Association for Computational Linguistics, 2019, pp. 407–414.
19
[46] L. Morency, R. Mihalcea, and P. Doshi, “Towards multimodal sentiment E. Grave, and G. Lample, “Llama: Open and efficient foundation
analysis: harvesting opinions from the web,” in Proceedings of the language models,” CoRR, vol. abs/2302.13971, 2023.
13th International Conference on Multimodal Interfaces, ICMI 2011, [62] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei,
Alicante, Spain, November 14-18, 2011, 2011, pp. 169–176. N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher,
[47] A. G. A. and V. Vetriselvi, “Survey on multimodal approaches to C. Canton-Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes,
emotion recognition,” Neurocomputing, vol. 556, p. 126693, 2023. J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn,
[48] Y. Sun, N. Yu, and G. Fu, “A discourse-aware graph neural network S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa,
for emotion recognition in multi-party conversation,” in Findings of I. Kloumann, A. Korenev, P. S. Koura, M. Lachaux, T. Lavril, J. Lee,
the Association for Computational Linguistics: EMNLP 2021, Virtual D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra,
Event / Punta Cana, Dominican Republic, 16-20 November, 2021, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi,
M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan,
Computational Linguistics, 2021, pp. 2949–2958. B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov,
[49] A. Joshi, A. Bhat, A. Jain, A. V. Singh, and A. Modi, “COGMEN: Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic,
contextualized GNN based multimodal emotion recognition,” CoRR, S. Edunov, and T. Scialom, “Llama 2: Open foundation and fine-tuned
vol. abs/2205.02455, 2022. chat models,” CoRR, vol. abs/2307.09288, 2023.
[50] N. Xu, W. Mao, and G. Chen, “Multi-interactive memory network for [63] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with
aspect based multimodal sentiment analysis,” in Proceedings of the selective state spaces,” CoRR, vol. abs/2312.00752, 2023.
AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. [64] Y. Gong, Y. Chung, and J. R. Glass, “AST: audio spectrogram
371–378. transformer,” in 22nd Annual Conference of the International Speech
[51] Z. Chen and T. Qian, “Transfer capsule network for aspect level Communication Association, Interspeech 2021, Brno, Czechia, August
sentiment classification,” in Proceedings of the 57th Conference of 30 - September 3, 2021, H. Hermansky, H. Cernocký, L. Burget,
the Association for Computational Linguistics, ACL 2019, Florence, L. Lamel, O. Scharenborg, and P. Motlı́cek, Eds. ISCA, 2021, pp.
Italy, July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen, 571–575.
D. R. Traum, and L. Màrquez, Eds. Association for Computational [65] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for
Linguistics, 2019, pp. 547–556. convolutional neural networks,” in Proceedings of the 36th Interna-
[52] H. Yan, J. Dai, T. Ji, X. Qiu, and Z. Zhang, “A unified generative tional Conference on Machine Learning, ICML 2019, 9-15 June 2019,
framework for aspect-based sentiment analysis,” in Proceedings of the Long Beach, California, USA, ser. Proceedings of Machine Learning
59th Annual Meeting of the Association for Computational Linguistics Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR,
and the 11th International Joint Conference on Natural Language 2019, pp. 6105–6114.
Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual [66] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli, Eds. T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly,
Association for Computational Linguistics, 2021, pp. 2416–2429. J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words:
[53] C. Li, F. Gao, J. Bu, L. Xu, X. Chen, Y. Gu, Z. Shao, Q. Zheng, Transformers for image recognition at scale,” in 9th International
N. Zhang, Y. Wang, and Z. Yu, “Sentiprompt: Sentiment knowledge Conference on Learning Representations, ICLR 2021, Virtual Event,
enhanced prompt-tuning for aspect-based sentiment analysis,” CoRR, Austria, May 3-7, 2021. OpenReview.net, 2021.
vol. abs/2109.08306, 2021. [67] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal,
[54] P. Yang, X. Sun, W. Li, S. Ma, W. Wu, and H. Wang, “SGM: sequence G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable
generation model for multi-label classification,” in Proceedings of the visual models from natural language supervision,” in International
27th International Conference on Computational Linguistics, COLING conference on machine learning, 2021, pp. 8748–8763.
2018, Santa Fe, New Mexico, USA, August 20-26, 2018, 2018, pp. [68] “Gpt-4v(ision) system card,” 2023. [Online]. Available: https:
3915–3926. //api.semanticscholar.org/CorpusID:263218031
[55] Q. Ma, C. Yuan, W. Zhou, and S. Hu, “Label-specific dual graph
[69] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut,
neural network for multi-label text classification,” in Proceedings of the
J. Schalkwyk, A. M. Dai, A. Hauth et al., “Gemini: a family of highly
59th Annual Meeting of the Association for Computational Linguistics
capable multimodal models,” arXiv preprint arXiv:2312.11805, 2023.
and the 11th International Joint Conference on Natural Language
Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual [70] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson,
Event, August 1-6, 2021, 2021, pp. 3855–3864. K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo:
[56] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, a visual language model for few-shot learning,” Advances in Neural
“Distributed representations of words and phrases and their compo- Information Processing Systems, vol. 35, pp. 23 716–23 736, 2022.
sitionality,” in Advances in Neural Information Processing Systems 26: [71] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-
27th Annual Conference on Neural Information Processing Systems image pre-training with frozen image encoders and large language
2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, models,” arXiv preprint arXiv:2301.12597, 2023.
Nevada, United States, 2013, pp. 3111–3119. [72] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, “Minigpt-4: En-
[57] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors hancing vision-language understanding with advanced large language
for word representation,” in Proceedings of the 2014 Conference on models,” arXiv preprint arXiv:2304.10592, 2023.
Empirical Methods in Natural Language Processing, EMNLP 2014, [73] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li,
October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special P. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision-
Interest Group of the ACL, 2014, pp. 1532–1543. language models with instruction tuning,” 2023.
[58] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. [74] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,”
Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 36, 2024.
in Advances in Neural Information Processing Systems 30: Annual [75] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li,
Conference on Neural Information Processing Systems 2017, December X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned
4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, language models,” Journal of Machine Learning Research, vol. 25,
H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, Eds., no. 70, pp. 1–53, 2024.
2017, pp. 5998–6008. [76] A. Zadeh, M. Chen, S. Poria, E. Cambria, and L. Morency, “Tensor
[59] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, fusion network for multimodal sentiment analysis,” in Proceedings
V. Stoyanov, and L. Zettlemoyer, “BART: denoising sequence-to- of the 2017 Conference on Empirical Methods in Natural Language
sequence pre-training for natural language generation, translation, and Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11,
comprehension,” in Proceedings of the 58th Annual Meeting of the 2017, M. Palmer, R. Hwa, and S. Riedel, Eds. Association for
Association for Computational Linguistics, ACL 2020, Online, July 5- Computational Linguistics, 2017, pp. 1103–1114.
10, 2020, 2020, pp. 7871–7880. [77] Y. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L. Morency, and
[60] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, R. Salakhutdinov, “Multimodal transformer for unaligned multimodal
Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning language sequences,” in Proceedings of the 57th Conference of the
with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, Association for Computational Linguistics, ACL 2019, Florence, Italy,
pp. 140:1–140:67, 2020. July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen,
[61] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, D. R. Traum, and L. Màrquez, Eds. Association for Computational
B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, Linguistics, 2019, pp. 6558–6569.
20
[78] C. Chen, H. Hong, J. Guo, and B. Song, “Inter-intra modal representa- Web Conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023 -
tion augmentation with trimodal collaborative disentanglement network 4 May 2023, 2023, pp. 1510–1518.
for multimodal sentiment analysis,” IEEE ACM Trans. Audio Speech [96] M. Sharafi, M. Yazdchi, R. Rasti, and F. Nasimi, “A novel spatio-
Lang. Process., vol. 31, pp. 1476–1488, 2023. temporal convolutional neural framework for multimodal emotion
[79] K. Yang, H. Xu, and K. Gao, “CM-BERT: cross-modal BERT for text- recognition,” Biomed. Signal Process. Control., vol. 78, p. 103970,
audio sentiment analysis,” in MM ’20: The 28th ACM International 2022.
Conference on Multimedia, Virtual Event / Seattle, WA, USA, October [97] Y. Li, W. Weng, and C. Liu, “Tscl-fhfn: two-stage contrastive learning
12-16, 2020, 2020, pp. 521–528. and feature hierarchical fusion network for multimodal sentiment
[80] Z. Lin, B. Liang, Y. Long, Y. Dang, M. Yang, M. Zhang, and analysis,” Neural Computing and Applications, pp. 1–15, 2024.
R. Xu, “Modeling intra- and inter-modal relations: Hierarchical graph [98] S. Mai, H. Hu, and S. Xing, “Divide, conquer and combine: Hierarchi-
contrastive learning for multimodal sentiment analysis,” in Proceedings cal feature fusion network with local and global perspectives for mul-
of the 29th International Conference on Computational Linguistics, timodal affective computing,” in Proceedings of the 57th Conference
COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, of the Association for Computational Linguistics, ACL 2019, Florence,
2022, pp. 7124–7135. Italy, July 28- August 2, 2019, Volume 1: Long Papers, 2019, pp. 481–
[81] J. Tang, D. Liu, X. Jin, Y. Peng, Q. Zhao, Y. Ding, and W. Kong, 492.
“BAFN: bi-direction attention based fusion network for multimodal [99] Z. Li, B. Xu, C. Zhu, and T. Zhao, “CLMLF: A contrastive learning
sentiment analysis,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, and multi-layer fusion method for multimodal sentiment detection,”
no. 4, pp. 1966–1978, 2023. in Findings of the Association for Computational Linguistics: NAACL
[82] C. Huang, J. Zhang, X. Wu, Y. Wang, M. Li, and X. Huang, “Tefna: 2022, Seattle, WA, United States, July 10-15, 2022, 2022, pp. 2282–
Text-centered fusion network with crossmodal attention for multimodal 2294.
sentiment analysis,” Knowl. Based Syst., vol. 269, p. 110502, 2023. [100] P. P. Liang, Z. Liu, A. Zadeh, and L. Morency, “Multimodal language
[83] X. Zhang and Y. Li, “A cross-modality context fusion and analysis with recurrent multistage fusion,” in Proceedings of the 2018
semantic refinement network for emotion recognition in conversation,” Conference on Empirical Methods in Natural Language Processing,
in Proceedings of the 61st Annual Meeting of the Association for Brussels, Belgium, October 31 - November 4, 2018, 2018, pp. 150–
Computational Linguistics (Volume 1: Long Papers). Toronto, Canada: 161.
Association for Computational Linguistics, Jul. 2023, pp. 13 099– [101] J. Tang, K. Li, X. Jin, A. Cichocki, Q. Zhao, and W. Kong, “CTFN:
13 110. [Online]. Available: https://aclanthology.org/2023.acl-long.732 hierarchical learning for multimodal sentiment analysis using coupled-
[84] T. Shi and S. Huang, “Multiemo: An attention-based correlation-aware translation fusion network,” in Proceedings of the 59th Annual Meeting
multimodal fusion framework for emotion recognition in conversa- of the Association for Computational Linguistics and the 11th Interna-
tions,” in Proceedings of the 61st Annual Meeting of the Association tional Joint Conference on Natural Language Processing, ACL/IJCNLP
for Computational Linguistics (Volume 1: Long Papers), ACL 2023, 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, 2021,
Toronto, Canada, July 9-14, 2023, 2023, pp. 14 752–14 766. pp. 5301–5311.
[85] X. Chen, “Mmrbn: Rule-based network for multimodal emotion recog- [102] Z. Li, Q. Guo, Y. Pan, W. Ding, J. Yu, Y. Zhang, W. Liu, H. Chen,
nition,” in ICASSP 2024-2024 IEEE International Conference on H. Wang, and Y. Xie, “Multi-level correlation mining framework with
Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. self-supervised label generation for multimodal sentiment analysis,” Inf.
8200–8204. Fusion, vol. 99, p. 101891, 2023.
[86] X. Li, J. Liu, Y. Xie, P. Gong, X. Zhang, and H. He, “Magdra: a multi- [103] J. Peng, T. Wu, W. Zhang, F. Cheng, S. Tan, F. Yi, and Y. Huang,
modal attention graph network with dynamic routing-by-agreement for “A fine-grained modal label-based multi-stage network for multimodal
multi-label emotion recognition,” Knowledge-Based Systems, vol. 283, sentiment analysis,” Expert Syst. Appl., vol. 221, p. 119721, 2023.
p. 111126, 2024. [104] H. Luo, L. Ji, Y. Huang, B. Wang, S. Ji, and T. Li, “Scalevlad:
[87] N. K. Devulapally, S. Anand, S. D. Bhattacharjee, J. Yuan, and Improving multimodal sentiment analysis via multi-scale fusion of
Y. Chang, “Amuse: Adaptive multimodal analysis for speaker emotion locally descriptors,” CoRR, vol. abs/2112.01368, 2021.
recognition in group conversations,” CoRR, vol. abs/2401.15164, 2024. [105] S. Mai, Y. Zhao, Y. Zeng, J. Yao, and H. Hu, “Meta-learn unimodal
[88] W. Han, H. Chen, and S. Poria, “Improving multimodal fusion with signals with weak supervision for multimodal sentiment analysis,”
hierarchical mutual information maximization for multimodal senti- arXiv preprint arXiv:2408.16029, 2024.
ment analysis,” in Proceedings of the 2021 Conference on Empirical [106] S. Minglong, O. Chunping, L. Yongbin, and R. Lin, “Multimodal emo-
Methods in Natural Language Processing, EMNLP 2021, Virtual Event tion recognition based on hierarchical fusion strategy and contextual
/ Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, information embedding,” Beijing Da Xue Xue Bao, vol. 60, no. 3, pp.
X. Huang, L. Specia, and S. W. Yih, Eds., pp. 9180–9192. 393–402, 2024.
[89] S. Mai, Y. Zeng, S. Zheng, and H. Hu, “Hybrid contrastive learning [107] J. Yang, Y. Wang, R. Yi, Y. Zhu, A. Rehman, A. Zadeh, S. Poria, and
of tri-modal representation for multimodal sentiment analysis,” CoRR, L. Morency, “MTAG: modal-temporal attention graph for unaligned
vol. abs/2109.01797, 2021. human multimodal language sequences,” in Proceedings of the 2021
[90] J. Zheng, S. Zhang, X. Wang, and Z. Zeng, “Multimodal representa- Conference of the North American Chapter of the Association for
tions learning based on mutual information maximization and mini- Computational Linguistics: Human Language Technologies, NAACL-
mization and identity embedding for multimodal sentiment analysis,” HLT 2021, Online, June 6-11, 2021, 2021, pp. 1009–1021.
arXiv preprint arXiv:2201.03969, 2022. [108] N. Majumder, D. Hazarika, A. Gelbukh, E. Cambria, and S. Poria,
[91] D. Hazarika, R. Zimmermann, and S. Poria, “MISA: modality-invariant “Multimodal sentiment analysis using hierarchical fusion with context
and -specific representations for multimodal sentiment analysis,” in modeling,” Knowledge-based systems, vol. 161, pp. 124–133, 2018.
MM ’20: The 28th ACM International Conference on Multimedia, [109] J. Zhao, R. Li, and Q. Jin, “Missing modality imagination network for
Virtual Event / Seattle, WA, USA, October 12-16, 2020, C. W. Chen, emotion recognition with uncertain missing modalities,” in Proceedings
R. Cucchiara, X. Hua, G. Qi, E. Ricci, Z. Zhang, and R. Zimmermann, of the 59th Annual Meeting of the Association for Computational
Eds. ACM, 2020, pp. 1122–1131. Linguistics and the 11th International Joint Conference on Natural
[92] L. Xiao, X. Wu, S. Yang, J. Xu, J. Zhou, and L. He, “Cross-modal Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers),
fine-grained alignment and fusion network for multimodal aspect-based Virtual Event, August 1-6, 2021, 2021, pp. 2608–2618.
sentiment analysis,” Inf. Process. Manag., vol. 60, no. 6, p. 103508, [110] S. Parthasarathy and S. Sundaram, “Training strategies to handle miss-
2023. ing modalities for audio-visual expression recognition,” in Compan-
[93] Z. Zhang, Z. Wang, X. Li, N. Liu, B. Guo, and Z. Yu, “Modalnet: an ion Publication of the 2020 International Conference on Multimodal
aspect-level sentiment classification model by exploring multimodal Interaction, ICMI Companion 2020, Virtual Event, The Netherlands,
data with fusion discriminant attentional network,” World Wide Web, October, 2020, 2020, pp. 400–404.
vol. 24, no. 6, pp. 1957–1974, 2021. [111] N. Wang, H. Cao, J. Zhao, R. Chen, D. Yan, and J. Zhang, “M2R2:
[94] Y. Zhang, M. Chen, J. Shen, and C. Wang, “Tailor versatile multi- missing-modality robust emotion recognition framework with iterative
modal learning for multi-label emotion recognition,” in Proceedings of data augmentation,” IEEE Trans. Artif. Intell., vol. 4, no. 5, pp. 1305–
the AAAI Conference on Artificial Intelligence, vol. 36, no. 8, 2022, 1316, 2023.
pp. 9100–9108. [112] J. Zeng, J. Zhou, and T. Liu, “Mitigating inconsistencies in multimodal
[95] S. Ge, Z. Jiang, Z. Cheng, C. Wang, Y. Yin, and Q. Gu, “Learning sentiment analysis under uncertain missing modalities,” in Proceedings
robust multi-modal representation for multi-label emotion recognition of the 2022 Conference on Empirical Methods in Natural Language
via adversarial masking and perturbation,” in Proceedings of the ACM Processing, 2022, pp. 2924–2934.
21
[113] Z. Yuan, W. Li, H. Xu, and W. Yu, “Transformer-based feature [131] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine
reconstruction network for robust multimodal sentiment analysis,” in learning: A survey and taxonomy,” IEEE transactions on pattern
Proceedings of the 29th ACM International Conference on Multimedia, analysis and machine intelligence, vol. 41, no. 2, pp. 423–443, 2018.
2021, pp. 4400–4407. [132] Z. Zeng, M. Pantic, G. I. Roisman, and T. S. Huang, “A survey of affect
[114] W. Luo, M. Xu, and H. Lai, “Multimodal reconstruct and align net for recognition methods: Audio, visual, and spontaneous expressions,”
missing modality problem in sentiment analysis,” in MultiMedia Mod- IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 1, pp. 39–58,
eling - 29th International Conference, MMM 2023, Bergen, Norway, 2009.
January 9-12, 2023, Proceedings, Part II, 2023, pp. 411–422. [133] C. Du, C. Du, H. Wang, J. Li, W.-L. Zheng, B.-L. Lu, and H. He,
[115] C. Shang, A. Palmer, J. Sun, K. Chen, J. Lu, and J. Bi, “VIGAN: “Semi-supervised deep generative modelling of incomplete multi-
missing view imputation with generative adversarial networks,” in 2017 modality emotional data,” in Proceedings of the 26th ACM interna-
IEEE International Conference on Big Data (IEEE BigData 2017), tional conference on Multimedia, 2018, pp. 108–116.
Boston, MA, USA, December 11-14, 2017, 2017, pp. 766–775. [134] Z. Wang, Z. Wan, and X. Wan, “Transmodality: An end2end fusion
[116] J. Zeng, T. Liu, and J. Zhou, “Tag-assisted multimodal sentiment method with transformer for multimodal sentiment analysis,” in WWW
analysis under uncertain missing modalities,” in SIGIR ’22: The 45th ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020,
International ACM SIGIR Conference on Research and Development 2020, pp. 2514–2520.
in Information Retrieval, Madrid, Spain, July 11 - 15, 2022, 2022, pp. [135] F. Ma, S. Huang, and L. Zhang, “An efficient approach for audio-visual
1545–1554. emotion recognition with missing labels and missing modalities,” in
[117] H. Zuo, R. Liu, J. Zhao, G. Gao, and H. Li, “Exploiting modality- 2021 IEEE International Conference on Multimedia and Expo, ICME
invariant feature for robust multimodal emotion recognition with miss- 2021, Shenzhen, China, July 5-9, 2021, 2021, pp. 1–6.
ing modalities,” in IEEE International Conference on Acoustics, Speech [136] W. Yu, H. Xu, Z. Yuan, and J. Wu, “Learning modality-specific
and Signal Processing ICASSP 2023, Rhodes Island, Greece, June 4- representations with self-supervised multi-task learning for multimodal
10, 2023, 2023, pp. 1–5. sentiment analysis,” in Thirty-Fifth AAAI Conference on Artificial
[118] Z. Liu, B. Zhou, D. Chu, Y. Sun, and L. Meng, “Modality translation- Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Appli-
based multimodal sentiment analysis under uncertain missing modali- cations of Artificial Intelligence, IAAI 2021, The Eleventh Symposium
ties,” Inf. Fusion, vol. 101, p. 101973, 2024. on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual
[119] T. Zhou, S. Canu, P. Vera, and S. Ruan, “Feature-enhanced generation Event, February 2-9, 2021, pp. 10 790–10 797.
and multi-modality fusion based deep neural network for brain tumor [137] S. Mai, H. Hu, and S. Xing, “Modality to modality translation:
segmentation with missing MR modalities,” Neurocomputing, vol. 466, An adversarial representation learning and graph fusion network for
pp. 102–112, 2021. multimodal fusion,” in The Thirty-Fourth AAAI Conference on Artificial
[120] J. Vazquez-Rodriguez, G. Lefebvre, J. Cumin, and J. L. Crowley, Intelligence, AAAI 2020, The Thirty-Second Innovative Applications
“Accommodating missing modalities in time-continuous multimodal of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI
emotion recognition,” in 11th International Conference on Affective Symposium on Educational Advances in Artificial Intelligence, EAAI
Computing and Intelligent Interaction, ACII 2023, Cambridge, MA, 2020, New York, NY, USA, February 7-12, 2020. AAAI Press, 2020,
USA, September 10-13, 2023. IEEE, 2023, pp. 1–8. pp. 164–172.
[121] H. Mao, B. Zhang, H. Xu, Z. Yuan, and Y. Liu, “Robust-msa: [138] M. S. Akhtar, D. S. Chauhan, D. Ghosal, S. Poria, A. Ekbal, and
Understanding the impact of modality noise on multimodal sentiment P. Bhattacharyya, “Multi-task learning for multi-modal emotion recog-
analysis,” in Thirty-Seventh AAAI Conference on Artificial Intelligence, nition and sentiment analysis,” in Proceedings of the 2019 Conference
AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Ar- of the North American Chapter of the Association for Computa-
tificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational tional Linguistics: Human Language Technologies, NAACL-HLT 2019,
Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short
February 7-14, 2023, 2023, pp. 16 458–16 460. Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for
[122] S. Lai, X. Hu, Y. Li, Z. Ren, Z. Liu, and D. Miao, “Shared and Computational Linguistics, 2019, pp. 370–379.
private information learning in multimodal sentiment analysis with [139] R. Chen, W. Zhou, Y. Li, and H. Zhou, “Video-based cross-modal
deep modal alignment and self-supervised multi-task learning,” arXiv auxiliary network for multimodal sentiment analysis,” IEEE Trans.
preprint arXiv:2305.08473, 2023. Circuits Syst. Video Technol., vol. 32, no. 12, pp. 8703–8716, 2022.
[123] X. Zhang, W. Cui, B. Hu, and Y. Li, “A multi-level alignment and cross- [140] Y. Zeng, W. Yan, S. Mai, and H. Hu, “Disentanglement translation
modal unified semantic graph refinement network for conversational network for multimodal sentiment analysis,” Inf. Fusion, vol. 102, p.
emotion recognition,” IEEE Transactions on Affective Computing, 102031, 2024.
2024. [141] D. Sun, Y. He, and J. Han, “Using auxiliary tasks in multimodal fusion
[124] E. Shutova, D. Kiela, and J. Maillard, “Black holes and white rabbits: of wav2vec 2.0 and BERT for multimodal emotion recognition,” CoRR,
Metaphor identification with visual features,” in NAACL HLT 2016, The vol. abs/2302.13661, 2023.
2016 Conference of the North American Chapter of the Association for [142] Z. Zhao, Y. Wang, G. Shen, Y. Xu, and J. Zhang, “Tdfnet: Transformer-
Computational Linguistics: Human Language Technologies, San Diego based deep-scale fusion network for multimodal emotion recognition,”
California, USA, June 12-17, 2016, 2016, pp. 160–170. IEEE ACM Trans. Audio Speech Lang. Process., vol. 31, pp. 3771–
[125] S. Moon, S. Kim, and H. Wang, “Multimodal transfer deep learning 3782, 2023.
for audio visual recognition,” CoRR, vol. abs/1412.3121, 2014. [143] M. Ren, X. Huang, J. Liu, M. Liu, X. Li, and A. Liu, “MALN:
[126] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior, multimodal adversarial learning network for conversational emotion
“Recent advances in the automatic recognition of audiovisual speech,” recognition,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 11,
Proceedings of the IEEE, vol. 91, no. 9, pp. 1306–1326, 2003. pp. 6965–6980, 2023.
[127] G. Evangelopoulos, A. Zlatintsi, A. Potamianos, P. Maragos, K. Ra- [144] F. Liu, S. Shen, Z. Fu, H. Wang, A. Zhou, and J. Qi, “LGCCT: A light
pantzikos, G. Skoumas, and Y. Avrithis, “Multimodal saliency and gated and crossed complementation transformer for multimodal speech
fusion for movie summarization based on aural, visual, and textual emotion recognition,” Entropy, vol. 24, no. 7, p. 1010, 2022.
attention,” IEEE Trans. Multim., vol. 15, no. 7, pp. 1553–1568, 2013. [145] C. Zhang, Y. Zhang, and B. Cheng, “Rl-emo: A reinforcement learning
[128] M. Glodek, S. Tschechne, G. Layher, M. Schels, T. Brosch, S. Scherer, framework for multimodal emotion recognition,” in ICASSP 2024 -
M. Kächele, M. Schmidt, H. Neumann, G. Palm, and F. Schwenker, 2024 IEEE International Conference on Acoustics, Speech and Signal
“Multiple classifier systems for the classification of audio-visual emo- Processing (ICASSP), 2024, pp. 10 246–10 250.
tional states,” in Affective Computing and Intelligent Interaction - [146] L. Yang, J. Na, and J. Yu, “Cross-modal multitask transformer for
Fourth International Conference, ACII 2011, Memphis, TN, USA, end-to-end multimodal aspect-based sentiment analysis,” Inf. Process.
October 9-12, 2011, Proceedings, Part II, 2011, pp. 359–368. Manag., vol. 59, no. 5, p. 103038, 2022.
[129] S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, and L.- [147] R. Jain, A. Singh, V. K. Gangwar, and S. Saha, “Abcord: Exploit-
P. Morency, “Context-dependent sentiment analysis in user-generated ing multimodal generative approach for aspect-based complaint and
videos,” in Proceedings of the 55th annual meeting of the association rationale detection,” in Proceedings of the 31st ACM International
for computational linguistics (volume 1: Long papers), 2017, pp. 873– Conference on Multimedia, MM 2023, Ottawa, ON, Canada, 29
883. October 2023- 3 November 2023, 2023, pp. 8571–8579.
[130] A. I. Middya, B. Nag, and S. Roy, “Deep learning based multimodal [148] X. Ju, D. Zhang, R. Xiao, J. Li, S. Li, M. Zhang, and G. Zhou,
emotion recognition using model-level fusion of audio-visual modali- “Joint multi-modal aspect-sentiment analysis with auxiliary cross-
ties,” Knowl. Based Syst., vol. 244, p. 108580, 2022. modal relation detection,” in Proceedings of the 2021 Conference on
22
Empirical Methods in Natural Language Processing, EMNLP 2021, [166] C. Zhu, M. Chen, S. Zhang, C. Sun, H. Liang, Y. Liu, and J. Chen,
Virtual Event / Punta Cana, Dominican Republic, 7-11 November, “SKEAFN: sentiment knowledge enhanced attention fusion network
2021, 2021, pp. 4395–4405. for multimodal sentiment analysis,” Inf. Fusion, vol. 100, p. 101958,
[149] H. Lian, C. Lu, S. Li, Y. Zhao, C. Tang, Y. Zong, and W. Zheng, 2023.
“Label distribution adaptation for multimodal emotion recognition with [167] M. Chen and X. Li, “Swafn: Sentimental words aware fusion network
multi-label learning,” in Proceedings of the 1st International Workshop for multimodal sentiment analysis,” in Proceedings of the 28th interna-
on Multimodal and Responsible Affective Computing, MRAC 2023, tional conference on computational linguistics, 2020, pp. 1067–1077.
Ottawa, ON, Canada, 29 October 2023, 2023, pp. 51–58. [168] Y. Fu, S. Okada, L. Wang, L. Guo, Y. Song, J. Liu, and J. Dang,
[150] “Aobert: All-modalities-in-one bert for multimodal sentiment analysis,” “Context- and knowledge-aware graph convolutional network for mul-
Information Fusion, vol. 92, pp. 37–45, 2023. timodal emotion recognition,” IEEE Multim., vol. 29, no. 3, pp. 91–100,
[151] F. Qian, J. Han, Y. He, T. Zheng, and G. Zheng, “Sentiment knowledge 2022.
enhanced self-supervised learning for multimodal sentiment analysis,” [169] Y. Li, Y. Wang, and Z. Cui, “Decoupled multimodal distilling for
in Findings of the Association for Computational Linguistics: ACL emotion recognition,” in IEEE/CVF Conference on Computer Vision
2023, 2023, pp. 12 966–12 978. and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June
[152] M. Arjmand, M. J. Dousti, and H. Moradi, “TEASEL: A transformer- 17-24, 2023, 2023, pp. 6631–6640.
based speech-prefixed language model,” CoRR, vol. abs/2109.05522, [170] X. Sun, H. He, H. Tang, K. Zeng, and T. Shen, “Multimodal rough
2021. set transformer for sentiment analysis and emotion recognition,” in 9th
[153] J. Yu and J. Jiang, “Adapting BERT for target-oriented multimodal IEEE International Conference on Cloud Computing and Intelligent
sentiment classification,” in Proceedings of the Twenty-Eighth Interna- Systems, CCIS 2023, Dali, China, August 12-13, 2023, 2023, pp. 250–
tional Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, 259.
China, August 10-16, 2019, 2019, pp. 5408–5414. [171] P. Wang, S. Zeng, J. Chen, L. Fan, M. Chen, Y. Wu, and X. He,
[154] J. Cheng, I. Fostiropoulos, B. W. Boehm, and M. Soleymani, “Mul- “Leveraging label information for multimodal emotion recognition,”
timodal phased transformer for sentiment analysis,” in Proceedings CoRR, vol. abs/2309.02106, 2023.
of the 2021 Conference on Empirical Methods in Natural Language [172] P. Yuan, G. Cai, M. Chen, and X. Tang, “Topics guided multimodal fu-
Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican sion network for conversational emotion recognition,” in International
Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and Conference on Intelligent Computing. Springer, 2024, pp. 250–262.
S. W. Yih, Eds. Association for Computational Linguistics, 2021, pp. [173] F. Zhang, X. Li, C. P. Lim, Q. Hua, C. Dong, and J. Zhai, “Deep
2447–2458. emotional arousal network for multimodal sentiment analysis and
[155] H. Zhang, Y. Wang, G. Yin, K. Liu, Y. Liu, and T. Yu, “Learn- emotion recognition,” Inf. Fusion, vol. 88, pp. 296–304, 2022.
ing language-guided adaptive hyper-modality representation for mul- [174] Z. Xu, Q. Su, and J. Xiao, “Multimodal aspect-based sentiment clas-
timodal sentiment analysis,” CoRR, vol. abs/2310.05804, 2023. sification with knowledge-injected transformer,” in IEEE International
[156] J. Li, X. Wang, and Z. Zeng, “Tracing intricate cues in dialogue: Conference on Multimedia and Expo, ICME 2023, Brisbane, Australia,
Joint graph structure and sentiment dynamics for multimodal emotion July 10-14, 2023, 2023, pp. 1379–1384.
recognition,” arXiv preprint arXiv:2407.21536, 2024.
[175] H. Yang, Y. Zhao, and B. Qin, “Face-sensitive image-to-emotional-
[157] K. Liu, J. Wang, and X. Zhang, “Entity-related unsupervised pretraining
text cross-modal translation for multimodal aspect-based sentiment
with visual prompts for multimodal aspect-based sentiment analysis,” in
analysis,” in Proceedings of the 2022 Conference on Empirical Methods
Natural Language Processing and Chinese Computing - 12th National
in Natural Language Processing, EMNLP 2022, Abu Dhabi, United
CCF Conference, NLPCC 2023, Foshan, China, October 12-15, 2023,
Arab Emirates, December 7-11, 2022, 2022, pp. 3324–3335.
Proceedings, Part II, 2023, pp. 481–493.
[176] L. Xiao, X. Wu, S. Yang, J. Xu, J. Zhou, and L. He, “Cross-modal
[158] Y. Ling, J. Yu, and R. Xia, “Vision-language pre-training for multi-
modal aspect-based sentiment analysis,” in Proceedings of the 60th fine-grained alignment and fusion network for multimodal aspect-based
sentiment analysis,” Information Processing & Management, vol. 60,
Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, no. 6, p. 103508, 2023.
2022, pp. 2149–2159. [177] J. Yu, K. Chen, and R. Xia, “Hierarchical interactive multimodal
[159] K. Zhang, K. Zhang, M. Zhang, H. Zhao, Q. Liu, W. Wu, and E. Chen, transformer for aspect-based multimodal sentiment analysis,” IEEE
“Incorporating dynamic semantics into pre-trained language model for Transactions on Affective Computing, 2022.
aspect-based sentiment analysis,” in Findings of the Association for [178] W. Zheng, J. Yu, and R. Xia, “A unimodal valence-arousal driven
Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, contrastive learning framework for multimodal multi-label emotion
2022, 2022, pp. 3599–3610. recognition,” in ACM Multimedia 2024.
[160] Z. Yu, J. Wang, L. Yu, and X. Zhang, “Dual-encoder transformers [179] C. Peng, K. Chen, L. Shou, and G. Chen, “Carat: Contrastive feature
with cross-modal alignment for multimodal aspect-based sentiment reconstruction and aggregation for multi-modal multi-label emotion
analysis,” in Proceedings of the 2nd Conference of the Asia-Pacific recognition,” in Proceedings of the AAAI Conference on Artificial
Chapter of the Association for Computational Linguistics and the Intelligence, vol. 38, no. 13, 2024, pp. 14 581–14 589.
12th International Joint Conference on Natural Language Processing, [180] J. Zhao, Y. Zhao, and J. Li, “M3tr: Multi-modal multi-label recognition
AACL/IJCNLP 2022 - Volume 1: Long Papers, Online Only, November with transformer,” in Proceedings of the 29th ACM international
20-23, 2022, 2022, pp. 414–423. conference on multimedia, 2021, pp. 469–477.
[161] H. Jin, J. Tan, L. Liu, L. Qiu, S. Yao, X. Chen, and X. Zeng, “MSRA: [181] D. Zhang, X. Ju, W. Zhang, J. Li, S. Li, Q. Zhu, and G. Zhou, “Multi-
A multi-aspect semantic relevance approach for e-commerce via mul- modal multi-label emotion recognition with heterogeneous hierarchical
timodal pre-training,” in Proceedings of the 32nd ACM International message passing,” in Proceedings of the AAAI Conference on Artificial
Conference on Information and Knowledge Management, CIKM 2023, Intelligence, vol. 35, no. 16, 2021, pp. 14 338–14 346.
Birmingham, United Kingdom, October 21-25, 2023, 2023, pp. 3988– [182] D. S. Chauhan, M. S. Akhtar, A. Ekbal, and P. Bhattacharyya, “Context-
3992. aware interactive attention for multi-modal sentiment and emotion
[162] Q. Wang, H. Xu, Z. Wen, B. Liang, M. Yang, B. Qin, and R. Xu, analysis,” in Proceedings of the 2019 Conference on Empirical Meth-
“Image-to-text conversion and aspect-oriented filtration for multimodal ods in Natural Language Processing and the 9th International Joint
aspect-based sentiment analysis,” IEEE Transactions on Affective Com- Conference on Natural Language Processing, EMNLP-IJCNLP 2019,
puting, no. 01, pp. 1–15, 2023. Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and
[163] C. Wang, Y. Luo, C. Meng, and F. Yuan, “An adaptive dual graph X. Wan, Eds. Association for Computational Linguistics, 2019, pp.
convolution fusion network for aspect-based sentiment analysis,” ACM 5646–5656.
Transactions on Asian and Low-Resource Language Information Pro- [183] S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, and
cessing, 2024. L. Morency, “Multi-level multiple attentions for contextual multimodal
[164] J. Mu, F. Nie, W. Wang, J. Xu, J. Zhang, and H. Liu, “Mocolnet: sentiment analysis,” in 2017 IEEE International Conference on Data
A momentum contrastive learning network for multimodal aspect- Mining, ICDM 2017, New Orleans, LA, USA, November 18-21, 2017,
level sentiment analysis,” IEEE Transactions on Knowledge and Data 2017, pp. 1033–1038.
Engineering, 2023. [184] M. Huang, C. Qing, J. Tan, and X. Xu, “Context-based adaptive multi-
[165] D. Wang, X. Guo, Y. Tian, J. Liu, L. He, and X. Luo, “TETFN: A modal fusion network for continuous frame-level sentiment prediction,”
text enhanced transformer fusion network for multimodal sentiment IEEE ACM Trans. Audio Speech Lang. Process., vol. 31, pp. 3468–
analysis,” Pattern Recognit., vol. 136, p. 109259, 2023. 3477, 2023.
23
[185] Z. Li, Y. Sun, L. Zhang, and J. Tang, “Ctnet: Context-based tandem [203] Y. Lee, S. Yoon, and K. Jung, “Multimodal speech emotion recog-
network for semantic segmentation,” IEEE Trans. Pattern Anal. Mach. nition using cross attention with aligned audio and text,” CoRR, vol.
Intell., vol. 44, no. 12, pp. 9904–9917, 2022. abs/2207.12895, 2022.
[186] X. Sun, X. Ren, and X. Xie, “A novel multimodal sentiment analysis [204] C. Zong, F. Xia, W. Li, and R. Navigli, Eds., Proceedings of the
model based on gated fusion and multi-task learning,” in ICASSP 2024- 59th Annual Meeting of the Association for Computational Linguistics
2024 IEEE International Conference on Acoustics, Speech and Signal and the 11th International Joint Conference on Natural Language
Processing (ICASSP). IEEE, 2024, pp. 8336–8340. Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual
[187] J. Hu, Y. Liu, J. Zhao, and Q. Jin, “MMGCN: multimodal fusion via Event, August 1-6, 2021. Association for Computational Linguistics,
deep graph convolution network for emotion recognition in conversa- 2021.
tion,” in Proceedings of the 59th Annual Meeting of the Association for [205] F. Wang, S. Tian, L. Yu, J. Liu, J. Wang, K. Li, and Y. Wang,
Computational Linguistics and the 11th International Joint Conference “TEDT: transformer-based encoding-decoding translation network for
on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long multimodal sentiment analysis,” Cogn. Comput., vol. 15, no. 1, pp.
Papers), Virtual Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and 289–303, 2023.
R. Navigli, Eds. Association for Computational Linguistics, 2021, pp. [206] Z. Yu, J. Wang, L.-C. Yu, and X. Zhang, “Dual-encoder transformers
5666–5675. with cross-modal alignment for multimodal aspect-based sentiment
[188] D. Hu, X. Hou, L. Wei, L. Jiang, and Y. Mo, “MM-DFN: multimodal analysis,” in Proceedings of the 2nd Conference of the Asia-Pacific
dynamic fusion network for emotion recognition in conversations,” Chapter of the Association for Computational Linguistics and the
in IEEE International Conference on Acoustics, Speech and Signal 12th International Joint Conference on Natural Language Processing
Processing, ICASSP 2022, Virtual and Singapore, 23-27 May 2022, (Volume 1: Long Papers), 2022, pp. 414–423.
2022, pp. 7037–7041. [207] Y. Ge, D. Chen, and H. Li, “Mutual mean-teaching: Pseudo label re-
finery for unsupervised domain adaptation on person re-identification,”
[189] D. Zhang, F. Chen, J. Chang, X. Chen, and Q. Tian, “Structure
arXiv preprint arXiv:2001.01526, 2020.
aware multi-graph network for multi-modal emotion recognition in
[208] H. Pham, Z. Dai, Q. Xie, and Q. V. Le, “Meta pseudo labels,” in IEEE
conversations,” IEEE Trans. Multim., vol. 26, pp. 3987–3997, 2024.
Conference on Computer Vision and Pattern Recognition, CVPR 2021,
[190] F. Chen, J. Shao, S. Zhu, and H. T. Shen, “Multivariate, multi-frequency virtual, June 19-25, 2021. Computer Vision Foundation / IEEE, 2021,
and multimodal: Rethinking graph neural networks for emotion recog- pp. 11 557–11 568.
nition in conversation,” in IEEE/CVF Conference on Computer Vision [209] Y. Zhang, M. Zhang, S. Wu, and J. Zhao, “Towards unifying the label
and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June space for aspect- and sentence-based sentiment analysis,” in Findings
17-24, 2023, 2023, pp. 10 761–10 770. of the Association for Computational Linguistics: ACL 2022, Dublin,
[191] B. Yao and W. Shi, “Speaker-centric multimodal fusion networks for Ireland, May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio,
emotion recognition in conversations,” in ICASSP 2024 - 2024 IEEE Eds. Association for Computational Linguistics, 2022, pp. 20–30.
International Conference on Acoustics, Speech and Signal Processing [210] Y. Zhang, J. Wang, Y. Liu, L. Rong, Q. Zheng, D. Song, P. Tiwari, and
(ICASSP), 2024, pp. 8441–8445. J. Qin, “A multitask learning model for multimodal sarcasm, sentiment
[192] Z. Li, F. Tang, M. Zhao, and Y. Zhu, “Emocaps: Emotion capsule and emotion recognition in conversations,” Inf. Fusion, vol. 93, pp.
based model for conversational emotion recognition,” arXiv preprint 282–301, 2023.
arXiv:2203.13504, 2022. [211] M. S. Akhtar, D. S. Chauhan, D. Ghosal, S. Poria, A. Ekbal, and
[193] J. Li, X. Wang, G. Lv, and Z. Zeng, “GA2MIF: graph and attention P. Bhattacharyya, “Multi-task learning for multi-modal emotion recog-
based two-stage multi-source information fusion for conversational nition and sentiment analysis,” arXiv preprint arXiv:1905.05812, 2019.
emotion detection,” IEEE Trans. Affect. Comput., vol. 15, no. 1, pp. [212] T. Xie, C. H. Wu, P. Shi, R. Zhong, T. Scholak, M. Yasunaga, C. Wu,
130–143, 2024. M. Zhong, P. Yin, S. I. Wang, V. Zhong, B. Wang, C. Li, C. Boyle,
[194] C. Xu, X. Luo, and D. Wang, “MCPR: A chinese product review A. Ni, Z. Yao, D. R. Radev, C. Xiong, L. Kong, R. Zhang, N. A. Smith,
dataset for multimodal aspect-based sentiment analysis,” in Cognitive L. Zettlemoyer, and T. Yu, “Unifiedskg: Unifying and multi-tasking
Computing - ICCC 2022 - 6th International Conference, Held as Part structured knowledge grounding with text-to-text language models,”
of the Services Conference Federation, SCF 2022, Honolulu, HI, USA, CoRR, vol. abs/2201.05966, 2022.
December 10-14, 2022, Proceedings, 2022, pp. 83–90. [213] J. Lu, D. Batra, D. Parikh, and S. Lee, “Vilbert: Pretraining task-
[195] M. Anschütz, T. Eder, and G. Groh, “Retrieving users’ opinions on agnostic visiolinguistic representations for vision-and-language tasks,”
social media with multimodal aspect-based sentiment analysis,” in 17th in Advances in Neural Information Processing Systems 32: Annual
IEEE International Conference on Semantic Computing, ICSC 2023, Conference on Neural Information Processing Systems 2019, NeurIPS
Laguna Hills, CA, USA, February 1-3, 2023, 2023, pp. 1–8. 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach,
[196] F. Zhao, C. Li, Z. Wu, Y. Ouyang, J. Zhang, and X. Dai, “M2DF: H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Gar-
multi-grained multi-curriculum denoising framework for multimodal nett, Eds., 2019, pp. 13–23.
aspect-based sentiment analysis,” CoRR, vol. abs/2310.14605, 2023. [214] W. Wang, H. Bao, L. Dong, and F. Wei, “Vlmo: Unified vision-
[197] R. Zhou, W. Guo, X. Liu, S. Yu, Y. Zhang, and X. Yuan, “Aom: language pre-training with mixture-of-modality-experts,” CoRR, vol.
Detecting aspect-oriented information for multimodal aspect-based abs/2111.02358, 2021.
sentiment analysis,” in Findings of the Association for Computational [215] Z. Zhang, X. Meng, Y. Wang, X. Jiang, Q. Liu, and Z. Yang, “Unims:
Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, 2023, pp. A unified framework for multimodal summarization with knowledge
8184–8196. distillation,” in Thirty-Sixth AAAI Conference on Artificial Intelligence,
AAAI 2022, Thirty-Fourth Conference on Innovative Applications of
[198] J. Zhao and F. Yang, “Fusion with gcn and se-resnext network for
Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Edu-
aspect based multimodal sentiment analysis,” in 2023 IEEE 6th In-
cational Advances in Artificial Intelligence, EAAI 2022 Virtual Event,
formation Technology, Networking, Electronic and Automation Control
February 22 - March 1, 2022, 2022, pp. 11 757–11 764.
Conference (ITNEC), vol. 6, 2023, pp. 336–340.
[216] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis,
[199] D. Zhang, X. Ju, J. Li, S. Li, Q. Zhu, and G. Zhou, “Multi-modal L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized BERT
multi-label emotion detection with modality and label dependence,” in pretraining approach,” CoRR, vol. abs/1907.11692, 2019.
Proceedings of the 2020 Conference on Empirical Methods in Natural [217] S. Qiu, N. Sekhar, and P. Singhal, “Topic and style-aware transformer
Language Processing, EMNLP 2020, Online, November 16-20, 2020, for multimodal emotion recognition,” in Findings of the Association
2020, pp. 3584–3593. for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-
[200] X. Ju, D. Zhang, J. Li, and G. Zhou, “Transformer-based label set 14, 2023, 2023, pp. 2074–2082.
generation for multi-modal multi-label emotion detection,” in MM ’20: [218] C. Xi, G. Lu, and J. Yan, “Multimodal sentiment analysis based on
The 28th ACM International Conference on Multimedia, Virtual Event multi-head attention mechanism,” in Proceedings of the 4th interna-
/ Seattle, WA, USA, October 12-16, 2020, 2020, pp. 512–520. tional conference on machine learning and soft computing, 2020, pp.
[201] J. Zhao, T. Zhang, J. Hu, Y. Liu, Q. Jin, X. Wang, and H. Li, “M3ED: 34–39.
multi-modal multi-scene multi-label emotional dialogue database,” in [219] Y. Zhang, D. Song, P. Zhang, P. Wang, J. Li, X. Li, and B. Wang, “A
ACL 2022, 2022, pp. 5699–5710. quantum-inspired multimodal sentiment analysis framework,” Theoret-
[202] H. Luo, L. Ji, Y. Huang, B. Wang, S. Ji, and T. Li, “Scalevlad: ical Computer Science, vol. 752, pp. 21–40, 2018.
Improving multimodal sentiment analysis via multi-scale fusion of [220] A. Metallinou, M. Wollmer, A. Katsamanis, F. Eyben, B. Schuller, and
locally descriptors,” arXiv preprint arXiv:2112.01368, 2021. S. Narayanan, “Context-sensitive learning for enhanced audiovisual
24
emotion classification,” IEEE Transactions on Affective Computing, I. Gurevych and Y. Miyao, Eds. Association for Computational
vol. 3, no. 2, pp. 184–198, 2012. Linguistics, 2018, pp. 2236–2246.
[221] Y. Li, K. Zhang, J. Wang, and X. Gao, “A cognitive brain model for [239] W. Yu, H. Xu, F. Meng, Y. Zhu, Y. Ma, J. Wu, J. Zou, and K. Yang,
multimodal sentiment analysis based on attention neural networks,” “Ch-sims: A chinese multimodal sentiment analysis dataset with fine-
Neurocomputing, vol. 430, pp. 159–173, 2021. grained annotation of modality,” in Proceedings of the 58th annual
[222] I. Chaturvedi, R. Satapathy, S. Cavallari, and E. Cambria, “Fuzzy meeting of the association for computational linguistics, 2020, pp.
commonsense reasoning for multimodal sentiment analysis,” Pattern 3718–3727.
Recognition Letters, vol. 125, pp. 264–270, 2019. [240] A. Zadeh, Y. S. Cao, S. Hessner, P. P. Liang, S. Poria, and L.-P.
[223] W. Wu, Y. Wang, S. Xu, and K. Yan, “Sfnn: semantic features Morency, “Cmu-moseas: A multimodal language dataset for spanish,
fusion neural network for multimodal sentiment analysis,” in 2020 portuguese, german and french,” in Proceedings of the Conference on
5th International Conference on Automation, Control and Robotics Empirical Methods in Natural Language Processing. Conference on
Engineering (CACRE), 2020, pp. 661–665. Empirical Methods in Natural Language Processing, vol. 2020, 2020,
[224] W. Zheng, J. Yu, R. Xia, and S. Wang, “A facial expression-aware p. 1801.
multimodal multi-task learning framework for emotion recognition in [241] M. Wöllmer, F. Weninger, T. Knaup, B. W. Schuller, C. Sun, K. Sagae,
multi-party conversations,” in Proceedings of the 61st Annual Meeting and L. Morency, “Youtube movie reviews: Sentiment analysis in an
of the Association for Computational Linguistics (Volume 1: Long audio-visual context,” IEEE Intell. Syst., vol. 28, no. 3, pp. 46–53,
Papers), 2023, pp. 15 445–15 459. 2013.
[225] H. Ma, J. Wang, H. Lin, B. Zhang, Y. Zhang, and B. Xu, “A [242] Y. Liu, Z. Yuan, H. Mao, Z. Liang, W. Yang, Y. Qiu, T. Cheng, X. Li,
transformer-based model with self-distillation for multimodal emotion H. Xu, and K. Gao, “Make acoustic and visual cues matter: Ch-sims
recognition in conversations,” CoRR, vol. abs/2310.20494, 2023. v2. 0 dataset and av-mixup consistent module,” in Proceedings of the
[226] Y. Wang, Y. Li, P. Bell, and C. Lai, “Cross-attention is not enough: 2022 International Conference on Multimodal Interaction, 2022, pp.
Incongruity-aware multimodal sentiment analysis and emotion recog- 247–258.
nition,” CoRR, vol. abs/2305.13583, 2023. [243] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and