Abstract Multimodal sentiment analysis has become an important research area in the field of artificial intelligence. With the latest advances in deep learning, this technology has reached new heights. It has great potential for both application and research, making it a popular research topic. This review provides an overview of the definition, background, and development of multimodal sentiment analysis. It also covers recent datasets and advanced models, emphasizing the challenges and future prospects of this technology. Finally, it looks ahead to future research directions. It should be noted that this review provides constructive suggestions for promising research directions and for building better-performing multimodal sentiment analysis models, which can help researchers in this field.

Keywords Multimodal Sentiment Analysis · Multimodal Fusion · Affective Computing · Computer Vision

Songning Lai ([email protected]), Xifeng Hu ([email protected]), Haoxuan Xu ([email protected]) and Zhi Liu ([email protected]) are with the School of Information Science and Engineering, Shandong University, Qingdao, China. Zhaoxia Ren ([email protected]) is with the Assets and Laboratory Management Department, Shandong University, Qingdao, China. Corresponding authors: Zhi Liu and Zhaoxia Ren ([email protected], [email protected]).

1 Introduction

Emotion is a subjective reaction of an organism to external stimuli [?, 1]. Humans possess a powerful capacity for sentiment analysis, and researchers are currently exploring ways to make this ability available to artificial agents [2].

Sentiment analysis involves analyzing sentiment polarity through available information [3, 4]. With the rapid development of fields such as artificial intelligence, computer vision, and natural language processing, it is becoming increasingly possible for artificial agents to perform sentiment analysis. Sentiment analysis is an interdisciplinary research area spanning computer science, psychology, social science, and other fields [5–7]. Scientists have been working for decades to endow AI agents with sentiment analysis capabilities, a key component of human-like AI.

Sentiment analysis has significant research value [8–11]. With the explosive growth of Internet data, vendors can use evaluative data such as reviews and review videos to improve their products. Sentiment analysis also has many practical uses, such as lie detection, interrogation, and entertainment. The following sections elaborate on the application and research value of sentiment analysis.

In the past, sentiment analysis mostly focused on a single modality (visual, speech, or text) [12]. Text-based sentiment analysis [13–15] has come a long way in NLP. Vision-based sentiment analysis pays more attention to human facial expressions [16] and body postures. Speech-based sentiment analysis mainly extracts features such as pitch, timbre, and tempo from speech [17]. With the development of deep learning, all three modalities have gained a foothold in sentiment analysis.

However, using a single modality for sentiment analysis has limitations [18–21]. The emotional information contained in a single modality is limited and incomplete. Combining information from multiple modalities can reveal deeper emotional polarity.
Fig. 1 The architecture of a classical multimodal sentiment analysis model. The overall architecture consists of three parts: feature extraction for each individual modality, fusion of the per-modality features, and sentiment analysis of the fused features. All three parts are important, and researchers have begun to optimize them one by one.
Analyzing only one modality yields limited results and makes it difficult to accurately analyze the emotion behind an action.

Researchers have gradually realized the need for multimodal sentiment analysis, and many multimodal sentiment analysis models have emerged to accomplish this task. Text features dominate and play a key role in the analysis of deep emotions [22]. Extracting expression and pose features from the visual modality can effectively aid text-based sentiment analysis and judgment [23]. The speech modality can, on the one hand, yield text features and, on the other hand, reveal the speaker's tone at each time point [24]. Figure 1 shows the architecture of a classical multimodal sentiment analysis model. The overall architecture consists of three parts: feature extraction for each individual modality, fusion of the per-modality features, and sentiment analysis of the fused features. All three parts are important, and researchers have begun to optimize them one by one [25].
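To make this three-part pipeline concrete, the following minimal sketch (a hypothetical PyTorch example, not any specific published model; the feature dimensions are assumptions) wires per-modality encoders, a simple concatenation-based fusion step, and a sentiment regression head.

import torch
import torch.nn as nn

class SimpleMSAModel(nn.Module):
    """Minimal three-stage pipeline: per-modality encoders -> fusion -> sentiment head."""
    def __init__(self, text_dim=768, audio_dim=74, visual_dim=35, hidden=128):
        super().__init__()
        # Stage 1: feature extraction (one small encoder per modality).
        self.text_enc = nn.GRU(text_dim, hidden, batch_first=True)
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.visual_enc = nn.GRU(visual_dim, hidden, batch_first=True)
        # Stage 2: fusion (here, plain concatenation followed by an MLP).
        self.fusion = nn.Sequential(nn.Linear(3 * hidden, hidden), nn.ReLU())
        # Stage 3: sentiment prediction (a regression score, e.g. in [-3, 3]).
        self.head = nn.Linear(hidden, 1)

    def forward(self, text, audio, visual):
        # Each input: (batch, seq_len, feature_dim); keep the final hidden state.
        _, t = self.text_enc(text)
        _, a = self.audio_enc(audio)
        _, v = self.visual_enc(visual)
        fused = self.fusion(torch.cat([t[-1], a[-1], v[-1]], dim=-1))
        return self.head(fused)

model = SimpleMSAModel()
score = model(torch.randn(4, 20, 768), torch.randn(4, 50, 74), torch.randn(4, 50, 35))
print(score.shape)  # torch.Size([4, 1])

Most of the models surveyed below can be read as refinements of one or more of these three stages.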
In this review, we provide a comprehensive overview of the field of multimodal sentiment analysis. The review includes a summary and brief introduction of datasets, which can help researchers select appropriate datasets. We compare and analyze models of significant research value in multimodal sentiment analysis and provide suggestions for model construction. We elaborate on three types of modal fusion methods and explain the advantages and disadvantages of each. Finally, we look ahead to the challenges and future development directions of multimodal sentiment analysis, providing several promising research directions. Compared to other reviews in the same field, our focus is on providing constructive suggestions for promising research directions and for building better-performing multimodal sentiment analysis models. We emphasize the challenges and future prospects of these technologies.

2 Multimodal Sentiment Analysis Datasets

With the growth of the Internet, an era of data explosion has arrived [26–28]. Numerous researchers have collected such data from the Internet (videos, reviews, etc.) and built sentiment datasets according to their own needs. Table 1 summarizes the commonly used multimodal datasets. The first column indicates the name of the dataset. The second column is the year in which it was released. The third column lists the modalities it includes. The fourth column is the platform from which the data came. The fifth column is the language of the dataset. The sixth column is the amount of data it contains. Each dataset has its own characteristics. This section lists well-known datasets in the community, aiming to help researchers sort out the characteristics of each dataset and make it easier to choose among them.

Table 1 Commonly used multimodal datasets. The first column indicates the name of the dataset. The second column is the year in which it was released. The third column lists the modalities it includes. The fourth column is the platform from which the data came. The fifth column is the language of the dataset. The sixth column is the amount of data it contains.

2.1 IEMOCAP [29]
IEMOCAP, a sentiment analysis dataset released by the Speech Analysis and Interpretation Laboratory in 2008, is a multimodal dataset comprising 1,039 conversational segments with a total video length of 12 hours. Participants in the study engaged in five different scenarios, performing emotions according to pre-set scenarios. The dataset includes not only audio, video, and text information but also facial expression and posture information obtained through additional sensors. Data points are categorized into ten emotions: neutral, happy, sad, angry, surprised, scared, disgusted, frustrated, excited, and other. Overall, IEMOCAP provides a rich resource for researchers exploring sentiment analysis across multiple modalities.

2.2 DEAP [30]

DEAP is a multimodal dataset for emotion analysis based on physiological signals (including EEG and peripheral physiological recordings) collected while participants watched music videos [30].

2.3 CMU-MOSI [31]

The CMU-MOSI dataset comprises 93 opinion videos collected from YouTube that cover a range of topics [31]. These videos were carefully selected to ensure that each features a single speaker facing the camera, allowing for clear capture of facial expressions. While there were no restrictions on camera model, distance, or scene, all presentations and comments were made in English by 89 different speakers, including 41 women and 48 men. The 93 videos were divided into 2,199 subjective opinion segments and annotated with sentiment intensity ranging from strongly negative to strongly positive (-3 to 3). Overall, the CMU-MOSI dataset provides a valuable resource for researchers studying sentiment analysis.
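CMU-MOSI-style labels are continuous sentiment scores in [-3, 3]; many papers additionally report binary and 7-class accuracy derived from that score. The helper below is an illustrative sketch of such a conversion (an assumption about common evaluation practice, not part of any official toolkit).

def mosi_score_to_labels(score: float):
    """Map a CMU-MOSI-style sentiment score in [-3, 3] to common discrete labels."""
    binary = "negative" if score < 0 else "positive"           # sign-based 2-class label
    seven_class = int(round(max(-3.0, min(3.0, score)))) + 3   # clamp and round -> 0..6
    return binary, seven_class

print(mosi_score_to_labels(1.8))   # ('positive', 5)
print(mosi_score_to_labels(-0.4))  # ('negative', 3)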
Sentiment intensity markers range from strongly negative to strongly positive (-3 to 3). Overall, CMU-MOSEI is an invaluable resource for researchers exploring sentiment analysis across multiple modalities.

MELD is a comprehensive dataset that includes video clips from the popular television series Friends. The dataset comprises textual, audio, and video information that corresponds to the textual data. It contains 1,400 videos, which are further divided into 13,000 individual segments. Each segment is annotated with one of seven emotion categories (anger, disgust, sadness, joy, neutral, surprise, and fear) and one of three sentiment labels (positive, negative, and neutral).

... includes whether the speaker expressed an opinion or made an objective statement. Emotions are divided into six categories for each sentence: happiness, sadness, fear, disgust, surprise, and anger.

FACTIFY is a fake news detection dataset that focuses on fact verification. It includes data for both the image and text modalities and contains 50,000 sets of data. Most of the claims refer to politics and government. The dataset is annotated into three categories: support, no evidence, and refutation. It is a valuable resource for researchers interested in detecting and combating the spread of fake news.
Another approach is to use recurrent neural networks and adversarial learning to learn joint representations between different modalities, thereby improving single-modal representations and handling missing modalities or noise.

3.1.5 MCTN (Multimodal Cyclic Translation Network) [43]

A medium-term, model-based multimodal fusion approach involves feeding multimodal data into the network, where the intermediate layers of the model perform feature fusion between the modalities. Model-based modality fusion methods can select the location of modality feature fusion to achieve intermediate interactions. Model-based fusion typically uses multiple kernel learning, neural networks, graph models, or alternative methods.
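As a minimal illustration of model-based (intermediate) fusion, the sketch below performs fusion inside the network via cross-modal attention rather than on raw features or final decisions (an illustrative example only; the layer sizes and the choice of cross-modal attention are assumptions, not a description of any surveyed model).

import torch
import torch.nn as nn

class IntermediateFusionBlock(nn.Module):
    """Fuses modality streams inside the network via cross-modal attention."""
    def __init__(self, dim=128, heads=4, text_dim=300, audio_dim=74):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)
        self.audio_proj = nn.Linear(audio_dim, dim)
        # Text queries attend over audio keys/values in an intermediate layer.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, text_feats, audio_feats):
        t = self.text_proj(text_feats)     # (batch, seq_t, dim)
        a = self.audio_proj(audio_feats)   # (batch, seq_a, dim)
        fused, _ = self.cross_attn(query=t, key=a, value=a)
        return self.ff(fused + t)          # residual connection keeps the text stream

block = IntermediateFusionBlock()
out = block(torch.randn(2, 20, 300), torch.randn(2, 50, 74))
print(out.shape)  # torch.Size([2, 20, 128])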
Fig. 3 The overall framework of medium-term, model-based approaches to multimodal fusion. This class of methods feeds the feature information of each modality into multiple kernel learning, neural networks, graph models, or alternative methods to complete the fusion of modalities. Most of the nodes of its modal fusion are variable.

3.2.2 BERT-like (Self Supervised Models to Improve Multimodal Speech Emotion Recognition) [45]

This model is a Transformer-based multimodal sentiment analysis method that leverages the self-attention mechanism to achieve alignment and fusion between text and image. The model adopts a self-supervised learning method, which can effectively handle multimodal emotion recognition tasks with high accuracy and robustness. However, the model may be affected by the quality of the data and annotations.
... and attention mechanisms are used to extract visual features. Visual features are mapped to text features and combined with text modality features for sentiment analysis.

4.0.5 MISA [53]

The proposed model presents a novel multimodal sentiment analysis framework. Each modality is mapped into two distinct feature spaces after feature extraction. One feature space mainly learns the invariant features of the modality, and the other learns its unique features.

4.0.6 MAG-BERT [54]

The authors propose a multimodal adaptation gate (MAG) architecture and apply it to BERT. The model can receive input from multiple modalities during fine-tuning. MAG can be thought of as a vector embedding structure that allows us to input multimodal information and embed it as a sequence for BERT.

4.0.7 TIMF [55]

The main idea of this model is that each modality learns features separately, and the features of each modality are then fused. In the feature fusion stage, the fusion of the per-modality features is implemented by a tensor fusion network. In the decision fusion stage, the upstream results are fused by soft fusion to adjust the decision results.
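The tensor-fusion step mentioned above can be illustrated with the classic outer-product construction (a simplified sketch of the general idea, not the exact TIMF implementation).

import torch

def tensor_fusion(z_text, z_audio, z_visual):
    """Outer-product tensor fusion of per-modality embeddings (a constant 1 is appended
    so that unimodal and bimodal interaction terms are retained as well)."""
    ones = torch.ones(z_text.size(0), 1)
    t = torch.cat([z_text, ones], dim=1)    # (batch, dt + 1)
    a = torch.cat([z_audio, ones], dim=1)   # (batch, da + 1)
    v = torch.cat([z_visual, ones], dim=1)  # (batch, dv + 1)
    # Triple outer product: every uni-, bi- and tri-modal interaction term.
    fused = torch.einsum("bi,bj,bk->bijk", t, a, v)
    return fused.flatten(start_dim=1)       # (batch, (dt+1)*(da+1)*(dv+1))

fused = tensor_fusion(torch.randn(4, 32), torch.randn(4, 16), torch.randn(4, 16))
print(fused.shape)  # torch.Size([4, 9537])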
4.0.8 Auto-ML based Fusion [56]

The authors propose to combine individual text and image sentiment analyses into a final fused classification based on AutoML. This approach combines the individual classifiers into a final classification using the best model generated by AutoML. This is a typical model for decision-level fusion.
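Decision-level fusion of this kind can be sketched as combining the class probabilities of independently trained unimodal classifiers. The example below uses fixed weights for illustration; in [56] the combiner itself is selected by AutoML.

import numpy as np

def decision_level_fusion(prob_text, prob_image, weights=(0.6, 0.4)):
    """Weighted soft voting over unimodal class-probability vectors."""
    fused = weights[0] * np.asarray(prob_text) + weights[1] * np.asarray(prob_image)
    return fused / fused.sum()  # renormalise to a probability distribution

# Hypothetical 3-class (negative / neutral / positive) outputs of two unimodal models.
p = decision_level_fusion([0.1, 0.2, 0.7], [0.3, 0.4, 0.3])
print(p, p.argmax())  # highest mass falls on the "positive" class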
4.0.9 Self-MM [57]
In [57], the authors combine self-supervised learning and multi-task learning to construct a novel multimodal sentiment analysis architecture. To learn the private information of each modality, the authors construct a single-modal label generation module (ULGM) based on self-supervised learning. The loss function corresponding to this module is designed to incorporate the private features learned by the three self-supervised learning subtasks into the original multimodal sentiment analysis model using a weight adjustment strategy. The proposed model performs well, and the self-supervised ULGM module is also able to calibrate single-modal labels.
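The multi-task training idea can be sketched as a weighted sum of the multimodal loss and three unimodal losses computed against self-generated labels (a simplified illustration; Self-MM's actual ULGM update rules and weighting strategy are more involved [57]).

import torch
import torch.nn.functional as F

def self_mm_style_loss(pred_m, label_m, unimodal_preds, unimodal_labels, weights):
    """Weighted multi-task loss: one multimodal term plus one term per modality.

    In Self-MM the unimodal labels would come from a ULGM-style self-supervised
    label generator; here they are simply passed in as tensors."""
    loss = F.l1_loss(pred_m, label_m)
    for w, p, y in zip(weights, unimodal_preds, unimodal_labels):
        loss = loss + w * F.l1_loss(p, y)
    return loss

pred_m, label_m = torch.randn(8, 1), torch.randn(8, 1)
uni_preds = [torch.randn(8, 1) for _ in range(3)]   # text, audio, vision heads
uni_labels = [torch.randn(8, 1) for _ in range(3)]  # self-generated unimodal labels
print(self_mm_style_loss(pred_m, label_m, uni_preds, uni_labels, weights=[0.5, 0.3, 0.2]))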
4.0.10 DISRFN [58]

The model is a dynamic invariant-specific representation fusion network. First, the joint domain separation network is improved to obtain a joint domain-separated representation for all modalities, so that redundant information can be effectively utilized. Second, a hierarchical graph fusion network (HGFN) is used to dynamically fuse the feature information of each modality and learn the features of multiple modal interactions. At the same time, a loss function that improves the fusion effect is constructed to help the model learn the representation information of each modality in the subspace.

4.0.11 TEDT [59]

This model proposes a multimodal encoding-decoding translation network with a transformer to address the challenges of multimodal sentiment analysis, specifically the impact of individual modal data and the poor quality of non-natural-language features. The proposed method uses text as the primary information and sound and image as the secondary information, and a modality reinforcement cross-attention module converts non-natural-language features into natural-language features to improve their quality. Additionally, a dynamic filtering mechanism filters out error information generated in the cross-modal interaction. The strength of this model lies in its ability to improve the effect of multimodal fusion and more accurately analyze human sentiment. However, it may require significant computational resources and may not be suitable for real-time analysis.

4.0.12 TETFN [60]

The Text Enhanced Transformer Fusion Network (TETFN) is a novel method for multimodal sentiment analysis (MSA) that addresses the challenge of the different contributions of the textual, visual, and acoustic modalities. The proposed method learns text-oriented pairwise cross-modal mappings to obtain effective unified multimodal representations. It incorporates textual information in learning sentiment-related nonlinguistic representations through text-based multi-head attention and retains differentiated information among modalities through unimodal label prediction. Additionally, the vision pre-trained model Vision Transformer is utilized to extract visual features from the original videos to preserve both global and local information of a human face. The strength of this model lies in its ability to incorporate textual information to improve the effectiveness of nonlinguistic modalities in MSA while preserving inter- and intra-modality relationships.

4.0.13 SPIL [61]

This model proposes a deep modal shared-information learning module for effective representation learning in multimodal sentiment analysis tasks. The proposed module captures both shared and private information in a complete modal representation, using a covariance matrix to capture shared information between modalities and a self-supervised learning strategy to capture private information. The module is plug-and-play and can adjust the information exchange relationship between modalities to learn private or shared information. Additionally, a multi-task learning strategy is employed to help the model focus its attention on modality-differentiated training data. The proposed model outperforms current state-of-the-art methods on most metrics of three public datasets, and further combinatorial techniques for using the module are explored.
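The role of the covariance matrix can be illustrated with a rough sketch of a cross-covariance objective that encourages two modality representations to share structure (a simplified illustration only, not the exact SPIL formulation [61]).

import torch

def cross_covariance_alignment(z_a, z_b):
    """Encourage shared structure by rewarding cross-covariance between two batches
    of modality embeddings (returned as a loss to be minimised)."""
    z_a = z_a - z_a.mean(dim=0, keepdim=True)
    z_b = z_b - z_b.mean(dim=0, keepdim=True)
    cov = (z_a.T @ z_b) / (z_a.size(0) - 1)  # (d_a, d_b) cross-covariance matrix
    return -cov.pow(2).mean()                # more shared covariance -> lower loss

loss = cross_covariance_alignment(torch.randn(16, 64), torch.randn(16, 64))
print(loss)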
5 Model comparison and suggestions

This section evaluates state-of-the-art multimodal sentiment analysis models: DFF-ATMF, MAG-BERT, TIMF, Self-MM, and DISRFN, followed by TEDT, TETFN, and SPIL. While DFF-ATMF does not consider the vision modality, the other models analyze sentiment from the three modalities of audio, text, and vision.

For the interaction relations of multimodal data, DFF-ATMF and TIMF build transformer-based models to learn complex relationships among the data. MAG-BERT uses a simple yet effective multimodal adaptive gate fusion strategy. Self-MM uses self-supervised multi-task learning as the fusion strategy, with self-supervised generation of single-modal labels that are combined to complete the multimodal sentiment analysis task. DISRFN uses a dynamic invariant-specific representation fusion network to obtain jointly domain-separated representations of all modalities and dynamically fuses them through a hierarchical graph fusion network.
DFF-ATMF uses two parallel branches to fuse the audio and text modalities. Its core mechanisms are feature vector fusion and multimodal attention fusion, which can learn more comprehensive sentiment information. However, due to the use of multi-layer neural networks and sophisticated fusion methods, overfitting may occur. Advantages: simple structure, easy to implement, and able to learn comprehensive sentiment information through feature vector fusion and multimodal attention fusion. Disadvantages: does not consider the vision modality, and may suffer from overfitting due to the use of multi-layer neural networks and sophisticated fusion methods.

MAG-BERT adapts the interior of BERT using multimodal adaptation gates, which employ a simple yet effective fusion strategy without changing the structure or parameters of BERT. However, the multimodal attention can only be performed within the same timestep, not across timesteps, which may ignore some temporal relationships. Additionally, MAG-BERT requires freezing the parameters of BERT rather than fine-tuning it, which may result in a BERT representation that is not adapted to a specific task or domain. Advantages: uses a simple yet effective multimodal adaptive gate fusion strategy without changing the structure or parameters of BERT. Disadvantages: multimodal attention can only be performed within the same timestep, not across timesteps, which may ignore some temporal relationships; freezing the parameters of BERT may result in a representation that is not adapted to a specific task or domain.
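The gating idea behind MAG can be roughly sketched as follows: audio and visual features produce a gated displacement that is added to each word representation before it re-enters BERT (a simplified illustration of the mechanism in [54]; the gate form, scaling, and feature sizes here are assumptions).

import torch
import torch.nn as nn

class MultimodalAdaptationGate(nn.Module):
    """Shifts word vectors by a gated displacement computed from audio/visual features."""
    def __init__(self, text_dim=768, audio_dim=74, visual_dim=47, beta=1.0):
        super().__init__()
        self.gate_a = nn.Linear(text_dim + audio_dim, 1)
        self.gate_v = nn.Linear(text_dim + visual_dim, 1)
        self.proj_a = nn.Linear(audio_dim, text_dim)
        self.proj_v = nn.Linear(visual_dim, text_dim)
        self.beta = beta

    def forward(self, h_text, h_audio, h_visual):
        # Gates decide how much audio/visual information to inject per token.
        g_a = torch.sigmoid(self.gate_a(torch.cat([h_text, h_audio], dim=-1)))
        g_v = torch.sigmoid(self.gate_v(torch.cat([h_text, h_visual], dim=-1)))
        shift = g_a * self.proj_a(h_audio) + g_v * self.proj_v(h_visual)
        # Scale the displacement and add it to the word representation.
        return h_text + self.beta * shift

mag = MultimodalAdaptationGate()
out = mag(torch.randn(2, 20, 768), torch.randn(2, 20, 74), torch.randn(2, 20, 47))
print(out.shape)  # torch.Size([2, 20, 768])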
TIMF leverages the self-attention mechanism of Transformers to learn complex interactions between multimodal data and generate unified sentiment representations. While it has the advantage of being able to learn complex relationships between modalities, it may suffer from high computational complexity, long training times, and the need for large amounts of labeled data. Advantages: leverages the self-attention mechanism of Transformers to learn complex interactions between multimodal data and generate unified sentiment representations. Disadvantages: may suffer from high computational complexity, long training times, and the need for large amounts of labeled data.

Self-MM is a self-supervised multimodal sentiment analysis model that uses a multi-task learning strategy to learn both multimodal and unimodal emotion recognition tasks. Its advantage is that it can generate single-modal labels using a self-supervised approach, saving the cost and time of manual labeling. However, interference and imbalance between multiple tasks can occur, and an appropriate weight adjustment strategy needs to be designed to balance the learning progress of the different tasks. Advantages: self-supervised multi-task learning as the fusion strategy, with self-supervised generation of single-modal labels that are combined to complete the multimodal sentiment analysis task. Disadvantages: may require a large amount of labeled data to achieve good performance, and the self-supervised learning process may be computationally expensive.

DISRFN is a deep residual network-based multimodal sentiment analysis model that exploits a dynamic invariant-specific representation fusion network to improve sentiment recognition capability. Its advantage is that it can efficiently utilize redundant information to obtain joint domain-separated representations of all modalities through a modified joint domain separation network and dynamically fuse each representation through a hierarchical graph fusion network to obtain the interaction information of multimodal data. However, as with Self-MM, interference and imbalance between multiple tasks can occur, and a suitable weight adjustment strategy needs to be designed to balance the learning progress of the different tasks. Advantages: uses a dynamic invariant-specific representation fusion network to obtain jointly domain-separated representations of all modalities and dynamically fuses each representation through a hierarchical graph fusion network. Disadvantages: may require a large amount of labeled data to achieve good performance, and the hierarchical graph fusion network may be computationally expensive.

TEDT proposes a multimodal encoding-decoding translation network with a transformer to address the challenges of multimodal sentiment analysis. The strength of this model lies in its ability to improve the effect of multimodal fusion and more accurately analyze human sentiment. By incorporating the modality reinforcement cross-attention module and the dynamic filtering mechanism, the model is able to address the impact of individual modal data and the poor quality of non-natural-language features. To build effective multimodal sentiment analysis models, it is recommended to carefully consider the contribution of each modality and how to integrate them effectively. Attention should also be paid to challenges such as the impact of individual modal data and the poor quality of non-natural-language features. Finally, it is important to consider the computational requirements of the model and ensure that it is suitable for the intended use case.
Table 3 Performance of the DFF-ATMF, MAG-BERT, TIMF, Self-MM, DISRFN, TEDT, TETFN, and SPIL models on the CMU-MOSI and CMU-MOSEI datasets. The evaluation metrics are MAE, Corr, Acc, and F1-Score.

                        CMU-MOSI                        CMU-MOSEI
Model        MAE    Corr   Acc    F1-Score    MAE    Corr   Acc    F1-Score
DFF-ATMF     –      –      80.9   81.3        –      –      77.2   78.3
MAG-BERT     0.712  0.796  –      86          0.623  0.677  82     82.1
TIMF         0.373  0.93   92.3   92.3        0.645  0.669  79.5   79.5
Self-MM      0.723  0.797  84.8   84.8        0.534  0.764  84.1   84.1
DISRFN       0.798  0.734  83.4   83.6        0.591  0.78   87.5   87.5
TEDT         0.709  0.812  0.893  0.892       0.524  0.749  0.862  0.861
TETFN        0.717  0.800  0.841  0.838       0.551  0.748  0.843  0.842
SPIL         0.704  0.794  0.851  0.854       0.523  0.766  0.850  0.849
TETFN is a novel method for MSA that addresses the challenge of the different contributions of the textual, visual, and acoustic modalities. Compared to the TEDT model, the TETFN model focuses on incorporating textual information to improve the effectiveness of nonlinguistic modalities in MSA while preserving inter- and intra-modality relationships. The TETFN model achieves this by using text-based multi-head attention and unimodal label prediction to retain differentiated information among modalities. In contrast, the TEDT model uses a modality reinforcement cross-attention module to convert non-natural-language features into natural-language features and a dynamic filtering mechanism to filter out error information generated in the cross-modal interaction. The strength of the TETFN model lies in its ability to effectively incorporate textual information to improve the effectiveness of nonlinguistic modalities in MSA while preserving inter- and intra-modality relationships. Additionally, the use of the vision pre-trained model Vision Transformer helps to extract visual features from the original videos, preserving both global and local information of a human face. To build effective multimodal sentiment analysis models, it is recommended to carefully consider the contribution of each modality and how to integrate them effectively, and to address challenges such as the impact of individual modal data and the poor quality of non-natural-language features.

SPIL proposes a deep modal shared-information learning module for effective representation learning in multimodal sentiment analysis tasks (Section 4.0.13). Compared to the TEDT and TETFN models, the SPIL model also focuses on capturing shared and private information in a complete modal representation. However, the SPIL model uses a covariance matrix to capture shared information between modalities and a self-supervised learning strategy to capture private information, while the TETFN model uses text-based multi-head attention and unimodal label prediction to retain differentiated information among modalities. The SPIL model also employs a multi-task learning strategy to help the model focus its attention on modality-differentiated training data, while the TEDT and TETFN models do not explicitly mention this. The strength of the SPIL model lies in its ability to capture both shared and private information in a complete modal representation, which can be adjusted based on the specific task at hand, and the multi-task learning strategy further improves performance by focusing attention on modality-differentiated training data. The SPIL model's approach of capturing both shared and private information in a complete modal representation is worth considering in future models.

Table 3 shows the performance metrics of these models on the CMU-MOSI and CMU-MOSEI datasets. Based on the performance of the models on these datasets, we recommend using BERT to extract features from text information, while using LSTMs to extract features for the video and audio modalities, since these streams require capturing modality information over time.
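That recommendation can be sketched as follows (illustrative only; it assumes the Hugging Face transformers package and hypothetical audio/visual feature sizes).

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class TextAudioVisionEncoders(nn.Module):
    """BERT for text; LSTMs for the sequential audio and visual feature streams."""
    def __init__(self, audio_dim=74, visual_dim=35, hidden=128):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.audio_lstm = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.visual_lstm = nn.LSTM(visual_dim, hidden, batch_first=True)

    def forward(self, input_ids, attention_mask, audio, visual):
        # [CLS] token as the sentence-level text representation.
        text_repr = self.bert(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state[:, 0]
        _, (a, _) = self.audio_lstm(audio)    # final hidden state summarises the sequence
        _, (v, _) = self.visual_lstm(visual)
        return text_repr, a[-1], v[-1]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = TextAudioVisionEncoders()
batch = tokenizer(["the movie was great"], return_tensors="pt")
t, a, v = enc(batch["input_ids"], batch["attention_mask"],
              torch.randn(1, 50, 74), torch.randn(1, 50, 35))
print(t.shape, a.shape, v.shape)  # (1, 768) (1, 128) (1, 128)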
DFF-ATMF does not consider the visual modality, resulting in relatively low performance metrics. Visual information can provide additional cues about human expressions, poses, scenes, and so on, which can both enhance and complement the information of the text and speech modalities. Therefore, visual modality information deserves to be considered and explored in multimodal sentiment analysis.

6 Challenges and Future Scope
With the development of deep learning, multimodal sentiment analysis techniques have also developed rapidly [62–65]. However, multimodal sentiment analysis still faces many challenges. This section analyzes the current state of research, the challenges, and future developments in multimodal sentiment analysis.

6.1 Dataset

In multimodal sentiment analysis, the dataset plays a crucial role. Currently, a large dataset covering multiple languages is missing. Given the diversity of languages and ethnicities in many countries, a large, diverse dataset could be used to train a multimodal sentiment analysis model with strong generalization and wide applicability. Additionally, current multimodal datasets still have low annotation accuracy and have not yet reached truly continuous label values, requiring researchers to label multimodal datasets more finely. Most current multimodal data contain only the visual, speech, and text modalities and lack modal information combined with physiological signals such as brain waves and pulse.

6.2 Detection of Hidden Emotions

There has always been a recognized difficulty in multimodal sentiment analysis tasks: the analysis of hidden emotions. Hidden emotions [66, 67] include sarcastic emotions (such as sarcastic words), emotions that must be analyzed concretely in context, and mixed emotions [68, 69] (such as a person feeling both happiness and sadness). It is important to explore these hidden emotions; they mark the gap between human and artificial intelligence [70].

6.3 Multiple forms of video data

In multimodal sentiment analysis tasks, video data is particularly challenging. Although in many datasets the speaker faces the camera and the video resolution is high, real-world situations are more complicated and require models that are robust to noise and applicable to low-resolution video. Capturing the micro-expressions and micro-gestures of speakers for sentiment analysis is also an area worth exploring.

6.4 Multiform language data

The form of text data in multimodal sentiment analysis tasks is typically uniform. However, evaluation texts in online communities are often cross-lingual, with reviewers using multiple languages to make more vivid comments. Text data with mixed emotions also remains a challenge for multimodal sentiment analysis tasks. Making good use of memes mixed into the text is an important research topic, as memes often carry extremely strong emotional messages from reviewers. Additionally, most text data is transcribed directly from speech, making it particularly difficult to analyze a person's emotions when multiple people are talking. Combined with the cultural characteristics of different regions and countries, the same text data may reflect different emotions.

6.5 Future Prospects

The future of multimodal sentiment analysis techniques is extremely bright, and some future applications are listed below: multimodal emotion analysis for real-time assessment of mental health [71–73]; multimodal criminal linguistic deception detection [74]; offensive language detection; and human-like emotion-aware robots. Multimodal emotion analysis is a technique for recognizing and analyzing emotions, and models that combine multimodal information for sentiment analysis can effectively improve the accuracy of sentiment analysis. In the future, multimodal sentiment analysis techniques will be gradually improved. Perhaps one day there will be a multimodal sentiment analysis model with a large number of parameters that matches human sentiment analysis capabilities; that would be a delightful prospect.

7 Conclusion

Multimodal sentiment analysis techniques have been recognized as important by researchers in various fields, making them a central research topic in natural language processing and computer vision. In this review, we provide a detailed description of various aspects of multimodal sentiment analysis, including its research background, definition, and development process. We also summarize commonly used benchmark datasets in Table 1 and compare and analyze recent state-of-the-art multimodal sentiment analysis models. Finally, we present the challenges posed by the field of multimodal sentiment analysis and explore possible future developments.

Many prospective works are being actively carried out and have even been largely implemented. However, there are still challenges to be addressed, leading to the following meaningful research directions:
(1) Construct a large multimodal sentiment dataset in multiple languages.
(2) Solve the domain transfer problem of video, text, and speech modal data.
(3) Build a unified, large-scale multimodal sentiment analysis model with excellent generalization performance.
(4) Reduce model parameters, optimize algorithms, and reduce algorithmic complexity.
(5) Solve the multilingual code-mixing problem in multimodal sentiment analysis.
(6) Discuss the weighting problem of modal fusion and provide a reasonable scheme for assigning weights to different modalities in different cases.
(7) Discuss the correlation between modalities and separate their shared and private information to improve model performance and interpretability.
(8) Construct a multimodal sentiment analysis model that can handle hidden emotions well.

Declarations

Availability of data and materials

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Funding

This work was supported in part by the Joint Fund for Smart Computing of the Shandong Natural Science Foundation under Grant ZR2020LZH013; the open project of the State Key Laboratory of Computer Architecture under Grant CARCHA202001; the Major Scientific and Technological Innovation Projects in Shandong Province under Grants 2021CXG010506 and 2022CXG010504; and the "New University 20 Items" Funding Project of Jinan under Grants 2021GXRC108 and 2021GXRC024.

Acknowledgments

Not applicable.

References

1. Julien Deonna and Fabrice Teroni. The emotions: A philosophical introduction. Routledge, 2012.
2. Clayton Hutto and Eric Gilbert. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Proceedings of the International AAAI Conference on Web and Social Media, volume 8, pages 216–225, 2014.
3. Soo-Min Kim and Eduard Hovy. Determining the sentiment of opinions. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, pages 1367–1373, 2004.
4. Erik Cambria, Björn Schuller, Yunqing Xia, and Catherine Havasi. New avenues in opinion mining and sentiment analysis. IEEE Intelligent Systems, 28(2):15–21, 2013.
5. Arshi Parvaiz, Muhammad Anwaar Khalid, Rukhsana Zafar, Huma Ameer, Muhammad Ali, and Muhammad Moazam Fraz. Vision transformers in medical computer vision—a contemplative retrospection. Engineering Applications of Artificial Intelligence, 122:106126, 2023.
6. Bo Zhang, Jun Zhu, and Hang Su. Toward the third generation artificial intelligence. Science China Information Sciences, 66(2):1–19, 2023.
7. Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023.
8. Jireh Yi-Le Chan, Khean Thye Bea, Steven Mun Hong Leow, Seuk Wai Phoong, and Wai Khuen Cheng. State of the art: a review of sentiment analysis based on sequential transfer learning. Artificial Intelligence Review, 56(1):749–780, 2023.
9. Mayur Wankhade, Annavarapu Chandra Sekhara Rao, and Chaitanya Kulkarni. A survey on sentiment analysis methods, applications, and challenges. Artificial Intelligence Review, 55(7):5731–5780, 2022.
10. Hui Li, Qi Chen, Zhaoman Zhong, Rongrong Gong, and Guokai Han. E-word of mouth sentiment analysis for user behavior studies. Information Processing & Management, 59(1):102784, 2022.
11. Ashima Yadav and Dinesh Kumar Vishwakarma. Sentiment analysis using deep learning architectures: a review. Artificial Intelligence Review, 53(6):4335–4385, 2020.
12. Ganesh Chandrasekaran, Tu N Nguyen, and Jude Hemanth D. Multimodal sentimental analysis for social media applications: A comprehensive review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 11(5):e1415, 2021.
13. Bernhard Kratzwald, Suzana Ilic, Mathias Kraus, Stefan Feuerriegel, and Helmut Prendinger. Decision support with text-based emotion recognition: Deep learning for affective computing. arXiv preprint arXiv:1803.06397, 2018.
14. Carlo Strapparava and Rada Mihalcea. SemEval-2007 task 14: Affective text. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 70–74, 2007.
15. Yang Li, Quan Pan, Suhang Wang, Tao Yang, and Erik Cambria. A generative model for category text generation. Information Sciences, 450:301–315, 2018.
16. Rong Dai. Facial expression recognition method based on facial physiological features and deep learning. Journal of Chongqing University of Technology (Natural Science), 34(6):146–153, 2020.
17. Zhu Ren, Jia Jia, Quan Guo, Kuo Zhang, and Lianhong Cai. Acoustics, content and geo-information based sentiment prediction from large-scale networked voice data. In 2014 IEEE International Conference on Multimedia and Expo (ICME), pages 1–4. IEEE, 2014.
18. Liu Jiming, Zhang Peixiang, Liu Ying, Zhang Weidong, and Fang Jie. Summary of multi-modal sentiment analysis technology. Journal of Frontiers of Computer Science & Technology, 15(7):1165, 2021.
19. Feiran Huang, Xiaoming Zhang, Zhonghua Zhao, Jie Xu, and Zhoujun Li. Image–text sentiment analysis via deep multimodal attentive fusion. Knowledge-Based Systems, 167:26–37, 2019.
20. Akshi Kumar and Geetanjali Garg. Sentiment analysis of multimodal twitter data. Multimedia Tools and Applications, 78:24103–24119, 2019.
21. Ankita Gandhi, Kinjal Adhvaryu, and Vidhi Khanduja. Multimodal sentiment analysis: review, application domains and future directions. In 2021 IEEE Pune Section International Conference (PuneCon), pages 1–5. IEEE, 2021.
22. Vaibhav Rupapara, Furqan Rustam, Hina Fatima Shahzad, Arif Mehmood, Imran Ashraf, and Gyu Sang Choi. Impact of SMOTE on imbalanced text features for toxic comments classification using RVVC model. IEEE Access, 9:78621–78634, 2021.
23. Jia Li, Ziyang Zhang, Junjie Lang, Yueqi Jiang, Liuwei An, Peng Zou, Yangyang Xu, Sheng Gao, Jie Lin, Chunxiao Fan, et al. Hybrid multimodal feature extraction, mining and fusion for sentiment analysis. In Proceedings of the 3rd International on Multimodal Sentiment Analysis Workshop and Challenge, pages 81–88, 2022.
24. Anna Favaro, Chelsie Motley, Tianyu Cao, Miguel Iglesias, Ankur Butala, Esther S Oh, Robert D Stevens, Jesús Villalba, Najim Dehak, and Laureano Moro-Velázquez. A multi-modal array of interpretable features to evaluate language and speech patterns in different neurological disorders. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 532–539. IEEE, 2023.
25. Soujanya Poria, Erik Cambria, Rajiv Bajpai, and Amir Hussain. A review of affective computing: From unimodal analysis to multimodal fusion. Information Fusion, 37:98–125, 2017.
26. Sathyan Munirathinam. Industry 4.0: Industrial internet of things (IIoT). In Advances in Computers, volume 117, pages 129–164. Elsevier, 2020.
27. Esteban Ortiz-Ospina and Max Roser. The rise of social media. Our World in Data, 2023.
28. Abdul Haseeb, Enjun Xia, Shah Saud, Ashfaq Ahmad, and Hamid Khurshid. Does information and communication technologies improve environmental quality in the era of globalization? An empirical analysis. Environmental Science and Pollution Research, 26:8594–8608, 2019.
29. Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42:335–359, 2008.
30. Sander Koelstra, Christian Muhl, Mohammad Soleymani, Jong-Seok Lee, Ashkan Yazdani, Touradj Ebrahimi, Thierry Pun, Anton Nijholt, and Ioannis Patras. DEAP: A database for emotion analysis using physiological signals. IEEE Transactions on Affective Computing, 3(1):18–31, 2011.
31. Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. MOSI: Multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259, 2016.
32. AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246, 2018.
33. Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. MELD: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508, 2018.
34. Nan Xu, Wenji Mao, and Guandan Chen. Multi-interactive memory network for aspect based multimodal sentiment analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 371–378, 2019.
35. Wenmeng Yu, Hua Xu, Fanyang Meng, Yilin Zhu, Yixiao Ma, Jiele Wu, Jiyun Zou, and Kaicheng Yang. CH-SIMS: A Chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3718–3727, 2020.
36. Amir Zadeh, Yan Sheng Cao, Simon Hessner, Paul Pu Liang, Soujanya Poria, and Louis-Philippe Morency. CMU-MOSEAS: A multimodal language dataset for Spanish, Portuguese, German and French. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 2020, page 1801. NIH Public Access, 2020.
37. Shreyash Mishra, S Suryavardan, Amrit Bhaskar, Parul Chopra, Aishwarya Reganti, Parth Patwa, Amitava Das, Tanmoy Chakraborty, Amit Sheth, Asif Ekbal, et al. Factify: A multi-modal fact verification dataset. In Proceedings of the First Workshop on Multimodal Fact-Checking and Hate Speech Detection (DE-FACTIFY), 2022.
38. Sathyanarayanan Ramamoorthy, Nethra Gunti, Shreyash Mishra, S Suryavardan, Aishwarya Reganti, Parth Patwa, Amitava Das, Tanmoy Chakraborty, Amit Sheth, Asif Ekbal, et al. Memotion 2: Dataset on sentiment and emotion analysis of memes. In Proceedings of De-Factify: Workshop on Multimodal Fact Checking and Hate Speech Detection, CEUR, 2022.
39. Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. Towards multimodal sentiment analysis: Harvesting opinions from the web. In Proceedings of the 13th International Conference on Multimodal Interfaces, pages 169–176, 2011.
40. Paul Pu Liang, Ziyin Liu, Amir Zadeh, and Louis-Philippe Morency. Multimodal language analysis with recurrent multistage fusion. arXiv preprint arXiv:1808.03920, 2018.
41. Yansen Wang, Ying Shen, Zhun Liu, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. Words can shift: Dynamically adjusting word representations using nonverbal behaviors. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7216–7223, 2019.
42. Sijie Mai, Haifeng Hu, and Songlong Xing. Divide, conquer and combine: Hierarchical feature fusion network with local and global perspectives for multimodal affective computing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 481–492, 2019.
43. Hai Pham, Paul Pu Liang, Thomas Manzini, Louis-Philippe Morency, and Barnabás Póczos. Found in translation: Learning robust joint representations by cyclic translations between modalities. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6892–6899, 2019.
44. Soujanya Poria, Erik Cambria, and Alexander Gelbukh. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2539–2544, 2015.
45. Shamane Siriwardhana, Andrew Reis, Rivindu Weerasekera, and Suranga Nanayakkara. Jointly fine-tuning "BERT-like" self supervised models to improve multimodal speech emotion recognition. arXiv preprint arXiv:2008.06682, 2020.
46. Behnaz Nojavanasghari, Deepak Gopinath, Jayanth Koushik, Tadas Baltrušaitis, and Louis-Philippe Morency. Deep multimodal fusion for persuasiveness prediction. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, pages 284–288, 2016.
47. Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency, and Eric P Xing. Select-additive learning: Improving generalization in multimodal sentiment analysis. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pages 949–954. IEEE, 2017.
48. Hongliang Yu, Liangke Gui, Michael Madaio, Amy Ogan, Justine Cassell, and Louis-Philippe Morency. Temporally selective attention model for social and affective state recognition in multimedia content. In Proceedings of the 25th ACM International Conference on Multimedia, pages 1743–1751, 2017.
49. Nan Xu and Wenji Mao. MultiSentiNet: A deep semantic network for multimodal sentiment analysis. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pages 2399–2402, 2017.
50. Feiyang Chen, Ziqian Luo, Yanyan Xu, and Dengfeng Ke. Complementary fusion of multi-features and multi-modalities in sentiment analysis. arXiv preprint arXiv:1904.08138, 2019.
51. Jie Xu, Zhoujun Li, Feiran Huang, Chaozhuo Li, and S Yu Philip. Social image sentiment analysis by exploiting multimodal content and heterogeneous relations. IEEE Transactions on Industrial Informatics, 17(4):2974–2982, 2020.
52. Weidong Wu, Yabo Wang, Shuning Xu, and Kaibo Yan. SFNN: Semantic features fusion neural network for multimodal sentiment analysis. In 2020 5th International Conference on Automation, Control and Robotics Engineering (CACRE), pages 661–665. IEEE, 2020.
53. Devamanyu Hazarika, Roger Zimmermann, and Soujanya Poria. MISA: Modality-invariant and -specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia, pages 1122–1131, 2020.
54. Wasifur Rahman, Md Kamrul Hasan, Sangwu Lee, Amir Zadeh, Chengfeng Mao, Louis-Philippe Morency, and Ehsan Hoque. Integrating multimodal information in large pretrained transformers. In Proceedings of the Conference. Association for Computational Linguistics. Meeting, volume 2020, page 2359. NIH Public Access, 2020.
55. Jianguo Sun, Hanqi Yin, Ye Tian, Junpeng Wu, Linshan Shen, and Lei Chen. Two-level multimodal fusion for sentiment analysis in public security. Security and Communication Networks, 2021:1–10, 2021.
56. Vasco Lopes, António Gaspar, Luís A Alexandre, and João Cordeiro. An AutoML-based approach to multimodal image sentiment analysis. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–9. IEEE, 2021.
57. Wenmeng Yu, Hua Xu, Ziqi Yuan, and Jiele Wu. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10790–10797, 2021.
58. Jing He, Haonan Yanga, Changfan Zhang, Hongrun Chen, and Yifu Xua. Dynamic invariant-specific representation fusion network for multimodal sentiment analysis. Computational Intelligence and Neuroscience, 2022, 2022.
59. Fan Wang, Shengwei Tian, Long Yu, Jing Liu, Junwen Wang, Kun Li, and Yongtao Wang. TEDT: Transformer-based encoding–decoding translation network for multimodal sentiment analysis. Cognitive Computation, 15(1):289–303, 2023.
60. Di Wang, Xutong Guo, Yumin Tian, Jinhui Liu, LiHuo He, and Xuemei Luo. TETFN: A text enhanced transformer fusion network for multimodal sentiment analysis. Pattern Recognition, 136:109259, 2023.
61. Songning Lai, Xifeng Hu, Yulong Li, Zhaoxia Ren, Zhi Liu, and Danmin Miao. Shared and private information learning in multimodal sentiment analysis with deep modal alignment and self-supervised multi-task learning. arXiv preprint arXiv:2305.08473, 2023.
62. Mahesh G Huddar, Sanjeev S Sannakki, and Vijay S Rajpurohit. A survey of computational approaches and challenges in multimodal sentiment analysis. Int. J. Comput. Sci. Eng., 7(1):876–883, 2019.
63. Ramandeep Kaur and Sandeep Kautish. Multimodal sentiment analysis: A survey and comparison. Research Anthology on Implementing Sentiment Analysis Across Multiple Disciplines, pages 1846–1870, 2022.
64. Lukas Stappen, Alice Baird, Lea Schumann, and Björn Schuller. The multimodal sentiment analysis in car reviews (MuSe-CaR) dataset: Collection, insights and improvements. IEEE Transactions on Affective Computing, 2021.
65. Anurag Illendula and Amit Sheth. Multimodal emotion classification. In Companion Proceedings of the 2019 World Wide Web Conference, pages 439–449, 2019.
66. Donglei Tang, Zhikai Zhang, Yulan He, Chao Lin, and Deyu Zhou. Hidden topic–emotion transition model for multi-level social emotion detection. Knowledge-Based Systems, 164:426–435, 2019.
67. Petr Hajek, Aliaksandr Barushka, and Michal Munk. Fake consumer review detection using deep neural networks integrating word embeddings and emotion mining. Neural Computing and Applications, 32:17259–17274, 2020.
68. Soonil Kwon. A CNN-assisted enhanced audio signal processing for speech emotion recognition. Sensors, 20(1):183, 2019.
69. Umar Rashid, Muhammad Waseem Iqbal, Muhammad Akmal Skiandar, Muhammad Qasim Raiz, Muhammad Raza Naqvi, and Syed Khuram Shahzad. Emotion detection of contextual text using deep learning. In 2020 4th International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), pages 1–5. IEEE, 2020.
70. Fereshteh Ghanbari-Adivi and Mohammad Mosleh. Text emotion detection in social networks using a novel ensemble classifier based on Parzen tree estimator (TPE). Neural Computing and Applications, 31(12):8971–8983, 2019.
71. Zhentao Xu, Verónica Pérez-Rosas, and Rada Mihalcea. Inferring social media users' mental health status from multimodal information. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 6292–6299, 2020.
72. Rahee Walambe, Pranav Nayak, Ashmit Bhardwaj, and Ketan Kotecha. Employing multimodal machine learning for stress detection. Journal of Healthcare Engineering, 2021:1–12, 2021.
73. Nujud Aloshban, Anna Esposito, and Alessandro Vinciarelli. What you say or how you say it? Depression detection through joint modeling of linguistic and acoustic aspects of speech. Cognitive Computation, 14(5):1585–1598, 2022.
74. Safa Chebbi and Sofia Ben Jebara. Deception detection using multimodal fusion approaches. Multimedia Tools and Applications, pages 1–30, 2021.