2022.NAACL-CLMLF
Abstract
Compared with unimodal data, multimodal data can provide more features to help the model analyze the sentiment of data. Previous research works rarely consider token-level feature fusion, and few works explore learning the common features related to sentiment in multimodal data to help the model fuse multimodal features. In this paper, we propose a Contrastive Learning and Multi-Layer Fusion (CLMLF) method for multimodal sentiment detection. Specifically, we first encode text and image to obtain hidden representations, and then use a multi-layer fusion module to align and fuse the token-level features of text and image. In addition to the sentiment analysis task, we also design two contrastive learning tasks, label based contrastive learning and data based contrastive learning, which help the model learn common features related to sentiment in multimodal data. Extensive experiments conducted on three publicly available multimodal datasets demonstrate the effectiveness of our approach for multimodal sentiment detection compared with existing methods. The code is available at https://github.com/Link-Li/CLMLF

Figure 1: Examples of multimodal sentiment tweets. (a) "Heathrow. Fly early tomorrow morning." (positive) (b) "Blue Jays game with the fam! Let's go!" (positive) (c) "Ridge Avenue is closed after a partial building collapse and electrical fire Saturday night." (negative) (d) "Flexible spinal cord implants will let paralyzed people walk." (neutral)

1 Introduction

With the development of social networking platforms, which have become the main venue for people to share their personal opinions, how to extract and analyze sentiments in social media data efficiently and correctly has broad applications. Therefore, it has attracted attention from both the academic and industrial communities (Zhang et al., 2018a; Yue et al., 2019). At the same time, with the increasing use of the mobile internet and smartphones, more and more users are willing to post multimodal data (e.g., text, image, and video) about different topics to convey their feelings and sentiments. As a result, multimodal sentiment analysis has become a popular research topic (Kaur and Kautish, 2019).

As for multimodal data, the complementarity between text and image can help the model analyze the real sentiment of the multimodal data. As shown in Figure 1, detecting sentiment with only the text modality or only the image modality may miss the true intention of the tweet. For example, in Figure 1a, if we only analyze the text modality, we find a declarative sentence that does not express sentiment; in fact, the girl's smile in the image shows that the sentiment of this tweet is positive. Similarly, in Figure 1c, the ruins in the image deepen the expression of negative sentiment in the text.

For multimodal sentiment analysis, we focus on text-image sentiment analysis in social media data. In existing works, some models try to concatenate different modal feature vectors to fuse the multimodal features, such as MultiSentiNet (Xu and Mao, 2017) and HSAN (Xu, 2017). Kumar and Vepa (2020) propose to use a gating mechanism and attention to obtain deep multimodal contextual feature vectors.
Multi-view Attentional Network (MVAN), proposed by Yang et al. (2020), introduces memory networks to realize the interaction between modalities. Although the above-mentioned models perform better than unimodal models, the inputs of different modalities lie in different vector spaces. Therefore, it is difficult to fuse multimodal data with a simple concatenation strategy, so the improvement is also limited. Furthermore, the gating mechanism and the memory network are essentially not designed for multimodal fusion. Although they can help the model analyze the sentiment in multimodal data by storing and filtering the features in the data, these methods have difficulty aligning and fusing the features of text and image. Since Transformers have achieved great success in many fields, such as natural language processing and computer vision (Lin et al., 2021; Khan et al., 2021), we propose a Multi-Layer Fusion (MLF) module based on the Transformer-Encoder. It benefits from the multi-headed self-attention in the Transformer, which can capture the internal correlation of data vectors. Therefore, text tokens and image patches with explicit and implicit relationships are allocated higher attention weights to each other, which means the MLF module can better align and fuse the token-level text and image features. Moreover, MLF is a multi-layer encoder, which helps improve the abstraction ability of the model and obtain deep features in multimodal data.

Some previous work has explored the application of contrastive learning in the multimodal field. Huang et al. (2021) proposes the application of contrastive learning in multilingual text-to-video search, and Yuan et al. (2021) applies contrastive learning to learn visual representations that embrace multimodal data. However, there is little work studying the application of contrastive learning in multimodal sentiment analysis, so we propose two contrastive learning tasks, Label Based Contrastive Learning (LBCL) and Data Based Contrastive Learning (DBCL), which help the model learn common features related to sentiment in multimodal data. For example, as shown in Figure 1a and Figure 1b, both tweets show positive sentiment, and there are smiling expressions in the images of the two tweets, which is a common feature of these tweets. If the model can learn such common features related to sentiment, it will greatly improve the performance of the model.

In this paper, we propose a Contrastive Learning and Multi-Layer Fusion (CLMLF) method for multimodal sentiment analysis based on text and image modalities. For evaluation, CLMLF is verified on three multimodal sentiment datasets, namely MVSA-Single, MVSA-Multiple (Niu et al., 2016) and HFM (Cai et al., 2019). CLMLF achieves better performance compared to several baseline models on all three datasets. Through a comprehensive set of ablation experiments, a case study, and visualizations, we demonstrate the advantages of CLMLF for multimodal fusion.1 Our main contributions are summarized as follows:

• We propose a multi-layer fusion module based on the Transformer-Encoder, whose multi-headed self-attention helps align and fuse token-level features of text and image; the depth of MLF further improves the model's abstraction ability. Experiments show that the proposed architecture of MLF is simple but effective.

• We propose two contrastive learning tasks based on label and data, which leverage sentiment label features and data augmentation. These two contrastive learning tasks help the model learn common features related to sentiment in multimodal data, which improves the performance of the model.

1 There are also experimental results and analysis of CLMLF on the aspect based multimodal sentiment analysis task, which can be found in Appendix B.

2 Approach

2.1 Overview

In this section, we introduce CLMLF. Figure 2 illustrates the overall architecture of the CLMLF model for multimodal sentiment detection, which consists of two modules: a multi-layer fusion module and a multi-task learning module. Specifically, the multi-layer fusion module, shown on the right of Figure 2, includes a text-image encoder, an image Transformer layer, and a text-image Transformer fusion layer. The multi-task learning module, shown on the left of Figure 2, includes three tasks: sentiment classification, label based contrastive learning, and data based contrastive learning.

2.2 Multi-Layer Fusion Module

We use the Multi-Layer Fusion module to align and fuse the token-level features of text and image.
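To make the token-level fusion described here concrete, the following is a minimal PyTorch sketch, not the authors' released implementation: an image Transformer layer encodes projected ResNet patch features, and a text-image Transformer fusion layer runs self-attention over the concatenated text and image tokens. The hidden size, number of heads, and layer counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiLayerFusion(nn.Module):
    """Sketch of token-level text-image fusion with stacked Transformer encoders."""

    def __init__(self, hidden=768, heads=8, img_dim=2048,
                 n_img_layers=2, n_fusion_layers=3):
        super().__init__()
        # project ResNet patch features to the text hidden size
        self.img_proj = nn.Linear(img_dim, hidden)
        img_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
        self.img_encoder = nn.TransformerEncoder(img_layer, num_layers=n_img_layers)
        fusion_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads, batch_first=True)
        self.fusion_encoder = nn.TransformerEncoder(fusion_layer, num_layers=n_fusion_layers)

    def forward(self, text_tokens, image_patches):
        # text_tokens:   (B, S, hidden), e.g. BERT outputs [t_C, t_1, ..., t_S]
        # image_patches: (B, 49, img_dim), e.g. a flattened 7x7 ResNet feature map
        img_tokens = self.img_encoder(self.img_proj(image_patches))  # (B, 49, hidden)
        fused = torch.cat([text_tokens, img_tokens], dim=1)          # token-level concatenation
        return self.fusion_encoder(fused)                            # aligned and fused features

# usage sketch with random tensors
mlf = MultiLayerFusion()
out = mlf(torch.randn(2, 32, 768), torch.randn(2, 49, 2048))  # shape: (2, 81, 768)
```

The multi-headed self-attention inside the fusion encoder is what lets related text tokens and image patches attend to each other, which is the alignment behaviour visualized in Section 4.5.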
Figure 2: The overall architecture of CLMLF. Left: the multi-task learning module with the SC, LBCL, and DBCL tasks. Right: the MLF module, in which BERT text tokens [t_C, t_1, ..., t_S] and ResNet image tokens [i_1, ..., i_49] pass through the text-image Transformer fusion layer to produce fused features [f_C, f_1, ..., f_S, f_{1+N}, ..., f_{49+N}], followed by attention, fully connected layers, and softmax. Data augmentation maps, e.g., "Was fully committed to my health today" to "Today I devote myself to my health".
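The multi-task learning module combines the sentiment classification (SC) loss with the LBCL and DBCL losses, weighted as in the overall objective in Eq. (11) below. The exact contrastive formulations and the augmentation pipeline are not reproduced in this excerpt, so the sketch below reflects assumptions: LBCL is written as a supervised contrastive loss over samples that share a sentiment label, and DBCL treats a sample and its augmented view as a positive pair.

```python
import torch
import torch.nn.functional as F

def label_based_contrastive_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss where samples sharing a sentiment label are positives
    (a sketch in the spirit of LBCL, not the paper's exact formulation)."""
    z = F.normalize(features, dim=1)                        # (B, d)
    sim = z @ z.t() / temperature                           # (B, B) similarity matrix
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1))  # positives share a label
    pos_mask.fill_diagonal_(False)
    logits = sim - torch.eye(len(z), device=z.device) * 1e9  # drop self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    return -(log_prob * pos_mask).sum(dim=1).div(pos_counts).mean()

def total_loss(logits, labels, fused_feat, aug_feat, lambda_lbcl=1.0, lambda_dbcl=1.0):
    loss_sc = F.cross_entropy(logits, labels)                     # sentiment classification
    loss_lbcl = label_based_contrastive_loss(fused_feat, labels)  # label based contrastive
    # data based contrastive: a sample and its augmented view form a positive pair (assumption)
    instance_ids = torch.arange(len(fused_feat), device=fused_feat.device).repeat(2)
    loss_dbcl = label_based_contrastive_loss(torch.cat([fused_feat, aug_feat]), instance_ids)
    return loss_sc + lambda_lbcl * loss_lbcl + lambda_dbcl * loss_dbcl

# usage sketch with random placeholder tensors
B, C, d = 8, 3, 768
loss = total_loss(torch.randn(B, C), torch.randint(0, C, (B,)),
                  torch.randn(B, d), torch.randn(B, d))
```

Here lambda_lbcl and lambda_dbcl play the role of the balancing coefficients in Eq. (11).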
$\mathcal{L} = \mathcal{L}_{sc} + \lambda_{lbcl}\mathcal{L}_{lbcl} + \lambda_{dbcl}\mathcal{L}_{dbcl}$  (11)

where $\lambda_{lbcl}$ and $\lambda_{dbcl}$ are coefficients that balance the different training losses.

3 Experimental Setup

3.1 Dataset

We demonstrate the effectiveness of our method on three public datasets: MVSA-Single, MVSA-Multiple2 (Niu et al., 2016) and HFM3 (Cai et al., 2019). All of them collect data from Twitter, and each text-image pair is labeled with a single sentiment. For a fair comparison, we process the original two MVSA datasets in the same way as Xu and Mao (2017); for HFM, we adopt the same data preprocessing as Cai et al. (2019). We randomly split the MVSA datasets into train, validation, and test sets with a split ratio of 8:1:1. The statistics of these datasets are given in Table 2, and more detailed statistics are given in Appendix A.

3.2 Implementation Details

For the experiments of CLMLF, we use PyTorch4 and HuggingFace Transformers5 (Wolf et al., 2020) as the implementation of the baselines and our model.

2 http://mcrlab.net/research/mvsa-sentiment-analysis-on-multi-view-social-data/
3 https://github.com/headacheboy/data-of-multimodal-sarcasm-detection
4 https://pytorch.org/
5 https://github.com/huggingface/transformers
6 https://huggingface.co/bert-base-uncased
7 https://pytorch.org/vision/stable/models.html

Unimodal Baselines: For the text modality, CNN (Kim, 2014) and Bi-LSTM (Zhou et al., 2016) are well-known models for text classification tasks, TGNN (Huang et al., 2019) is a text-level graph neural network for text classification, and BERT (Devlin et al., 2019) is a pre-trained model for text that we fine-tune on the text only. For the image modality, OSDA (Yang et al., 2020) is an image sentiment analysis model based on multiple views, and ResNet (He et al., 2015) is pre-trained and fine-tuned on the image only.

Multimodal Baselines: MultiSentiNet (Xu and Mao, 2017) is a deep semantic network with attention for multimodal sentiment analysis. HSAN (Xu, 2017) is a hierarchical semantic attentional network based on image captions for multimodal sentiment analysis. Co-MN-Hop6 (Xu et al., 2018) is a co-memory network for iteratively modeling the interactions between multiple modalities. MGNNS (Yang et al., 2021) is a multi-channel graph neural network with sentiment-awareness for image-text sentiment detection. Schifanella et al. (2016) concatenates different feature vectors of different modalities as the multimodal feature representation: Concat(2) means concatenating text features and image features, while Concat(3) additionally includes image attribute features. MMSD (Cai et al., 2019) fuses text, image, and image attributes with a multimodal hierarchical fusion model. Xu et al. (2020) proposes the D&R Net, which fuses text, image, and image attributes by constructing a Decomposition and Relation Network.
| Modality | Model | MVSA-Single Acc | MVSA-Single F1 | MVSA-Multiple Acc | MVSA-Multiple F1 | Model | HFM Acc | HFM F1 |
|---|---|---|---|---|---|---|---|---|
| Text | CNN | 0.6819 | 0.5590 | 0.6564 | 0.5766 | CNN | 0.8003 | 0.7532 |
| Text | BiLSTM | 0.7012 | 0.6506 | 0.6790 | 0.6790 | BiLSTM | 0.8190 | 0.7753 |
| Text | BERT | 0.7111 | 0.6970 | 0.6759 | 0.6624 | BERT | 0.8389 | 0.8326 |
| Text | TGNN | 0.7034 | 0.6594 | 0.6967 | 0.6180 | – | – | – |
| Image | ResNet-50 | 0.6467 | 0.6155 | 0.6188 | 0.6098 | ResNet-50 | 0.7277 | 0.7138 |
| Image | OSDA | 0.6675 | 0.6651 | 0.6662 | 0.6623 | ResNet-101 | 0.7248 | 0.7122 |
| Multimodal | MultiSentiNet | 0.6984 | 0.6984 | 0.6886 | 0.6811 | Concat(2) | 0.8103 | 0.7799 |
| Multimodal | HSAN | 0.6988 | 0.6690 | 0.6796 | 0.6776 | Concat(3) | 0.8174 | 0.7874 |
| Multimodal | Co-MN-Hop6 | 0.7051 | 0.7001 | 0.6892 | 0.6883 | MMSD | 0.8344 | 0.8018 |
| Multimodal | MGNNS | 0.7377 | 0.7270 | 0.7249 | 0.6934 | D&R Net | 0.8402 | 0.8060 |
| Multimodal | CLMLF | 0.7533 | 0.7346 | 0.7200 | 0.6983 | CLMLF | 0.8543 | 0.8487 |

Table 1: Experimental results of different models on the MVSA-Single, MVSA-Multiple and HFM datasets.
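The Acc and F1 columns above can be computed with scikit-learn (an assumed tooling choice; the authors' evaluation script is not shown here). Section 4.1 specifies Weighted-F1 for the MVSA datasets and Macro-F1 for HFM:

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred, dataset):
    # Weighted-F1 for the MVSA datasets and Macro-F1 for HFM, as stated in Section 4.1
    average = "macro" if dataset == "HFM" else "weighted"
    return {"acc": accuracy_score(y_true, y_pred),
            "f1": f1_score(y_true, y_pred, average=average)}

# usage sketch with toy labels
print(evaluate([0, 1, 2, 1], [0, 1, 1, 1], dataset="MVSA-Single"))
```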
| Dataset | Train | Val | Test | Total |
|---|---|---|---|---|
| MVSA-S | 3611 | 450 | 450 | 4511 |
| MVSA-M | 13624 | 1700 | 1700 | 17024 |
| HFM | 19816 | 2410 | 2409 | 24635 |

Table 2: Statistics of the three datasets.

4 Results and Analysis

4.1 Overall Result

Table 1 illustrates the performance comparison of our CLMLF model with the baseline methods. We use Weighted-F1 and Acc as the evaluation metrics for MVSA-Single and MVSA-Multiple, following Yang et al. (2021), and Macro-F1 and Acc as the evaluation metrics for HFM. We have the following observations. First of all, our model is competitive with the other strong baseline models on the three datasets. Second, the multimodal models perform better than the unimodal models on all three datasets. What is more, we found that sentiment analysis on the image modality gets the worst results; this may be because the sentiment features in the image are too sparse and noisy, which makes it difficult for the model to obtain effective features for sentiment analysis. At last, for simple tasks, the performance improvement of multimodal models is limited. For example, on the HFM dataset, the improvement of CLMLF relative to BERT is smaller than on the MVSA-Single dataset, because HFM is a binary classification task while MVSA-Single is a three-class classification task.

We also try to apply CLMLF to the aspect based multimodal sentiment analysis task; details can be found in Appendix B.

4.2 Ablation

We further evaluate the influence of the multi-layer fusion module, label based contrastive learning, and data based contrastive learning. The evaluation results are listed in Table 3. The results show that the full CLMLF model achieves the best performance among all the variants. We can see that the multi-layer fusion module improves the performance, which shows that it can fuse the multimodal data. On this foundation, adding the label and data based contrastive learning improves the model performance further, which means contrastive learning leads the model to learn common features related to sentiment and pushes data with different sentiments away from each other in the feature space.

4.3 Influence of MLF Layer

We explored the effects of different numbers of Transformer-Encoder layers on the results. As shown in Figure 3a, we fix the image Transformer layer and vary the text-image Transformer fusion layer from 1 to 6. As shown in Figure 3b, we fix the text-image Transformer fusion layer and vary the image Transformer layer from 1 to 3. Finally, we selected the combinations 3-2 (which means three layers of the text-image Transformer fusion layer and two layers of the image Transformer layer), 4-2, and 5-1 for the three datasets. This also shows that the contribution of text and images differs across datasets. It can be seen from Table 1 that CLMLF gains more from the text than from the images on the HFM dataset. Therefore, in the MLF module, there are more Transformer layers related to text than to images.
| Model | MVSA-Single Acc | MVSA-Single F1 | MVSA-Multiple Acc | MVSA-Multiple F1 | HFM Acc | HFM F1 |
|---|---|---|---|---|---|---|
| BERT | 0.7111 | 0.6970 | 0.6759 | 0.6624 | 0.8389 | 0.8326 |
| ResNet-50 | 0.6467 | 0.6155 | 0.6188 | 0.6098 | 0.7277 | 0.7138 |
| +MLF | 0.7111 | 0.7101 | 0.7059 | 0.6849 | 0.8414 | 0.8355 |
| +MLF, LBCL | 0.7378 | 0.7291 | 0.7112 | 0.6863 | 0.8489 | 0.8446 |
| +MLF, DBCL | 0.7356 | 0.7276 | 0.7153 | 0.6832 | 0.8468 | 0.8422 |
| CLMLF | 0.7533 | 0.7346 | 0.7200 | 0.6983 | 0.8543 | 0.8487 |

Table 3: Ablation results on the three datasets.
Figure 3: Experimental results with different numbers of layers in the multi-layer fusion module, with one curve per dataset (MVSA-S, MVSA-M, HFM): (a) the text-image Transformer fusion layer, (b) the image Transformer layer. The solid line indicates accuracy and the dotted line indicates F1; the x-axis represents the number of Transformer layers.

Figure 4: Examples misclassified by BERT and correctly classified by CLMLF. "Why are you feeling despondent? Take the quiz:" (CLMLF: positive, BERT: neutral); "Thx for taking me to get cheap slushies ?" (CLMLF: positive, BERT: negative); "Car rolls over to avoid real estate sign on Burlington Skyway." (CLMLF: negative, BERT: neutral).

4.4 Case Study

To further demonstrate the effectiveness of our model, we give a case study comparing the sentiment labels predicted by CLMLF and by BERT. As shown in Figure 4, if we only consider the sentiment of the text, it is difficult to correctly obtain the user's sentiment tendency. For example, for the first example in Figure 4, the text refers to the image, and the image expresses a positive meaning. For the second example, if we only observe the text, we find that it may express negative sentiment; if we add the image, we find that it is just a joke and actually expresses positive sentiment.

4.5 Visualization

Attention Visualization: We visualize the attention weights of the first head of the Transformer-Encoder in the last layer of the Multi-Layer Fusion module. The result of the attention visualization is shown in Figure 5. We can see that for a given keyword, the model can find the target in the image very well and give it more attention weight. This shows that the model aligns the words in the text with the patch areas of the image at the token level, which plays an important role in fusing text and image features. In particular, for Figure 5b, although the "lady" only shows half of her face in the figure, the model still aligns the text and the image very accurately. These results indicate that the model aligns the text and image features at the token level according to our assumptions.

Figure 5: Attention visualization of some multimodal sentiment data examples. (a) The fishing is a little slow but the flowers are vibrant and beautiful. (b) Kimmy, you're one blessed lady! (c) Martha said for Valentine's Day she wanted a heart shaped pancake for lunch. (d) It is truly a hilarious, light-hearted read that is a treasure on anyone's bookshelf.

Cluster Visualization: In order to verify that our proposed contrastive learning tasks can help the model learn common features related to sentiment in multimodal data, we conducted a visualization experiment on the MVSA-Single dataset. The data feature vector of the last layer of the model is visualized by dimensionality reduction: we use the t-SNE algorithm to obtain a 2-dimensional feature vector and visualize it, as shown in Figure 6. Figure 6a is the visualization of the [CLS] representation of the BERT-base model, and Figure 6b shows the visualization of the fused output of the CLMLF model. From the figure, we can see that after adding contrastive learning, the distance between positive and negative sentiment in the vector space is greater, and the data clusters are more compact. This shows that the model distinguishes these data in the vector space according to common features shared by data with the same sentiment. The number of neutral sentiment examples is relatively small; even so, among the visualization results of the two models, CLMLF clearly gathers the neutral data together rather than scattering it in the vector space as BERT does. All of this indicates that adding contrastive learning helps the model learn common features related to sentiment, which improves the performance of the model.
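A minimal sketch of this kind of cluster visualization, assuming the fused feature vectors and their sentiment labels have already been collected into NumPy arrays; the feature extraction itself is omitted and the label ordering is an assumption:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_clusters(features, labels, path="clusters.png"):
    # features: (N, d) fused representations; labels: (N,) sentiment ids
    points = TSNE(n_components=2, random_state=0).fit_transform(features)
    for sid, name in enumerate(["negative", "neutral", "positive"]):  # assumed label order
        mask = labels == sid
        plt.scatter(points[mask, 0], points[mask, 1], s=4, label=name)
    plt.legend()
    plt.savefig(path, dpi=200)

# usage sketch with random placeholder features
plot_clusters(np.random.randn(300, 768).astype(np.float32),
              np.random.randint(0, 3, size=300))
```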
Figure 6: Cluster visualization of MVSA-Single. (a) BERT. (b) With contrastive learning.

5 Related Work

5.1 Multimodal Sentiment Analysis

In recent years, deep learning models have achieved promising results for multimodal sentiment analysis. MultiSentiNet (Xu and Mao, 2017) and HSAN (Xu, 2017) use LSTM and CNN to encode texts and images to get hidden representations, and then concatenate the text and image hidden representations to fuse multimodal features. CoMN (Xu et al., 2018) uses a co-memory network to iteratively model the interactions between visual content and textual words for multimodal sentiment analysis. Yu et al. (2019) proposes an aspect sensitive attention and fusion network to effectively model the intra-modality interactions, including aspect-text and aspect-image alignments, and the inter-modality interactions. MVAN (Yang et al., 2020) applies interactive learning of text and image features through an attention memory network module, and its multimodal feature fusion module is constructed using a multi-layer perceptron and a stacking-pooling module. Yang et al. (2021) uses multi-channel graph neural networks with sentiment-awareness, built on the global characteristics of the dataset, for multimodal sentiment analysis.

5.2 Contrastive Learning

Self-supervised learning has attracted many researchers for its soaring performance on representation learning in the last several years (Liu et al., 2021; Jing and Tian, 2020; Jaiswal et al., 2021). Many models based on contrastive learning have been proposed in both the natural language processing and computer vision fields. ConSERT (Yan et al., 2021), SimCSE (Gao et al., 2021), and CLEAR (Wu et al., 2020) proposed applications of contrastive learning in natural language processing. MoCo (He et al., 2020), SimCLR (Chen et al., 2020), SimSiam (Chen and He, 2021), and CLIP (Radford et al., 2021) proposed applications of contrastive learning in computer vision, and they have also achieved good results in zero-shot and few-shot learning. Recently, contrastive learning has been more and more widely used in the multimodal field. Huang et al. (2021) uses intra-modal, inter-modal, and cross-lingual contrastive learning, which significantly improves the performance of video search. Yuan et al. (2021) exploits intrinsic data properties within each modality and semantic information from cross-modal correlation simultaneously, hence improving the quality of learned visual representations.

Compared with the above works, we focus on how to align and fuse token-level features and learn the common features related to sentiment to further improve the performance of the model.
6 Conclusion and Future Work

In this paper, we propose a contrastive learning and multi-layer fusion method for multimodal sentiment detection. Compared with previous works, our proposed MLF module performs multimodal feature fusion at the fine-grained token level, which is more conducive to fusing the local features of text and image. At the same time, we design learning tasks based on contrastive learning to help the model learn sentiment-related features in the multimodal data and improve its ability to extract and fuse multimodal features. The experimental results on public datasets demonstrate that our proposed model is competitive with strong baseline models. In particular, the visualizations provide intuitive interpretations that verify the proposed contrastive learning tasks and multi-layer fusion module. In future work, we will incorporate other modalities such as audio into the sentiment detection task.

References

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500, Brussels, Belgium. Association for Computational Linguistics.
into the sentiment detection task. Computational Linguistics.