Journal of Computer Applications, 2025, Vol. 45, Issue 7: 2237-2244. DOI: 10.11772/j.issn.1001-9081.2024060886
• Artificial intelligence •

Multimodal sentiment analysis model with cross-modal text information enhancement

Yihan WANG, Chong LU, Zhongyuan CHEN
Received: 2024-07-01
Revised: 2024-09-18
Accepted: 2024-09-18
Online: 2025-07-10
Published: 2025-07-10
Contact: Chong LU
About author: WANG Yihan, born in 2000 in Zibo, Shandong, M.S. candidate, CCF student member. His research interests include data analysis and artificial intelligence.
Yihan WANG, Chong LU, Zhongyuan CHEN. Multimodal sentiment analysis model with cross-modal text information enhancement[J]. Journal of Computer Applications, 2025, 45(7): 2237-2244.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2024060886
Tab. 1 Dataset partition in CMU-MOSI and CMU-MOSEI

| Dataset | Training set | Validation set | Test set | Total |
|---|---|---|---|---|
| CMU-MOSI | 1 284 | 229 | 686 | 2 199 |
| CMU-MOSEI | 16 326 | 1 871 | 4 659 | 22 856 |
Tab. 2 Hyperparameter settings

| Hyperparameter | Meaning | CMU-MOSI | CMU-MOSEI |
|---|---|---|---|
| batch_size | Number of samples per training batch | 32 | 32 |
| learning_rate of BERT | Learning rate controlling the update speed of BERT parameters | 3.00×10⁻⁵ | 5.00×10⁻⁶ |
| a_lstm_hidden_size | Hidden-layer dimension of the audio feature extraction LSTM | 16 | 32 |
| v_lstm_hidden_size | Hidden-layer dimension of the video feature extraction LSTM | 64 | 64 |
| attn_dropout | Attention dropout rate to prevent overfitting | 0 | 0.1 |
| num_heads | Number of attention heads, letting the model attend to different aspects of the input | 5 | 5 |
| kernel_size of Conv1D | Kernel size of the 1D convolutional layer | 1 | 1 |
| early_stop | Training stops if performance does not improve within this number of epochs | 8 | 8 |
| dst_feature_dims | Feature size after projection to a unified dimension | 50 | 50 |
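For readers reproducing these settings, Tab. 2 maps directly onto a flat configuration object. The sketch below is our own Python illustration; the key names mirror the table but are not taken from the authors' code.

```python
# Hypothetical configuration dicts mirroring Tab. 2 (key names are ours,
# not the authors' code). Values differ only where the table differs.
CONFIGS = {
    "CMU-MOSI": {
        "batch_size": 32,                # samples per training batch
        "bert_learning_rate": 3.00e-5,   # update speed of BERT parameters
        "a_lstm_hidden_size": 16,        # audio LSTM hidden dimension
        "v_lstm_hidden_size": 64,        # video LSTM hidden dimension
        "attn_dropout": 0.0,             # attention dropout (overfitting control)
        "num_heads": 5,                  # attention heads
        "conv1d_kernel_size": 1,         # kernel size of the Conv1D projection
        "early_stop": 8,                 # patience in epochs
        "dst_feature_dims": 50,          # unified projected feature size
    },
    "CMU-MOSEI": {
        "batch_size": 32,
        "bert_learning_rate": 5.00e-6,
        "a_lstm_hidden_size": 32,
        "v_lstm_hidden_size": 64,
        "attn_dropout": 0.1,
        "num_heads": 5,
        "conv1d_kernel_size": 1,
        "early_stop": 8,
        "dst_feature_dims": 50,
    },
}
```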
Tab. 3 Comparison results of different models on CMU-MOSI and CMU-MOSEI datasets
(the first group of metric columns reports CMU-MOSI results, the second group CMU-MOSEI results)

| Model | MAE | Corr | Acc-2/% | F1/% | Data state | MAE | Corr | Acc-2/% | F1/% | Data state |
|---|---|---|---|---|---|---|---|---|---|---|
| TFN (B)* | 0.901 | 0.698 | —/80.20 | —/80.70 | Unaligned | 0.593 | 0.700 | —/82.50 | —/82.10 | Unaligned |
| LMF (B)* | 0.917 | 0.695 | —/82.50 | —/82.40 | Unaligned | 0.623 | 0.677 | —/82.00 | —/82.10 | Unaligned |
| RAVEN* | 0.915 | 0.691 | 78.00/— | 76.60/— | Aligned | 0.614 | 0.662 | 79.10/— | 79.50/— | Aligned |
| MulT (B)* | 0.861 | 0.711 | 81.50/84.10 | 80.60/83.90 | Aligned | 0.580 | 0.703 | —/82.50 | —/82.30 | Aligned |
| ICCN | 0.862 | 0.714 | —/83.00 | —/83.00 | Unaligned | 0.655 | 0.713 | —/84.20 | —/84.20 | Unaligned |
| MAG-BERT (B)** | 0.712 | 0.796 | 84.20/86.10 | 84.10/86.00 | Aligned | — | — | 84.70/— | 84.50/— | Aligned |
| Self-MM (B)* | 0.713 | 0.798 | 84.00/85.98 | 84.42/85.95 | Unaligned | 0.530 | 0.765 | 82.82/85.17 | 82.53/85.30 | Unaligned |
| MISA (B)* | 0.783 | 0.761 | 81.80/83.40 | 81.70/83.60 | Aligned | 0.555 | 0.756 | 83.60/85.50 | 83.30/85.30 | Aligned |
| TETFN | 0.717 | 0.800 | 84.05/86.10 | 83.83/86.07 | Unaligned | 0.551 | 0.748 | 84.25/85.18 | 84.18/85.27 | Unaligned |
| MSAM-CTE | 0.698 | 0.801 | 84.11/85.52 | 84.12/85.57 | Unaligned | 0.530 | 0.761 | 84.74/85.50 | 84.63/85.14 | Unaligned |
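The metrics in Tab. 3 follow the usual CMU-MOSI/MOSEI evaluation protocol: MAE and Pearson correlation on the continuous sentiment score in [-3, 3], plus binary accuracy (Acc-2) and F1. The "a/b" cells appear to follow the common convention of reporting two binarizations: negative vs. non-negative (zero labels included) and negative vs. positive (zero labels excluded). A minimal sketch under that assumption, not the authors' evaluation script:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score

def mosi_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Standard CMU-MOSI/MOSEI regression metrics (labels in [-3, 3]).

    Acc-2/F1 are computed two ways, matching the assumed "a/b" format:
    (a) negative vs. non-negative (zero labels included);
    (b) negative vs. positive (zero labels excluded).
    """
    mae = float(np.mean(np.abs(y_true - y_pred)))    # MAE: lower is better
    corr, _ = pearsonr(y_true, y_pred)               # Corr: higher is better

    # (a) zeros counted as non-negative
    acc2_a = accuracy_score(y_true >= 0, y_pred >= 0)
    f1_a = f1_score(y_true >= 0, y_pred >= 0, average="weighted")

    # (b) zero-label samples dropped
    nz = y_true != 0
    acc2_b = accuracy_score(y_true[nz] > 0, y_pred[nz] > 0)
    f1_b = f1_score(y_true[nz] > 0, y_pred[nz] > 0, average="weighted")

    return {"MAE": mae, "Corr": corr,
            "Acc-2": (acc2_a, acc2_b), "F1": (f1_a, f1_b)}
```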
Tab. 4 Results of ablation study on CMU-MOSI dataset

| Model | MAE | Corr | Acc-2/% | F1/% |
|---|---|---|---|---|
| MSAM-CTE | 0.698 | 0.801 | 84.11/85.52 | 84.12/85.57 |
| MSAM-CTE w/o Bi-LSTM | 0.736 | 0.791 | 81.34/82.62 | 81.34/82.67 |
| MSAM-CTE w/o TET | 0.735 | 0.788 | 82.22/83.38 | 82.24/83.46 |
| MSAM-CTE w/o CA | 0.730 | 0.789 | 82.65/83.38 | 82.67/83.44 |
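Removing any of the three components degrades every metric; dropping the Bi-LSTM hurts most, raising MAE from 0.698 to 0.736 and costing about 2.8-2.9 percentage points of Acc-2, while removing TET or CA causes slightly smaller drops.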
Tab. 5 Case information

| Case | Text | Video | Audio | Ground-truth label |
|---|---|---|---|---|
| Case 1 | Whole movies very boring | Raises hand, closes eyes | Speaker's volume fluctuates between high and low | -2.0 |
| Case 2 | Because I truly love an action flick action comedy flick even better right | Widens eyes, shakes head | Speaker gradually raises volume | 1.6 |
Tab. 6 Comparison of predicted label values

| Model | Case 1 | Case 2 |
|---|---|---|
| MSAM-CTE | -2.09 | 1.93 |
| TETFN | -2.18 | 2.27 |
| Self-MM | -1.73 | 2.48 |
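Read against the ground-truth labels in Tab. 5 (-2.0 for Case 1, 1.6 for Case 2), the absolute errors are 0.09 and 0.33 for MSAM-CTE, 0.18 and 0.67 for TETFN, and 0.27 and 0.88 for Self-MM, so MSAM-CTE's predictions are the closest on both cases.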
[1] | LUO J H, ZHU Y. Multi-dynamic aware network for unaligned multimodal language sequence sentiment analysis [J]. Journal of Computer Applications, 2024, 44(1): 79-85. |
[2] | LIU Y, LIU L, GUO Y, et al. Learning visual and textual representations for multimodal matching and classification [J]. Pattern Recognition, 2018, 84: 51-67. |
[3] | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st Advances in Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010. |
[4] | TANG J, LI K, JIN X, et al. CTFN: hierarchical learning for multimodal sentiment analysis using coupled-translation fusion network [C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Stroudsburg: ACL, 2021: 5301-5311. |
[5] | TSAI Y H H, BAI S, LIANG P P, et al. Multimodal transformer for unaligned multimodal language sequences [C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2019: 6558-6569. |
[6] | HAZARIKA D, ZIMMERMANN R, PORIA S. MISA: modality-invariant and-specific representations for multimodal sentiment analysis [C]// Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 1122-1131. |
[7] | YU W, XU H, YUAN Z, et al. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis [C]// Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2021: 10790-10797. |
[8] | WANG D, GUO X, TIAN Y, et al. TETFN: a text enhanced transformer fusion network for multimodal sentiment analysis [J]. Pattern Recognition, 2023, 136: No.109259. |
[9] | HUANG Z, XU W, YU K. Bidirectional LSTM-CRF models for sequence tagging [EB/OL]. [2024-09-04]. |
[10] | ZHANG Y, JIN R, ZHOU Z H. Understanding bag-of-words model: a statistical framework [J]. International Journal of Machine Learning and Cybernetics, 2010, 1(1/2/3/4): 43-52. |
[11] | LI B, LIU T, ZHAO Z, et al. Neural bag-of-ngrams [C]// Proceedings of the 31st AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2017: 3067-3074. |
[12] | CHEN P H, LIN C J, SCHÖLKOPF B. A tutorial on ν-support vector machines [J]. Applied Stochastic Models in Business and Industry, 2005, 21(2): 111-136. |
[13] | RISH I. An empirical study of the naive Bayes classifier [EB/OL]. [2024-09-01]. |
[14] | PHILLIPS S J. A brief tutorial on MaxEnt [EB/OL]. [2024-09-01]. |
[15] | ALBAWI S, MOHAMMED T A, AL-ZAWI S. Understanding of a convolutional neural network [C]// Proceedings of the 2017 International Conference on Engineering and Technology. Piscataway: IEEE, 2017: 1-6. |
[16] | HOCHREITER S, SCHMIDHUBER J. Long short-term memory [J]. Neural Computation, 1997, 9(8): 1735-1780. |
[17] | DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding [C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long and Short Papers). Stroudsburg: ACL, 2019: 4171-4186. |
[18] | LIU Y, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach [EB/OL]. [2023-08-01]. |
[19] | PORIA S, CHATURVEDI I, CAMBRIA E, et al. Convolutional MKL based multimodal emotion recognition and sentiment analysis [C]// Proceedings of the IEEE 16th International Conference on Data Mining. Piscataway: IEEE, 2016: 439-448. |
[20] | ZADEH A, LIANG P P, MAZUMDER N, et al. Memory fusion network for multi-view sequential learning [C]// Proceedings of the 32nd AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2018: 5634-5641. |
[21] | KAMPMAN O, BAREZI E J, BERTERO D, et al. Investigating audio, visual, and text fusion methods for end-to-end automatic personality prediction [C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Stroudsburg: ACL, 2018: 606-611. |
[22] | RAHMAN W, HASAN M K, LEE S, et al. Integrating multimodal information in large pretrained Transformers [C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2020: 2359-2369. |
[23] | WANG Y, SHEN Y, LIU Z, et al. Words can shift: dynamically adjusting word representations using nonverbal behaviors [C]// Proceedings of the 33rd AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2019: 7216-7223. |
[24] | HAN W, CHEN H, GELBUKH A, et al. Bi-bimodal modality fusion for correlation-controlled multimodal sentiment analysis [C]// Proceedings of the 2021 International Conference on Multimodal Interaction. New York: ACM, 2021: 6-15. |
[25] | PENNINGTON J, SOCHER R, MANNING C D. GloVe: global vectors for word representation [C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2014: 1532-1543. |
[26] | YANG J, YU Y, NIU D, et al. ConFEDE: contrastive feature decomposition for multimodal sentiment analysis [C]// Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL, 2023: 7617-7630. |
[27] | LIAN H, LU C, LI S, et al. A survey of deep learning-based multimodal emotion recognition: speech, text, and face [J]. Entropy, 2023, 25(10): No.1440. |
[28] | LAI S, HU X, XU H, et al. Multimodal sentiment analysis: a survey [J]. Displays, 2023, 74: No.102563. |
[29] | ZADEH A, ZELLERS R, PINCUS E, et al. MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos [EB/OL]. [2023-09-04]. |
[30] | ZADEH A B, LIANG P P, PORIA S, et al. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph [C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL, 2018: 2236-2246. |
[31] | ZADEH A, CHEN M, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis [C]// Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2017: 1103-1114. |
[32] | LIU Z, SHEN Y, LAKSHMINARASIMHAN V B, et al. Efficient low-rank multimodal fusion with modality-specific factors [C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL, 2018: 2247-2256. |
[33] | SUN Z, SARMA P K, SETHARES W A, et al. Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis [C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2020: 8992-8999. |