Dual-channel multimodal sentiment analysis model based on contrast invariance and reinforcement specificity

doi:10.11772/j.issn.1001-9081.2025060731

Journal of Computer Applications, 2026, Vol. 46, Issue (6): 1767-1775. DOI: 10.11772/j.issn.1001-9081.2025060731

• Artificial intelligence •

Yunping HE, Leichun WANG, Ruirui SONG, Xiangfeng LU, Jinxiang WEI, Xiaomeng LIU

School of Computer Science, Hubei University, Wuhan Hubei 430062, China

Received: 2025-07-02 Revised: 2025-08-25 Accepted: 2025-08-28 Online: 2025-09-05 Published: 2026-06-10
Contact: Leichun WANG
About author: HE Yunping, born in 2000, M.S. candidate. His research interests include multimodal sentiment analysis and long time series prediction.
SONG Ruirui, born in 1999, M.S. candidate. Her research interests include long time series prediction and multimodal data analysis.
LU Xiangfeng, born in 2000, M.S. candidate. Her research interests include multimodal data analysis and fake news detection.
WEI Jinxiang, born in 2000, M.S. candidate. Her research interests include deep learning and spatio-temporal data prediction.
LIU Xiaomeng, born in 2001, M.S. candidate. Her research interests include deep learning and multimodal data analysis.
First author contact: WANG Leichun, born in 1974, Ph.D., associate professor. His research interests include deep learning and big data analysis.
Supported by:
National Natural Science Foundation of China (62106069); National Social Science Foundation of China (24BTQ019)

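About author: HE Yunping, born in 2000, male, from Jingzhou, Hubei, M.S. candidate, CCF member. Main research interests: multimodal sentiment analysis, long time series prediction.
SONG Ruirui, born in 1999, female, from Zaozhuang, Shandong, M.S. candidate. Main research interests: long time series prediction, multimodal data analysis.
LU Xiangfeng, born in 2000, female, from Linyi, Shandong, M.S. candidate. Main research interests: multimodal data analysis, fake news detection.
WEI Jinxiang, born in 2000, female, from Fuyang, Anhui, M.S. candidate. Main research interests: deep learning, spatio-temporal data prediction.
LIU Xiaomeng, born in 2001, female, from Zaozhuang, Shandong, M.S. candidate. Main research interests: deep learning, multimodal data analysis.
First contact: WANG Leichun, born in 1974, male, from Wuhan, Hubei, associate professor, Ph.D. Main research interests: deep learning, big data analysis.
Supported by:
National Natural Science Foundation of China (62106069); National Natural Science Foundation of China (62102136); National Social Science Foundation of China (24BTQ019)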

Abstract:

Existing Multimodal Sentiment Analysis (MSA) methods often produce inaccurate results because of modal heterogeneity and insufficient internal interaction. To address this problem, a dual-channel MSA model based on Contrast Invariance and Reinforcement Specificity (CIRS) was proposed. Firstly, features were extracted from text, video and audio data and aligned in dimension. Secondly, the invariant features of the modalities were contrasted for consistency, and mutual learning of invariant features between modalities was enhanced through homogeneous graph distillation, so as to improve the representation consistency of the modalities. Thirdly, the specific features of the modalities were strengthened, and knowledge of specific features was transferred between modalities through heterogeneous graph distillation, so as to achieve semantic spatial alignment between modalities. Finally, the invariant features and specific features were deeply fused and used for prediction through the self-attention mechanism and the cross-modal attention mechanism. Experimental results show that, compared with DLF (Disentangled-Language-Focused multimodal sentiment analysis), CIRS reduces the Mean Absolute Error (MAE) by 4.11% and improves both the 2-Class Accuracy (Acc-2) and the F1-score by 1.29% on the CMU-MOSI (Carnegie Mellon University Multimodal Opinion Sentiment Intensity) dataset; on the CMU-MOSEI (Carnegie Mellon University Multimodal Opinion Sentiment and Emotion Intensity) dataset, CIRS reduces the MAE by 1.85% and improves the Acc-2 and F1-score by 0.70% and 0.94%, respectively. These results verify that CIRS can effectively reduce errors and improve classification accuracy in multimodal sentiment analysis.

Key words: Multimodal Sentiment Analysis (MSA), modal heterogeneity, semantic spatial alignment, self-attention mechanism, cross-modal attention mechanism
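
The cross-modal attention mechanism named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it is plain scaled dot-product attention in which queries come from one modality and keys and values from another, with no learned projections, no multiple heads, and toy feature values that are purely hypothetical. The function names `softmax` and `cross_modal_attention` are assumptions introduced for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Scaled dot-product attention across two modalities.

    queries: feature vectors of the target modality (e.g. text steps).
    keys, values: feature vectors of the source modality (e.g. audio steps).
    All vectors are assumed to be already aligned to the same dimension.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this query step to every source step.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the source values.
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


# Hypothetical toy features: 3 text steps and 2 audio steps, dimension 2.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, -1.0]]
fused = cross_modal_attention(text, audio, audio)
```

Each row of `fused` is a convex combination of the audio vectors, so the output keeps the text sequence length while carrying audio information. Self-attention is the special case in which `queries`, `keys` and `values` all come from the same modality.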

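Abstract (Chinese version, translated):

To address the problem that existing Multimodal Sentiment Analysis (MSA) methods often produce inaccurate sentiment analysis results because of modal heterogeneity and insufficient internal interaction, a dual-channel MSA model based on Contrast Invariance and Reinforcement Specificity (CIRS) is proposed. Firstly, features are extracted from text, video and audio data and aligned in dimension. Secondly, the invariant features of the modalities are contrasted for consistency, and mutual learning of invariant features between modalities is enhanced through homogeneous graph distillation to improve the representation consistency of the modalities. Thirdly, the specific features of the modalities are strengthened, and heterogeneous graph distillation is used to transfer knowledge of specific features between modalities, achieving semantic spatial alignment between modalities. Finally, the invariant and specific features are deeply fused and used for prediction through the self-attention mechanism and the cross-modal attention mechanism. Experimental results show that, compared with DLF (Disentangled-Language-Focused multimodal sentiment analysis), CIRS reduces the Mean Absolute Error (MAE) by 4.11% and improves both the 2-Class Accuracy (Acc-2) and the F1-score by 1.29% on the CMU-MOSI (Carnegie Mellon University Multimodal Opinion Sentiment Intensity) dataset; on the CMU-MOSEI (Carnegie Mellon University Multimodal Opinion Sentiment and Emotion Intensity) dataset, the MAE is reduced by 1.85%, and the Acc-2 and F1-score are improved by 0.70% and 0.94%, respectively. These results verify that CIRS can effectively reduce errors and improve classification accuracy in multimodal sentiment analysis.

Key words (Chinese version, translated): multimodal sentiment analysis, modal heterogeneity, semantic spatial alignment, self-attention mechanism, cross-modal attention mechanism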

CLC Number: TP391.1

Yunping HE, Leichun WANG, Ruirui SONG, Xiangfeng LU, Jinxiang WEI, Xiaomeng LIU. Dual-channel multimodal sentiment analysis model based on contrast invariance and reinforcement specificity[J]. Journal of Computer Applications, 2026, 46(6): 1767-1775.

Parameter	Value
batch_size	16
early_stop	4
nlevels	4
learning_rate	0.0001
grad_clip	0.6
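
As a rough illustration of how the hyperparameters in the table might be used, the sketch below stores them in a configuration object and shows two of them in action. The parameter names and values come from the table; their interpretation is an assumption (`early_stop` as a patience in epochs, `nlevels` as a number of stacked attention layers, `grad_clip` as a maximum gradient norm), and the helpers `clip_by_norm` and `stop_epoch` are hypothetical, not part of the published model.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    batch_size: int = 16
    early_stop: int = 4           # assumed: epochs without improvement before stopping
    nlevels: int = 4              # assumed: number of stacked attention layers
    learning_rate: float = 1e-4
    grad_clip: float = 0.6        # assumed: maximum L2 norm of the gradient


def clip_by_norm(grads, max_norm):
    """Rescale a flat gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm == 0.0 or norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def stop_epoch(val_losses, patience):
    """Return the epoch index at which early stopping triggers, or None."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return None


cfg = TrainConfig()
clipped = clip_by_norm([3.0, 4.0], cfg.grad_clip)
stopped_at = stop_epoch([1.0, 0.9, 0.95, 0.96, 0.97, 0.98, 0.5], cfg.early_stop)
```

In this toy run the gradient of norm 5 is rescaled to norm 0.6, and training stops at the fourth consecutive epoch without a new best validation loss, before the later improvement is ever seen.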

Dataset	Model	MAE	Acc-2/%	F1-score/%	Acc-7/%
CMU-MOSI	MulT	0.87	83.0	82.8	40.0
CMU-MOSI	MISA	0.82	83.4	83.6	42.3
CMU-MOSI	CubeMLP	0.77	85.6	85.5	45.5
CMU-MOSI	LMF	0.92	82.5	82.4	33.2
CMU-MOSI	MFM	0.88	81.7	81.6	35.4
CMU-MOSI	ICCN	0.86	83.0	83.0	39.0
CMU-MOSI	MAG-BERT	0.73	84.4	84.6	43.6
CMU-MOSI	Self-MM	0.71	84.8	84.9	45.8
CMU-MOSI	ALMT	0.71	85.0	84.9	46.7
CMU-MOSI	ConFEDE	0.71	84.8	84.8	42.0
CMU-MOSI	DLF	0.73	85.1	85.0	47.1
CMU-MOSI	CIRS	0.70	86.2	86.1	47.8
CMU-MOSEI	MulT	0.58	82.5	82.3	51.8
CMU-MOSEI	MISA	0.56	83.4	83.6	42.3
CMU-MOSEI	CubeMLP	0.52	85.1	84.5	52.9
CMU-MOSEI	LMF	0.62	82.0	82.1	48.0
CMU-MOSEI	MFM	0.57	84.4	84.3	51.3
CMU-MOSEI	ICCN	0.57	84.2	84.2	51.6
CMU-MOSEI	MAG-BERT	0.54	84.8	84.7	52.7
CMU-MOSEI	Self-MM	0.53	85.0	85.0	53.5
CMU-MOSEI	ALMT	0.54	84.5	84.5	53.2
CMU-MOSEI	ConFEDE	0.55	84.8	84.6	53.0
CMU-MOSEI	DLF	0.54	85.4	85.3	53.9
CMU-MOSEI	CIRS	0.53	86.0	86.1	53.8
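
The percentages quoted in the abstract appear to be relative changes with respect to the DLF rows of this table rather than differences in percentage points. The short check below recomputes them from the tabulated values; the function name `relative_change` and the dictionary layout are assumptions made for this illustration.

```python
def relative_change(baseline, new):
    """Relative change of `new` with respect to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# DLF and CIRS rows copied from the table above.
results = {
    "CMU-MOSI": {
        "DLF": {"MAE": 0.73, "Acc-2": 85.1, "F1": 85.0},
        "CIRS": {"MAE": 0.70, "Acc-2": 86.2, "F1": 86.1},
    },
    "CMU-MOSEI": {
        "DLF": {"MAE": 0.54, "Acc-2": 85.4, "F1": 85.3},
        "CIRS": {"MAE": 0.53, "Acc-2": 86.0, "F1": 86.1},
    },
}

changes = {
    dataset: {
        metric: round(relative_change(rows["DLF"][metric], rows["CIRS"][metric]), 2)
        for metric in ("MAE", "Acc-2", "F1")
    }
    for dataset, rows in results.items()
}

for dataset, metrics in changes.items():
    print(dataset, metrics)
```

A negative value for MAE is a reduction in error; positive values for Acc-2 and F1 are improvements. For example, the Acc-2 gain on CMU-MOSI is 1.1 percentage points, which is 1.29% of the DLF value of 85.1.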

CMD	DE	FCE	CMU-MOSI MAE	CMU-MOSI Acc-2/%	CMU-MOSI F1-score/%	CMU-MOSI Acc-7/%	CMU-MOSEI MAE	CMU-MOSEI Acc-2/%	CMU-MOSEI F1-score/%	CMU-MOSEI Acc-7/%
×	×	×	0.85	82.6	82.1	42.9	0.57	82.2	82.8	52.1
√	×	×	0.80	84.7	84.7	45.5	0.54	84.5	84.5	53.4
×	√	×	0.79	84.6	84.5	45.8	0.55	84.4	84.4	53.5
×	×	√	0.80	84.5	84.6	44.7	0.57	84.3	84.3	53.0
√	√	×	0.75	85.3	85.3	46.8	0.54	85.5	85.6	53.6
√	×	√	0.72	85.6	85.6	46.5	0.54	85.5	85.5	53.4
×	√	√	0.74	85.2	85.2	46.3	0.55	85.1	85.1	53.6
√	√	√	0.70	86.2	86.1	47.8	0.53	86.0	86.1	53.8
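
The tables above report MAE, Acc-2, F1-score and Acc-7. The page does not define them, so the sketch below follows one common convention for sentiment intensity labels in the range [-3, 3], and every choice in it is an assumption: Acc-2 and F1 split samples into negative and non-negative, F1 is computed for the non-negative class only (a weighted F1 is also widely used), and Acc-7 rounds clipped scores to the nearest integer class. The predictions and labels are hypothetical toy values, not data from the paper.

```python
def mae(preds, labels):
    """Mean absolute error between predicted and true sentiment scores."""
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)


def acc2(preds, labels):
    """Binary accuracy, assuming a negative vs. non-negative split at 0."""
    hits = sum(1 for p, y in zip(preds, labels) if (p >= 0) == (y >= 0))
    return hits / len(labels)


def acc7(preds, labels):
    """Seven-class accuracy: clip to [-3, 3], round to the nearest integer."""
    def bucket(x):
        return round(max(-3.0, min(3.0, x)))
    hits = sum(1 for p, y in zip(preds, labels) if bucket(p) == bucket(y))
    return hits / len(labels)


def binary_f1(preds, labels):
    """F1-score of the non-negative class under the same split as acc2."""
    tp = sum(1 for p, y in zip(preds, labels) if p >= 0 and y >= 0)
    fp = sum(1 for p, y in zip(preds, labels) if p >= 0 and y < 0)
    fn = sum(1 for p, y in zip(preds, labels) if p < 0 and y >= 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy predictions and labels.
preds = [2.4, -1.2, 0.3, -0.4, 1.8]
labels = [3.0, -1.0, -0.6, -0.2, 1.0]
scores = {
    "MAE": mae(preds, labels),
    "Acc-2": acc2(preds, labels),
    "F1": binary_f1(preds, labels),
    "Acc-7": acc7(preds, labels),
}
```

Lower is better for MAE and higher is better for the other three, which is how the component combinations in the ablation table are compared.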

Related Articles

[1]	Qianfei WANG, Yang LI, Deyu LI, Suge WANG. Dual-channel feature fusion representation method for short-text clustering based on large language model [J]. Journal of Computer Applications, 2026, 46(5): 1441-1449.
[2]	Ruirui SONG, Leichun WANG, Yunping HE, Jinxiang WEI, Xiangfeng LU, Xiaomeng LIU. Long time series prediction based on hybrid self-attention and differentiated normalization [J]. Journal of Computer Applications, 2026, 46(5): 1499-1506.
[3]	Hu LUO, Mingshu ZHANG. Rumor detection method based on cross-modal attention mechanism and contrastive learning [J]. Journal of Computer Applications, 2026, 46(2): 361-367.
[4]	Xiang WANG, Zhixiang CHEN, Guojun MAO. Multivariate time series prediction method combining local and global correlation [J]. Journal of Computer Applications, 2025, 45(9): 2806-2816.
[5]	Xiaoqiang ZHAO, Yongyong LIU, Yongyong HUI, Kai LIU. Batch process quality prediction model using improved time-domain convolutional network with multi-head self-attention mechanism [J]. Journal of Computer Applications, 2025, 45(7): 2245-2252.
[6]	Chen LIANG, Yisen WANG, Qiang WEI, Jiang DU. Source code vulnerability detection method based on Transformer-GCN [J]. Journal of Computer Applications, 2025, 45(7): 2296-2303.
[7]	Yihan WANG, Chong LU, Zhongyuan CHEN. Multimodal sentiment analysis model with cross-modal text information enhancement [J]. Journal of Computer Applications, 2025, 45(7): 2237-2244.
[8]	Hui LI, Bingzhi JIA, Chenxi WANG, Ziyu DONG, Jilong LI, Zhaoman ZHONG, Yanyan CHEN. Generative adversarial network underwater image enhancement model based on Swin Transformer [J]. Journal of Computer Applications, 2025, 45(5): 1439-1446.
[9]	Kunyuan JIANG, Xiaoxia LI, Li WANG, Yaodan CAO, Xiaoqiang ZHANG, Nan DING, Yingyue ZHOU. Boundary-cross supervised semantic segmentation network with decoupled residual self-attention [J]. Journal of Computer Applications, 2025, 45(4): 1120-1129.
[10]	Jing QIN, Zhiguang QIN, Fali LI, Yueheng PENG. Diagnosis of major depressive disorder based on probabilistic sparse self-attention neural network [J]. Journal of Computer Applications, 2024, 44(9): 2970-2974.
[11]	Liting LI, Bei HUA, Ruozhou HE, Kuang XU. Multivariate time series prediction model based on decoupled attention mechanism [J]. Journal of Computer Applications, 2024, 44(9): 2732-2738.
[12]	Zexin XU, Lei YANG, Kangshun LI. Shorter long-sequence time series forecasting model [J]. Journal of Computer Applications, 2024, 44(6): 1824-1831.
[13]	Yue LIU, Fang LIU, Aoyun WU, Qiuyue CHAI, Tianxiao WANG. 3D object detection network based on self-attention mechanism and graph convolution [J]. Journal of Computer Applications, 2024, 44(6): 1972-1977.
[14]	Rong HUANG, Junjie SONG, Shubo ZHOU, Hao LIU. Image aesthetic quality evaluation method based on self-supervised vision Transformer [J]. Journal of Computer Applications, 2024, 44(4): 1269-1276.
[15]	Ziqi HUANG, Jianpeng HU. Entity category enhanced nested named entity recognition in automotive domain [J]. Journal of Computer Applications, 2024, 44(2): 377-384.