Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (7): 2237-2244. DOI: 10.11772/j.issn.1001-9081.2024060886

• Artificial Intelligence •

Multimodal sentiment analysis model with cross-modal text information enhancement

Yihan WANG, Chong LU, Zhongyuan CHEN

  1. School of Informatics and Management, Xinjiang University of Finance and Economics, Urumqi, Xinjiang 830046, China
  • Received: 2024-07-01  Revised: 2024-09-18  Accepted: 2024-09-18  Online: 2025-07-10  Published: 2025-07-10
  • Contact: Chong LU, e-mail: 498841300@qq.com
  • About author: WANG Yihan, born in 2000 in Zibo, Shandong, M. S. candidate, CCF student member. His research interests include data analysis and artificial intelligence.
    LU Chong, born in 1966 in Yangzhou, Jiangsu, Ph. D., professor. His research interests include artificial intelligence and image processing.
    CHEN Zhongyuan, born in 1999 in Qingdao, Shandong, M. S. candidate. His research interests include data analysis and artificial intelligence.
  • Supported by:
    National Natural Science Foundation of China (62166039)

Abstract:

In recent years, Multimodal Sentiment Analysis (MSA), which uses text, visual, and audio data to analyze speakers' emotions in videos, has garnered widespread attention. However, the contributions of different modalities to sentiment analysis vary significantly. Generally, the information contained in text is more intuitive, so it is particularly important to find a strategy for enhancing the role of text in sentiment analysis. To address this issue, a Multimodal Sentiment Analysis Model with Cross-modal Text-information Enhancement (MSAM-CTE) was proposed. Firstly, the pre-trained BERT (Bidirectional Encoder Representations from Transformers) model was employed to extract text features, and a Bi-directional Long Short-Term Memory (Bi-LSTM) network was used to further process the pre-processed audio and video features. Then, a text-based cross-attention mechanism was applied to integrate text information into emotion-related nonverbal representations, thereby learning text-oriented pairwise cross-modal mappings and obtaining effective unified multimodal representations. Finally, the fused features were used for sentiment analysis. Experimental results show that, compared with the best-performing baseline model, Text Enhanced Transformer Fusion Network (TETFN), MSAM-CTE reduced the Mean Absolute Error (MAE) by 2.6% and increased the Pearson Correlation coefficient (Corr) by 0.1% on the CMU-MOSI (Carnegie Mellon University Multimodal Opinion Sentiment Intensity) dataset, and by 3.8% and 1.7% respectively on the CMU-MOSEI (Carnegie Mellon University Multimodal Opinion Sentiment and Emotion Intensity) dataset, verifying the effectiveness of MSAM-CTE in sentiment analysis.
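
To make the pipeline described above concrete (BERT text encoding, Bi-LSTM processing of audio and video features, text-based cross-attention, and fusion for a regression head), the following is a minimal PyTorch sketch. It is not the authors' implementation: the class name MSAMCTESketch, the feature dimensions, the mean-pooling fusion, and the use of the nonverbal sequences as attention queries over the text are illustrative assumptions.

```python
# Hypothetical sketch of an MSAM-CTE-style forward pass; not the authors' code.
# Assumes pre-extracted audio/video feature sequences and a Hugging Face BERT encoder.
import torch
import torch.nn as nn
from transformers import BertModel

class MSAMCTESketch(nn.Module):
    def __init__(self, d_model=128, audio_dim=74, video_dim=35, n_heads=4):
        super().__init__()
        # Text branch: pre-trained BERT, with hidden states projected to d_model.
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.text_proj = nn.Linear(self.bert.config.hidden_size, d_model)
        # Audio/video branches: Bi-LSTMs over the pre-processed feature sequences.
        self.audio_lstm = nn.LSTM(audio_dim, d_model // 2, batch_first=True, bidirectional=True)
        self.video_lstm = nn.LSTM(video_dim, d_model // 2, batch_first=True, bidirectional=True)
        # Text-based cross-attention: each nonverbal stream attends to the text sequence.
        self.audio_text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.video_text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Fusion of the unified representation and regression to a sentiment score.
        self.regressor = nn.Sequential(
            nn.Linear(3 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
        )

    def forward(self, input_ids, attention_mask, audio, video):
        text = self.text_proj(
            self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        )                                      # (B, Tt, d_model)
        audio_h, _ = self.audio_lstm(audio)    # (B, Ta, d_model)
        video_h, _ = self.video_lstm(video)    # (B, Tv, d_model)
        # Inject text information into the emotion-related nonverbal representations.
        audio_t, _ = self.audio_text_attn(audio_h, text, text)
        video_t, _ = self.video_text_attn(video_h, text, text)
        # Pool each modality over time and fuse into a unified multimodal representation.
        fused = torch.cat(
            [text.mean(dim=1), audio_t.mean(dim=1), video_t.mean(dim=1)], dim=-1
        )
        return self.regressor(fused).squeeze(-1)
```

Here each nonverbal stream queries the text, which is one plausible reading of "text-oriented pairwise cross-modal mappings"; the paper's attention direction and fusion details may differ.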

Key words: Multimodal Sentiment Analysis (MSA), text information enhancement, cross-attention mechanism, Bi-directional Long Short-Term Memory (Bi-LSTM) network, cross-modal information fusion
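
The reported gains are in Mean Absolute Error and Pearson correlation over continuous sentiment scores (the CMU-MOSI/CMU-MOSEI labels lie in [-3, 3]). A generic NumPy sketch of how these two metrics are commonly computed, not code from the paper:

```python
# Generic helpers for the two reported metrics; illustrative, not from the paper.
import numpy as np

def mae_and_corr(y_pred, y_true):
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    mae = np.mean(np.abs(y_pred - y_true))        # Mean Absolute Error
    corr = np.corrcoef(y_pred, y_true)[0, 1]      # Pearson correlation coefficient
    return mae, corr

# Example with MOSI-style scores in [-3, 3].
print(mae_and_corr([1.2, -0.5, 2.8], [1.0, -1.0, 3.0]))
```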
