Journal of Computer Applications
WANG Yihan, LU Chong, CHEN Zhongyuan
Abstract: Multimodal Sentiment Analysis (MSA), which uses text, visual, and audio data to analyze speakers' emotions in videos, has garnered widespread attention. However, the contributions of different modalities to sentiment analysis vary significantly. Because the information carried by text is generally the most intuitive, a strategy for enhancing the role of text in sentiment analysis is particularly important. To address this issue, a Multimodal Sentiment Analysis Model with Cross-modal Text-information Enhancement (MSAM-CTE) was proposed. First, the BERT (Bidirectional Encoder Representation from Transformers) pre-trained model was employed to extract textual features, and a Bi-directional Long Short-Term Memory (Bi-LSTM) network was used to further process the preprocessed audio and video features. Then, a text-based cross-attention mechanism was applied to integrate text information into the emotion-related non-verbal representations, learning text-oriented pairwise cross-modal mappings to obtain effective unified multimodal representations. Finally, the fused features were used for sentiment analysis. Compared with the best-performing baseline, the Text Enhanced Transformer Fusion Network (TETFN), the proposed model improved the Mean Absolute Error (MAE) and the Pearson Correlation Coefficient (Corr) by approximately 2.65% and 0.12% on the benchmark dataset CMU-MOSI (Carnegie Mellon University Multimodal Opinion Sentiment Intensity), and by approximately 3.18% and 1.74% respectively on CMU-MOSEI (Carnegie Mellon University Multimodal Opinion Sentiment and Emotion Intensity), verifying the effectiveness of the MSAM-CTE model in sentiment analysis.
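The fusion pipeline described in the abstract (BERT text features, Bi-LSTM encoders for audio and video, and text-based cross-attention followed by a regression head) can be illustrated with a minimal PyTorch-style sketch. All module names, feature dimensions, and the specific choice of non-verbal queries attending to text keys/values are illustrative assumptions, not the authors' released implementation.

# Minimal sketch of the MSAM-CTE fusion idea; dimensions and module choices are assumptions.
import torch
import torch.nn as nn

class TextEnhancedFusion(nn.Module):
    def __init__(self, text_dim=768, audio_dim=74, video_dim=35, hidden=128, heads=4):
        super().__init__()
        # Bi-LSTM encoders further process the preprocessed audio and video features
        self.audio_lstm = nn.LSTM(audio_dim, hidden // 2, batch_first=True, bidirectional=True)
        self.video_lstm = nn.LSTM(video_dim, hidden // 2, batch_first=True, bidirectional=True)
        # Project BERT token features to the shared hidden size
        self.text_proj = nn.Linear(text_dim, hidden)
        # Text-based cross-attention: non-verbal queries attend to text keys/values,
        # injecting text information into the emotion-related non-verbal representations
        self.audio_text_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.video_text_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        # Regression head predicts a sentiment intensity score (as on CMU-MOSI/MOSEI)
        self.head = nn.Sequential(nn.Linear(hidden * 3, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, text_feats, audio_feats, video_feats):
        # text_feats: (B, Lt, text_dim) BERT outputs; audio/video: (B, La, audio_dim) / (B, Lv, video_dim)
        t = self.text_proj(text_feats)
        a, _ = self.audio_lstm(audio_feats)
        v, _ = self.video_lstm(video_feats)
        # Text-oriented pairwise cross-modal mappings
        a_enh, _ = self.audio_text_attn(a, t, t)
        v_enh, _ = self.video_text_attn(v, t, t)
        # Unified multimodal representation: pool each stream over time and concatenate
        fused = torch.cat([t.mean(1), a_enh.mean(1), v_enh.mean(1)], dim=-1)
        return self.head(fused)

# Example usage with random tensors of assumed CMU-MOSI-like feature sizes
model = TextEnhancedFusion()
score = model(torch.randn(2, 50, 768), torch.randn(2, 375, 74), torch.randn(2, 500, 35))
print(score.shape)  # torch.Size([2, 1])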
Key words: Multimodal Sentiment Analysis (MSA), text-information enhancement, cross-attention mechanism, Bi-directional Long Short-Term Memory (Bi-LSTM) network, cross-modal information fusion
CLC Number: TP18
WANG Yihan, LU Chong, CHEN Zhongyuan. Multimodal sentiment analysis model with cross-modal text-information enhancement[J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2024060886.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2024060886