
Multimodal sentiment analysis network with self-supervision and multi-layer cross attention

薛凯鹏,廖春节,徐涛   

  1. Institute of China National Information Technology, Northwest Minzu University
  • Received: 2023-09-06  Revised: 2023-11-16  Online: 2023-12-18  Published: 2023-12-18
  • Corresponding author: 徐涛
  • Supported by:
    Fundamental Research Funds for the Central Universities; Youth Science and Technology Plan of Gansu Province; Science and Technology Project of the National Archives Administration of China

Abstract: To address the problems of incomplete intra-modal information, poor inter-modal interaction, and difficult training in multimodal sentiment analysis tasks, a visual language pre-training (VLP) model was applied to multimodal sentiment analysis, and a multimodal sentiment analysis network fused with self-supervision and multi-layer cross attention (Multimodal EmotionNet Fused Self-Supervised Learning and Multi-Layer Cross, MESM) was proposed. The visual encoder module is strengthened by self-supervised learning, and multi-layer cross attention is added to better model textual and visual features, so that intra-modal information becomes richer and more complete and inter-modal interaction becomes more adequate. In addition, the fast, memory-efficient exact attention with IO-awareness (Flash Attention) is used to reduce the high complexity of attention computation in the Transformer. Compared with the mainstream TomBERT, CLIP, ViLT, and ViLBERT, MESM achieves the highest accuracy and recall on the processed MVSA dataset, 71.3% and 69.2% respectively, which demonstrates that the proposed method effectively improves the completeness of multimodal information fusion while reducing computation cost.

Key words: multimodal, sentiment analysis, self-supervision, attention mechanism, visual language pre-training model

CLC number:
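
A minimal sketch of the multi-layer cross-attention fusion described in the abstract, assuming PyTorch 2.x. The module and variable names (CrossModalLayer, text_feats, img_feats) and the layer count are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class CrossModalLayer(nn.Module):
    # One cross-attention layer: text queries attend to image features and vice versa.
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, text_feats, img_feats):
        # Queries come from one modality, keys/values from the other,
        # so each modality is enriched with information from its counterpart.
        t, _ = self.txt2img(text_feats, img_feats, img_feats, need_weights=False)
        v, _ = self.img2txt(img_feats, text_feats, text_feats, need_weights=False)
        return self.norm_t(text_feats + t), self.norm_v(img_feats + v)

# Stacking several such layers gives the "multi-layer" cross attention;
# a classifier head over the fused features would then predict sentiment.
layers = nn.ModuleList(CrossModalLayer() for _ in range(3))
text_feats = torch.randn(2, 32, 768)  # (batch, text tokens, dim) from a text encoder
img_feats = torch.randn(2, 49, 768)   # (batch, image patches, dim) from a visual encoder
for layer in layers:
    text_feats, img_feats = layer(text_feats, img_feats)

On the efficiency point, PyTorch 2.x exposes torch.nn.functional.scaled_dot_product_attention, which can dispatch to a FlashAttention kernel on supported GPUs; this is one way the IO-aware, memory-efficient exact attention mentioned in the abstract could be obtained.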