Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (8): 2387-2392. DOI: 10.11772/j.issn.1001-9081.2023081209

• Artificial Intelligence •

Multimodal sentiment analysis network with self-supervision and multi-layer cross attention

Kaipeng XUE 1,2, Tao XU 1,2, Chunjie LIAO 1,2

  1. Institute of China National Information Technology, Northwest Minzu University, Lanzhou, Gansu 730030, China
    2. Key Laboratory of Linguistic and Cultural Computing, Ministry of Education (Northwest Minzu University), Lanzhou, Gansu 730030, China
  • Received: 2023-09-06 Revised: 2023-11-16 Accepted: 2023-11-20 Online: 2024-08-22 Published: 2024-08-10
  • Contact: Tao XU
  • About author: XUE Kaipeng, born in 2000 in Qingdao, Shandong, M. S., CCF member. His research interests include multimodal fusion.
    XU Tao, born in 1986 in Guang'an, Sichuan, Ph. D., associate professor, CCF member. His research interests include artificial intelligence, knowledge graph, information retrieval, and archival informatization. E-mail: alfredxly@163.com
    LIAO Chunjie, born in 1997 in Chongqing, M. S., CCF member. Her research interests include knowledge graph and recommender system.
  • Supported by:
This work is partially supported by the Youth Doctoral Fund of Gansu Universities (2022QB-016), the Fundamental Research Funds for the Central Universities (31920230069), the Youth Science and Technology Program of Gansu Province (21JR1RA21), and the Technology Project of the National Archives Administration of China (2021-X-56).

Abstract:

Aiming at the problems of incomplete intra-modal information, poor inter-modal interaction, and training difficulty in multimodal sentiment analysis, a Visual-and-Language Pre-training (VLP) model was applied to multimodal sentiment analysis, and a Multimodal Sentiment analysis network with Self-supervision and Multi-layer cross Attention fusion (MSSM) was proposed. The visual encoder module was enhanced through self-supervised learning, and multi-layer cross attention was added to better model textual and visual features, so that intra-modal information became more abundant and complete, and inter-modal interaction became more sufficient. Besides, FlashAttention, a fast, memory-efficient, IO-aware exact attention algorithm, was adopted to address the high complexity of attention computation in Transformer. Experimental results show that, compared with the current mainstream model Contrastive Language-Image Pre-training (CLIP), MSSM improves the accuracy by 3.6 percentage points on the processed MVSA-S dataset and by 2.2 percentage points on the MVSA-M dataset, verifying that the proposed network can effectively improve the integrity of multimodal information fusion while reducing computational cost.
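
As a concrete illustration of the fusion mechanism described above, the following PyTorch sketch stacks cross-attention layers in both the text-to-image and image-to-text directions. It is a minimal sketch under assumed settings (768-dimensional features, 8 heads, 2 fusion layers, and all module names are illustrative), not the authors' released implementation; the attention call uses torch.nn.functional.scaled_dot_product_attention, which can dispatch to a FlashAttention kernel on supported hardware in PyTorch 2.0 and later.

```python
# Illustrative sketch only; dimensions, depth, and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionLayer(nn.Module):
    """One cross-attention block: queries from one modality attend to the other."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.kv_proj = nn.Linear(dim, 2 * dim)
        self.out_proj = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, context):
        b, n, d = x.shape
        # Queries come from x; keys and values come from the other modality.
        q = self.q_proj(x).view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(context).chunk(2, dim=-1)
        k = k.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, -1, self.num_heads, self.head_dim).transpose(1, 2)
        # scaled_dot_product_attention can dispatch to a FlashAttention kernel,
        # avoiding materialization of the full attention matrix.
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.norm(x + self.out_proj(out))


class MultiLayerCrossFusion(nn.Module):
    """Multi-layer, bidirectional cross attention between text and image features."""

    def __init__(self, dim: int = 768, num_layers: int = 2):
        super().__init__()
        self.t2v = nn.ModuleList(CrossAttentionLayer(dim) for _ in range(num_layers))
        self.v2t = nn.ModuleList(CrossAttentionLayer(dim) for _ in range(num_layers))

    def forward(self, text, image):
        for t_layer, v_layer in zip(self.t2v, self.v2t):
            text, image = t_layer(text, image), v_layer(image, text)
        return text, image


if __name__ == "__main__":
    fusion = MultiLayerCrossFusion()
    text = torch.randn(4, 32, 768)   # 4 samples, 32 text tokens
    image = torch.randn(4, 49, 768)  # 4 samples, 49 image patches
    t, v = fusion(text, image)
    print(t.shape, v.shape)  # torch.Size([4, 32, 768]) torch.Size([4, 49, 768])
```

In a full model along the lines of the abstract, the text features would come from a language encoder and the image features from the self-supervised visual encoder before entering the fusion stack.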

Key words: multimodal, sentiment analysis, self-supervision, attention mechanism, Visual-and-Language Pre-training (VLP) model

CLC Number: