《计算机应用》 (Journal of Computer Applications), official website


Emotion recognition method compatible with missing-modality inference

YIN Bing1, LING Zhenhua2, LIN Yin1,3, XI Changfeng1, LIU Ying1

  1. iFLYTEK Co., Ltd.; 2. Department of Electronic Engineering and Information Science, University of Science and Technology of China; 3. Department of Automation, University of Science and Technology of China
  • Received: 2024-09-04; Revised: 2024-11-25; Online: 2024-12-20; Published: 2024-12-20
  • Corresponding author: LIN Yin
  • About the authors: YIN Bing (born 1983), female, from Zaozhuang, Shandong; senior engineer, Ph.D., CCF member; research interests: computer vision, multimodal perception. LING Zhenhua (born 1979), male, from Hefei, Anhui; professor, Ph.D., CCF member; research interests: speech signal processing, natural language processing. LIN Yin (born 1991), male, from Sanming, Fujian; Ph.D. candidate, CCF member; research interests: pattern recognition, multimodal perceptual interaction. XI Changfeng (born 1993), female, from Ma'anshan, Anhui; M.S.; research interest: emotion recognition. LIU Ying (born 1998), female, from Weinan, Shaanxi; M.S.; research interest: emotion recognition.
  • Supported by:
    National Key R&D Program of China (2022YFB4500600)

Emotion recognition method compatible with missing-modality inference

YIN Bing1, LING Zhenhua2, LIN Yin1,3, XI Changfeng1, LIU Ying1   

  1. iFLYTEK Research; 2. Department of Electronic Engineering and Information Science, University of Science and Technology of China; 3. Department of Automation, University of Science and Technology of China
  • Received: 2024-09-04; Revised: 2024-11-25; Online: 2024-12-20; Published: 2024-12-20
  • About the authors: YIN Bing, born in 1983, Ph.D., senior engineer. Her research interests include computer vision and multimodal perception. LING Zhenhua, born in 1979, Ph.D., professor. His research interests include speech signal processing and natural language processing. LIN Yin, born in 1991, Ph.D. candidate. His research interests include pattern recognition and multimodal perceptual interaction. XI Changfeng, born in 1993, M.S. Her research interests include emotion recognition. LIU Ying, born in 1998, M.S. Her research interests include emotion recognition.
  • Supported by:
    National Key R&D Program of China (2022YFB4500600)

Abstract: To address the model-compatibility problem caused by missing modalities in real-world complex scenes, an emotion recognition method supporting arbitrary modality input is proposed. First, during the pre-training and fine-tuning stages, a random modality dropout training strategy ensures model compatibility at inference time. Second, a spatio-temporal masking strategy and a feature fusion mechanism based on cross-modal attention are proposed to reduce the risk of overfitting and improve cross-modal feature fusion. Finally, to address the noisy-label problem caused by inconsistent emotion labels across modalities, an adaptive denoising strategy based on multi-prototype clustering is proposed: class centers are set separately for each modality, and noisy labels are removed by checking the consistency between the cluster assignment of each modality's features and its label. Experiments on a self-built dataset show that, compared with the baseline, the proposed model improves WAR by 6.98 percentage points with all modalities aligned, by 4.09 percentage points with the video modality missing, and by 33.05 percentage points with the audio modality missing. Compared with existing methods on the public video dataset DFEW, the proposed model also achieves the highest WAR, 68.94%.
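The multi-prototype denoising step can be sketched as follows: per-modality class centers (prototypes) are kept, each sample's feature is assigned to its nearest prototype, and a label is treated as noisy when any modality's cluster assignment disagrees with it. This is a minimal NumPy illustration under assumed names (`flag_noisy_labels` is hypothetical); the paper's exact distance metric and agreement rule may differ.

```python
import numpy as np

def flag_noisy_labels(feats_by_mod, labels, prototypes_by_mod):
    """For each modality, assign each sample to its nearest class
    prototype; a sample's label is kept only if every modality's
    nearest-prototype class agrees with the label."""
    labels = np.asarray(labels)
    keep = np.ones(len(labels), dtype=bool)
    for m, X in feats_by_mod.items():
        P = prototypes_by_mod[m]                      # (num_classes, dim)
        # squared Euclidean distance of every sample to every prototype: (N, C)
        d = ((X[:, None, :] - P[None, :, :]) ** 2).sum(-1)
        keep &= d.argmin(axis=1) == labels            # all modalities must agree
    return keep
```

A sample lying near the class-0 prototype in every modality but labeled 1 would be flagged as noisy and excluded from training.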

Keywords: emotion recognition, multimodality, missing modality, pre-training, deep learning

Abstract: Aiming at the problem of model compatibility caused by missing modalities in real-world complex scenes, an emotion recognition method compatible with missing-modality inference was proposed. Firstly, during the pre-training and fine-tuning stages, a random modality dropout training strategy was employed to ensure model compatibility during inference. Secondly, a spatio-temporal masking strategy and a feature fusion mechanism based on cross-modal attention were proposed to reduce the risk of overfitting and improve cross-modal feature fusion. Finally, to tackle the noisy-label problem caused by inconsistent emotion labels across modalities, an adaptive denoising strategy based on multi-prototype clustering was proposed: class centers were set separately for each modality, and noisy labels were removed by comparing the cluster assignment of each modality's features with its label. Experimental results on a self-built dataset show that, compared with the baseline, the proposed model improves WAR by 6.98 percentage points with all modalities aligned, by 4.09 percentage points with the video modality missing, and by 33.05 percentage points with the audio modality missing. Compared with existing methods on the public video dataset DFEW, the proposed model also achieves the highest WAR, reaching 68.94%.
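As a rough illustration of the random modality dropout strategy described in the abstract, the sketch below independently zeroes out each modality during training while guaranteeing that at least one modality survives, so the model sees every missing-modality pattern it may face at inference time. Function and parameter names (`random_modality_dropout`, `p_drop`) are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_modality_dropout(feats, p_drop=0.3, rng=rng):
    """Randomly drop (zero out) each modality with probability p_drop,
    but always keep at least one modality so the sample stays usable."""
    names = list(feats)
    kept = {m: rng.random() >= p_drop for m in names}
    if not any(kept.values()):            # guarantee at least one survivor
        kept[rng.choice(names)] = True
    return {m: x if kept[m] else np.zeros_like(x)
            for m, x in feats.items()}, kept
```

Zeroing (rather than removing) the dropped modality keeps tensor shapes fixed, which is one common way to make a single network handle arbitrary modality subsets.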

Key words: emotion recognition, multi-modality, modality absence, pre-training, deep learning

CLC number: