Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (9): 2764-2772. DOI: 10.11772/j.issn.1001-9081.2024091262

• Artificial intelligence •

Emotion recognition method compatible with missing modal reasoning

Bing YIN1, Zhenhua LING2, Yin LIN1,2(), Changfeng XI1, Ying LIU1   

  1. iFLYTEK Company Limited, Hefei, Anhui 230000, China
    2. School of Information Science and Technology, University of Science and Technology of China, Hefei, Anhui 230026, China
  • Received: 2024-09-06 Revised: 2024-11-25 Accepted: 2024-11-26 Online: 2024-12-20 Published: 2025-09-10
  • Contact: Yin LIN
  • About author: YIN Bing, born in 1983, Ph.D., senior engineer. Her research interests include computer vision and multimodal perception.
    LING Zhenhua, born in 1979, Ph.D., professor. His research interests include speech signal processing and natural language processing.
    XI Changfeng, born in 1993, M.S. Her research interests include emotion recognition.
    LIU Ying, born in 1998, M.S. Her research interests include emotion recognition.
  • Supported by:
    National Key Research and Development Program of China(2022YFB4500600)


Abstract:

Aiming at the model compatibility problem caused by missing modalities in real, complex scenes, an emotion recognition method was proposed that supports input from any available modality. Firstly, a modality-random-dropout training strategy was adopted during both the pre-training and fine-tuning stages to ensure model compatibility at inference time. Secondly, a spatio-temporal masking strategy and a feature fusion strategy based on a cross-modal attention mechanism were proposed, so as to reduce the risk of over-fitting and to enhance cross-modal feature fusion, respectively. Finally, to address the noisy-label problem caused by inconsistent emotion labels across modalities, an adaptive denoising strategy based on multi-prototype clustering was proposed: class centers were set separately for each modality, and noisy labels were removed by checking the consistency between the cluster assignment of each modality's features and the annotated label. Experimental results show that on a self-built dataset, compared with the baseline Audio-Visual Hidden unit Bidirectional Encoder Representations from Transformers (AV-HuBERT), the proposed method improves the Weighted Average Recall (WAR) by 6.98 percentage points for modality-aligned inference, by 4.09 percentage points when the video modality is absent, and by 33.05 percentage points when the audio modality is absent; on the public video dataset DFEW, the proposed method achieves the highest WAR among the compared methods, reaching 68.94%.
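The modality-random-dropout strategy described above can be illustrated with a minimal sketch. This is a hypothetical implementation, not the paper's exact training schedule: the function name `random_modality_dropout`, the drop probability `p_drop`, and the choice of zeroing features rather than removing tokens are all assumptions for illustration. The key invariant is that at most one modality is dropped per sample, so the model sees audio-only, video-only, and audio-visual inputs during training and remains compatible with any of them at inference.

```python
import numpy as np

def random_modality_dropout(audio_feats, video_feats, p_drop=0.3, rng=None):
    """Randomly zero out at most one modality's features.

    With probability p_drop the audio features are dropped, with
    probability p_drop the video features are dropped, and otherwise
    both modalities are kept; audio and video are never dropped together.
    """
    rng = rng or np.random.default_rng()
    r = rng.random()
    drop_audio = r < p_drop
    drop_video = (not drop_audio) and (r < 2 * p_drop)
    a = np.zeros_like(audio_feats) if drop_audio else audio_feats
    v = np.zeros_like(video_feats) if drop_video else video_feats
    return a, v, (drop_audio, drop_video)
```

Applied per training sample (or per batch), this exposes the downstream fusion layers to every input condition they may encounter when a modality is missing at inference time.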

Key words: emotion recognition, multimodal, modality absence, pre-training, deep learning
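The multi-prototype denoising strategy from the abstract can likewise be sketched. This is an illustrative simplification under stated assumptions: the function `filter_noisy_labels`, the use of one prototype per class per modality, and the squared-Euclidean nearest-prototype assignment are all hypothetical choices, not the paper's exact formulation. The idea it demonstrates is the one the abstract describes: each modality's features are assigned to their nearest class center, and a sample's label is treated as noisy (and removed) when any modality's cluster assignment disagrees with it.

```python
import numpy as np

def filter_noisy_labels(feats_by_modality, labels, prototypes_by_modality):
    """Keep a sample only if every modality agrees with its label.

    feats_by_modality:      {modality: array of shape (n_samples, dim)}
    labels:                 int array of shape (n_samples,)
    prototypes_by_modality: {modality: array of shape (n_classes, dim)},
                            one class-center prototype per emotion class.
    Returns a boolean mask over samples; False marks a noisy label.
    """
    keep = np.ones(len(labels), dtype=bool)
    for modality, feats in feats_by_modality.items():
        protos = prototypes_by_modality[modality]
        # squared Euclidean distance from each sample to each prototype
        d = ((feats[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
        pred = d.argmin(axis=1)          # nearest-prototype class per sample
        keep &= (pred == labels)         # disagreement flags the label as noisy
    return keep
```

In practice such a mask would be recomputed as features improve during training, so that samples whose modality-specific cluster assignments all match the annotation are retained for the supervised loss.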

