Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (8): 2400-2406. DOI: 10.11772/j.issn.1001-9081.2023081160

• Artificial Intelligence •

Fusion of coordinate and multi-head attention mechanisms for interactive speech emotion recognition

GAO Pengqi 1,2, HUANG Heming 1,2, FAN Yonghong 1,2

  1. College of Computer, Qinghai Normal University, Xining, Qinghai 810008, China
    2. State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining, Qinghai 810008, China
  • Received: 2023-08-29  Revised: 2023-11-10  Accepted: 2023-11-20  Online: 2024-08-22  Published: 2024-08-10
  • Corresponding author: HUANG Heming
  • About the authors: GAO Pengqi (1998—), female, born in Fuxin, Liaoning, M.S. candidate; her main research interests include pattern recognition and intelligent systems, and speech emotion recognition.
    HUANG Heming (1969—), male, born in Ledu, Qinghai, Ph.D., professor; his main research interests include pattern recognition and intelligent systems. E-mail: huanghm@qhnu.edu.cn
    FAN Yonghong (1997—), female, born in Wuzhong, Ningxia, Ph.D. candidate; her main research interests include pattern recognition and intelligent systems, and speech emotion recognition.
  • Supported by:
    National Natural Science Foundation of China (620660039); Natural Science Foundation of Qinghai Province (2022-ZJ-925); Innovation and Intelligence Program for Disciplines in Colleges and Universities (D20035)

Fusion of coordinate and multi-head attention mechanisms for interactive speech emotion recognition

Pengqi GAO 1,2, Heming HUANG 1,2, Yonghong FAN 1,2

  1. College of Computer, Qinghai Normal University, Xining, Qinghai 810008, China
    2. State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining, Qinghai 810008, China
  • Received: 2023-08-29  Revised: 2023-11-10  Accepted: 2023-11-20  Online: 2024-08-22  Published: 2024-08-10
  • Contact: Heming HUANG
  • About the authors: GAO Pengqi, born in 1998, M.S. candidate. Her research interests include pattern recognition and intelligent systems, and speech emotion recognition.
    HUANG Heming, born in 1969, Ph.D., professor. His research interests include pattern recognition and intelligent systems.
    FAN Yonghong, born in 1997, Ph.D. candidate. Her research interests include pattern recognition and intelligent systems, and speech emotion recognition.
  • Supported by:
    This work is partially supported by National Natural Science Foundation of China (620660039); Qinghai Provincial Natural Science Foundation (2022-ZJ-925); Innovation and Intelligence Program for Disciplines in Colleges and Universities (D20035).

Abstract:

Speech Emotion Recognition (SER) is an important and challenging task in human-computer interaction systems. To address the problems of single-type features and weak interaction among features in current SER systems, a Multi-input Interactive Attention Network (MIAN) was proposed. The network consists of two sub-networks: a specific-feature coordinate residual attention network and a shared-feature multi-head attention network. The former uses Res2Net and a coordinate attention module to learn the specific features extracted from raw speech and to generate multi-scale feature representations, strengthening the model's ability to represent emotion-related information. The latter fuses the features obtained by the preceding network into shared features, which are fed through a Bidirectional Long Short-Term Memory (BiLSTM) network into a multi-head attention module; this module attends to relevant information in different feature subspaces simultaneously, strengthens the interaction among features, and captures highly discriminative features. Through the collaboration of the two sub-networks, the diversity of the model's features and the interaction among features are both increased. During training, a dual-loss function provides joint supervision, making samples of the same class more compact and samples of different classes more separated. Experimental results show that MIAN achieves weighted average accuracies of 91.43% and 76.33% on the EMO-DB and IEMOCAP corpora respectively, giving better classification performance than other mainstream models.
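
For readers who want a concrete picture of the coordinate attention step mentioned above, the sketch below shows a generic coordinate attention module applied to a spectrogram-like feature map, such as the output of a Res2Net block. This is a minimal illustration assuming PyTorch; it is not the authors' implementation, and all channel counts and shapes are made up for the example.

```python
# Minimal sketch (assumed PyTorch), not the authors' code: generic coordinate
# attention over a (N, C, H, W) feature map, e.g. the output of a Res2Net block
# computed from a log-Mel spectrogram. All sizes below are illustrative.
import torch
import torch.nn as nn


class CoordinateAttention(nn.Module):
    """Re-weights a feature map along its two spatial axes separately."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # pool over width  -> (N, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # pool over height -> (N, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        x_h = self.pool_h(x)                            # (N, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)        # (N, C, W, 1)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        attn_h = torch.sigmoid(self.conv_h(y_h))                      # (N, C, H, 1)
        attn_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (N, C, 1, W)
        return x * attn_h * attn_w                      # position-aware re-weighting


if __name__ == "__main__":
    feats = torch.randn(4, 64, 40, 126)          # batch of 4 hypothetical feature maps
    print(CoordinateAttention(64)(feats).shape)  # torch.Size([4, 64, 40, 126])
```

For a spectrogram-like input, the height-wise and width-wise attention vectors act as frequency-wise and time-wise weights, which is one reason such a module is attractive for locating emotion-related cues.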

Key words: speech emotion recognition, coordinate attention mechanism, multi-head attention mechanism, specific feature learning, shared feature learning

Abstract:

Speech Emotion Recognition (SER) is an important and challenging task in human-computer interaction systems. To address the issues of single-feature representation and weak feature interaction in current SER systems, a Multi-input Interactive Attention Network (MIAN) was proposed. The proposed network consists of two sub-networks, namely the specific feature coordinate residual attention network and the shared feature multi-head attention network. The former utilized Res2Net and coordinate attention modules to learn specific features extracted from raw speech and generate multi-scale feature representations, enhancing the model's ability to represent emotion-related information. The latter integrated the features obtained from the forward network to form shared features, which were then input into the multi-head attention module via a Bidirectional Long Short-Term Memory (BiLSTM) network. This setup allowed for simultaneous attention to relevant information in different feature subspaces, enhancing the interaction among features and capturing highly discriminative features. The collaboration of the two sub-networks mentioned above increased the diversity of features and improved the interaction capability among features. During the training process, a dual-loss function was applied for joint supervision, aiming to make the samples of the same class more compact and the samples of different classes more separated. The experimental results demonstrate that the proposed model achieves a weighted average accuracy of 91.43% on the EMO-DB corpus and 76.33% on the IEMOCAP corpus. Compared to other state-of-the-art models, the proposed model exhibits superior classification performance.
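
To make the shared-feature branch described above more tangible, here is a rough sketch that chains a BiLSTM, a multi-head self-attention module, and a classifier, trained with two losses under joint supervision. The abstract does not name the two losses; a softmax cross-entropy term plus a center-loss-style compactness term is assumed here only because it matches the stated goal of pulling same-class samples together and pushing different classes apart. The class name SharedFeatureBranch, all layer sizes, and the weight lam are illustrative, not taken from the paper.

```python
# Rough sketch (assumed PyTorch) of a shared-feature branch: BiLSTM ->
# multi-head self-attention -> classifier, with dual-loss joint supervision.
# The loss pairing and all hyperparameters are assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedFeatureBranch(nn.Module):
    def __init__(self, feat_dim=128, hidden=128, heads=4, num_classes=7):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.mha = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)
        # one learnable center per emotion class, for the center-loss-style term
        self.centers = nn.Parameter(torch.randn(num_classes, 2 * hidden))

    def forward(self, shared_feats):              # (N, T, feat_dim)
        h, _ = self.bilstm(shared_feats)          # (N, T, 2*hidden)
        a, _ = self.mha(h, h, h)                  # self-attention over time steps
        emb = a.mean(dim=1)                       # utterance-level embedding
        return emb, self.classifier(emb)

    def loss(self, emb, logits, labels, lam=0.1):
        ce = F.cross_entropy(logits, labels)                          # class separation
        compact = ((emb - self.centers[labels]) ** 2).sum(dim=1).mean()  # intra-class compactness
        return ce + lam * compact                 # dual-loss joint supervision


if __name__ == "__main__":
    model = SharedFeatureBranch()
    x = torch.randn(8, 126, 128)                  # 8 utterances, 126 frames, 128-dim features
    y = torch.randint(0, 7, (8,))
    emb, logits = model(x)
    print(model.loss(emb, logits, y).item())
```

In the actual MIAN, the input to this branch would be the shared features fused from the specific-feature sub-network rather than the random tensors used in this toy example.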

Key words: Speech Emotion Recognition (SER), coordinate attention mechanism, multi-head attention mechanism, specific feature learning, shared feature learning

CLC number: