
Fusion of Coordinate and Multi-Head Attention Mechanisms for Interactive Speech Emotion Recognition

GAO Pengqi, HUANG Heming, FAN Yonghong

  1. Qinghai Normal University
  • Received: 2023-08-29  Revised: 2023-11-10  Published online: 2023-12-18
  • Corresponding author: GAO Pengqi
  • Supported by:
    National Natural Science Foundation of China; Natural Science Foundation of Qinghai Province; Program of Introducing Talents of Discipline to Universities (111 Project)


Abstract: Speech emotion recognition is an important and challenging task in human-computer interaction systems. To address the problems of homogeneous features and weak interaction among features in current speech emotion recognition systems, a multi-input interactive attention network is proposed. The network consists of two sub-networks: a specific-feature coordinate residual attention network and a shared-feature multi-head attention network. The former uses Res2Net and coordinate attention modules to learn specific features extracted from raw speech and to generate multi-scale feature representations, enhancing the model's ability to represent emotion-related information. The latter fuses the features obtained by the preceding networks into shared features, which are passed through a BLSTM and fed into a multi-head attention module; by attending simultaneously to relevant information in different feature subspaces, it strengthens the interaction among features and captures highly discriminative features. Through the synergy of the two sub-networks, the model increases feature diversity and improves the interaction among features. During training, two loss functions are applied for joint supervision, making samples of the same class more compact and samples of different classes more separated. Experiments show that the proposed model achieves weighted average accuracies of 91.43% and 76.33% on the EMO-DB and IEMOCAP corpora, respectively, outperforming other mainstream models in classification performance.
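The following is a minimal PyTorch-style sketch of the shared-feature branch and the dual-loss supervision described in the abstract. The layer sizes, number of attention heads, mean pooling over time, and the center-loss-style compactness term with weight lambda_c are illustrative assumptions, not the settings reported in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedFeatureBranch(nn.Module):
    def __init__(self, input_dim=128, hidden_dim=128, num_heads=4, num_classes=7):
        super().__init__()
        # Bidirectional LSTM over the fused (shared) feature sequence.
        self.blstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                             bidirectional=True)
        # Multi-head self-attention attends to different feature subspaces
        # of the BLSTM output simultaneously.
        self.mha = nn.MultiheadAttention(embed_dim=2 * hidden_dim,
                                         num_heads=num_heads,
                                         batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, shared_features):
        # shared_features: (batch, time, input_dim), e.g. features fused
        # from the specific-feature sub-networks.
        h, _ = self.blstm(shared_features)      # (batch, time, 2*hidden_dim)
        attn_out, _ = self.mha(h, h, h)         # self-attention over time
        embedding = attn_out.mean(dim=1)        # utterance-level embedding
        logits = self.classifier(embedding)
        return logits, embedding


class CenterLoss(nn.Module):
    """Pulls embeddings of the same class toward a learnable class center."""
    def __init__(self, num_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, embeddings, labels):
        return ((embeddings - self.centers[labels]) ** 2).sum(dim=1).mean()


# Joint supervision with a dual loss: cross-entropy separates different
# classes, while the compactness term draws same-class samples together.
model = SharedFeatureBranch()
center_loss = CenterLoss(num_classes=7, feat_dim=256)
lambda_c = 0.1  # assumed trade-off weight

x = torch.randn(8, 100, 128)          # dummy batch: 8 utterances, 100 frames
y = torch.randint(0, 7, (8,))
logits, emb = model(x)
loss = F.cross_entropy(logits, y) + lambda_c * center_loss(emb, y)
loss.backward()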

Key words: Speech emotion recognition, coordinate attention mechanism, multi-head attention mechanism, specific feature learning, shared feature learning

CLC Number: