Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (6): 1869-1875. DOI: 10.11772/j.issn.1001-9081.2021040578

• Artificial Intelligence •

End-to-end speech emotion recognition based on multi-head attention

Lei YANG1, Hongdong ZHAO1, Kuaikuai YU2

  1. School of Electronics and Information Engineering, Hebei University of Technology, Tianjin 300401, China
  2. Science and Technology on Electro-Optical Information Security Control Laboratory, Tianjin 300308, China
  • Received: 2021-04-14 Revised: 2021-07-19 Accepted: 2021-07-23 Online: 2022-06-22 Published: 2022-06-10
  • Contact: Hongdong ZHAO
  • About author: YANG Lei, born in 1978 in Dunhua, Jilin, Ph. D. candidate, CCF member. His research interests include intelligent information processing.
    YU Kuaikuai, born in 1988 in Tianjin, M. S., engineer. His research interests include electronic information.
  • Supported by:
    Fund of Science and Technology on Electro-Optical Information Security Control Laboratory (614210701041705)

Abstract:

Speech emotion datasets are small and high-dimensional. Traditional Recurrent Neural Networks (RNNs) lose long-range dependencies, and Convolutional Neural Networks (CNNs) focus on local information, so the latent relationships among the frames of an input sequence are not fully exploited. To address these problems, a neural network named MHA-SVM, based on Multi-Head Attention (MHA) and Support Vector Machine (SVM), was proposed for Speech Emotion Recognition (SER). First, the raw audio data were input into the MHA network to train its parameters and obtain the MHA classification results. Then, the same raw audio data were input into the pre-trained MHA again for feature extraction. Finally, the extracted features were passed through a fully connected layer and classified by an SVM to obtain the MHA-SVM classification results. After a thorough evaluation of how the number of heads and the number of layers in the MHA module affect the results, MHA-SVM was found to reach a highest recognition accuracy of 69.6% on the IEMOCAP dataset. The experimental results indicate that an end-to-end model based on the MHA mechanism is better suited to SER tasks than models based on RNN or CNN.

Key words: Speech Emotion Recognition (SER), Multi-Head Attention (MHA), Convolutional Neural Network (CNN), Support Vector Machine (SVM), end-to-end
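
The abstract's central point, that attention relates every frame of an utterance to every other frame regardless of distance, can be illustrated with a minimal sketch of multi-head self-attention. This is not the authors' network: the weights are random stand-ins for trained parameters, the sizes (5 frames, 8 dimensions, 2 heads) are arbitrary assumptions, and `mean_pool` is a hypothetical pooling step chosen only to produce one fixed-length vector of the kind an SVM could classify.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters (assumption)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def multi_head_self_attention(frames, num_heads, rng):
    """Return (output sequence, attention weights per head) for a list of frames."""
    d_model = len(frames[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    head_outputs, head_weights = [], []
    for _ in range(num_heads):
        w_q = random_matrix(d_model, d_head, rng)
        w_k = random_matrix(d_model, d_head, rng)
        w_v = random_matrix(d_model, d_head, rng)
        q, k, v = matmul(frames, w_q), matmul(frames, w_k), matmul(frames, w_v)
        # Scores compare each frame with all frames, however far apart they are.
        k_t = [list(col) for col in zip(*k)]
        scores = matmul(q, k_t)
        weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        head_outputs.append(matmul(weights, v))
        head_weights.append(weights)
    # Concatenate the heads frame by frame, then apply the output projection.
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(frames))]
    w_o = random_matrix(d_model, d_model, rng)
    return matmul(concat, w_o), head_weights


def mean_pool(sequence):
    """Average over time to get one utterance-level feature vector (assumption)."""
    n = len(sequence)
    return [sum(col) / n for col in zip(*sequence)]


rng = random.Random(0)
frames = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]  # 5 frames, 8 dims
output, attention = multi_head_self_attention(frames, num_heads=2, rng=rng)
features = mean_pool(output)
print(len(output), len(output[0]), len(features))  # prints: 5 8 8
```

Each row of `attention[h]` is a probability distribution over all five frames, so the first frame can draw on the last as directly as on its neighbour. In the two-stage scheme the abstract describes, a vector like `features` would come from the pre-trained MHA and be handed to a separately fitted SVM; that classifier is omitted here.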

CLC Number: