Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (6): 1869-1875. DOI: 10.11772/j.issn.1001-9081.2021040578
Special topic: Artificial Intelligence
End-to-end speech emotion recognition based on multi-head attention
Lei YANG, Hongdong ZHAO, Kuaikuai YU
Received:
2021-04-14
Revised:
2021-07-19
Accepted:
2021-07-23
Online:
2022-06-22
Published:
2022-06-10
Contact:
Hongdong ZHAO
About author:
YANG Lei, born in 1978 in Dunhua, Jilin, is a Ph.D. candidate and CCF member. His research interests include intelligent information processing.
Abstract:
Speech emotion datasets are small in scale yet high in dimensionality. Traditional recurrent neural networks (RNN) suffer from vanishing long-range dependencies, while convolutional neural networks (CNN) attend mainly to local information, so the latent relationships among frames of an input sequence are not fully exploited. To address these problems, a neural network based on multi-head attention (MHA) and a support vector machine (SVM), named MHA-SVM, was proposed for speech emotion recognition (SER). First, raw audio data were fed into the MHA network to train its parameters and obtain the MHA classification results. Then, the raw audio was passed again through the pretrained MHA to extract features. Finally, after a fully connected layer, the extracted features were classified by the SVM to produce the MHA-SVM results. After a thorough evaluation of how the number of heads and layers in the MHA module affects performance, MHA-SVM reached a best recognition accuracy of 69.6% on the IEMOCAP dataset. The experimental results show that, compared with RNN- and CNN-based models, the end-to-end model based on the MHA mechanism is better suited to the SER task.
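To illustrate the attention mechanism at the core of this pipeline, below is a minimal NumPy sketch of one multi-head self-attention layer. The sequence length (277 frames) and model width (128) echo Table 2, but the weight matrices here are random stand-ins, not the trained parameters of the paper's model.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """One multi-head self-attention layer over a (frames, d_model) sequence."""
    t, d = x.shape
    d_head = d // n_heads
    q, k, v = x @ w_q, x @ w_k, x @ w_v                    # (t, d) each
    # split the model dimension into heads: (n_heads, t, d_head)
    split = lambda m: m.reshape(t, n_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)    # (n_heads, t, t)
    out = softmax(scores) @ v                              # (n_heads, t, d_head)
    out = out.transpose(1, 0, 2).reshape(t, d)             # concatenate heads
    return out @ w_o

# toy dimensions echoing Table 2: 277 frames, d_model = 128, 8 heads
rng = np.random.default_rng(0)
t, d, h = 277, 128, 8
x = rng.standard_normal((t, d))
ws = [rng.standard_normal((d, d)) * 0.02 for _ in range(4)]
y = multi_head_attention(x, *ws, n_heads=h)
print(y.shape)  # (277, 128): attention preserves the sequence shape
```

Because every frame attends to every other frame in one step, the layer captures long-range relationships that an RNN must propagate through time and a CNN can only reach through stacked local kernels.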
Lei YANG, Hongdong ZHAO, Kuaikuai YU. End-to-end speech emotion recognition based on multi-head attention[J]. Journal of Computer Applications, 2022, 42(6): 1869-1875.
| No. | Emotion category | Count |
| --- | --- | --- |
| 0 | Angry | 1 103 |
| 1 | Sad | 1 084 |
| 2 | Happy | 1 636 |
| 3 | Neutral | 1 708 |
| | Total | 5 531 |
Tab. 1 IEMOCAP dataset description of different categories
| Layer type | Output size | Kernel size / stride |
| --- | --- | --- |
| Input | 277, 160, 1 | |
| Conv 1 | 277, 128, 32 | 1 × 33 / 1 |
| Conv 2 | 277, 128, 1 | 1 × 1 / 1 |
| Transformer | 277, 128 | |
| Global average pooling | 128 | |
| Fully connected | 64 | |
| Dropout | 64 | |
| Softmax | 4 | |
Tab. 2 Parameters of MHA
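The front-end sizes in Table 2 follow from the standard output-length formula for an unpadded ("valid") convolution; a quick check, with the 160-point frame width and 1 × 33 kernel taken from the table:

```python
def conv_out_len(length, kernel, stride=1):
    """Output length of a 1-D valid convolution: (L - k) // s + 1."""
    return (length - kernel) // stride + 1

# Conv 1: 1 x 33 kernel, stride 1, along the 160-point frame axis
print(conv_out_len(160, 33))  # 128, matching the 277, 128, 32 output in Table 2
# Conv 2: a 1 x 1 kernel changes only the channel count, not the length
print(conv_out_len(128, 1))   # 128
```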
| Heads | Layers | MHA | MHA-SVM | MHA-LR | MHA-KNN |
| --- | --- | --- | --- | --- | --- |
| 2 | 1 | 65.1 | 67.3 | 65.8 | 66.0 |
| 2 | 2 | 63.7 | 65.2 | 63.6 | 63.0 |
| 4 | 1 | 67.5 | 68.8 | 67.0 | 67.1 |
| 4 | 2 | 66.1 | 67.3 | 65.8 | 64.5 |
| 8 | 1 | 67.7 | 69.6 | 66.9 | 67.3 |
| 8 | 2 | 67.1 | 69.0 | 67.2 | 66.9 |
Tab. 3 Comparison of recognition accuracy of models under different numbers of heads and layers (unit: %)
| Emotion category | Precision | Recall | F1 score |
| --- | --- | --- | --- |
| Angry | 77.3 | 68.2 | 72.4 |
| Sad | 67.2 | 84.5 | 74.9 |
| Happy | 71.3 | 69.6 | 70.4 |
| Neutral | 72.4 | 68.2 | 70.2 |
Tab. 4 Performance comparison of MHA-SVM for 4 emotional categories on IEMOCAP dataset (unit: %)
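The F1 scores in Table 4 are the harmonic mean of precision and recall; a quick sanity check with the values copied from the table (small last-digit discrepancies are expected, since the published F1 was presumably computed from unrounded precision and recall):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (all values in %)."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs from Table 4
table4 = {"angry": (77.3, 68.2), "sad": (67.2, 84.5),
          "happy": (71.3, 69.6), "neutral": (72.4, 68.2)}
for emotion, (p, r) in table4.items():
    print(emotion, round(f1_score(p, r), 1))
```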
| Model | Input feature | Accuracy/% |
| --- | --- | --- |
| SVM [ | LLDs | 57.5 |
| SVM tree [ | Raw audio sequence | 60.9 |
| CNN-BLSTM [ | Raw audio sequence | 61.0 |
| ACRNN [ | Mel-spectrogram | 64.7 |
| A-BLSTM [ | Mel-spectrogram | 66.5 |
| MHA | Raw audio sequence | 67.7 |
| MHA-SVM | Raw audio sequence | 69.6 |
Tab. 5 Accuracy comparison of 7 models on IEMOCAP dataset
[1] SALAMON J, BELLO J P. Deep convolutional neural networks and data augmentation for environmental sound classification[J]. IEEE Signal Processing Letters, 2017, 24(3): 279-283. doi: 10.1109/lsp.2017.2657381
[2] LIM W, JANG D, LEE T. Speech emotion recognition using convolutional and recurrent neural networks[C]// Proceedings of the 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Piscataway: IEEE, 2016: 1-4. doi: 10.1109/apsipa.2016.7820699
[3] HINTON G, DENG L, YU D, et al. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups[J]. IEEE Signal Processing Magazine, 2012, 29(6): 82-97. doi: 10.1109/msp.2012.2205597
[4] SCHEDL M, GÓMEZ E, URBANO J. Music information retrieval: recent developments and applications[J]. Foundations and Trends in Information Retrieval, 2014, 8(2/3): 127-261. doi: 10.1561/1500000042
[5] MAO Q R, DONG M, HUANG Z W, et al. Learning salient features for speech emotion recognition using convolutional neural networks[J]. IEEE Transactions on Multimedia, 2014, 16(8): 2203-2213. doi: 10.1109/tmm.2014.2360798
[6] ISSA D, DEMIRCI M F, YAZICI A. Speech emotion recognition with deep convolutional neural networks[J]. Biomedical Signal Processing and Control, 2020, 59: No.101894. doi: 10.1016/j.bspc.2020.101894
[7] XIE Y, LIANG R Y, LIANG Z L, et al. Speech emotion classification using attention-based LSTM[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(11): 1675-1685. doi: 10.1109/taslp.2019.2925934
[8] LYU H L, HU W P. Research on speech emotion recognition based on end-to-end deep neural network[J]. Journal of Guangxi Normal University (Natural Science Edition), 2021, 39(3): 20-26.
[9] LATIF S, RANA R, KHALIFA S, et al. Direct modelling of speech emotion from raw speech[EB/OL]. (2020-07-28) [2021-01-25]. doi: 10.21437/interspeech.2019-3252
[10] CHEN M Y, HE X J, YANG J, et al. 3-D convolutional recurrent neural networks with attention model for speech emotion recognition[J]. IEEE Signal Processing Letters, 2018, 25(10): 1440-1444. doi: 10.1109/lsp.2018.2860246
[11] ZHAO Z P, BAO Z T, ZHAO Y Q, et al. Exploring deep spectrum representations via attention-based recurrent and convolutional neural networks for speech emotion recognition[J]. IEEE Access, 2019, 7: 97515-97525. doi: 10.1109/ACCESS.2019.2928625
[12] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 6000-6010.
[13] RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training[EB/OL]. [2021-01-25].
[14] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg, PA: Association for Computational Linguistics, 2019: 4171-4186. doi: 10.18653/v1/n19-1423
[15] KARITA S, SOPLIN N E Y, WATANABE S, et al. Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration[C]// Proceedings of the 20th Annual Conference of the International Speech Communication Association. [S.l.]: ISCA, 2019: 1408-1412. doi: 10.21437/interspeech.2019-1938
[16] CHEN C, RYAD C, XING Y, et al. Research on speech emotion recognition based on improved GWO optimized SVM[J]. Computer Engineering and Applications, 2018, 54(16): 113-118. doi: 10.3778/j.issn.1002-8331.1704-0361
[17] YU H, YAN B C. Speech emotion recognition based on CTC-RNN[J]. Chinese Journal of Electron Devices, 2020, 43(4): 934-937. doi: 10.3969/j.issn.1005-9490.2020.04.043
[18] YOON S, BYUN S, JUNG K. Multimodal speech emotion recognition using audio and text[C]// Proceedings of the 2018 IEEE Spoken Language Technology Workshop. Piscataway: IEEE, 2018: 112-118. doi: 10.1109/slt.2018.8639583
[19] CHO J, PAPPAGARI R, KULKARNI P, et al. Deep neural networks for emotion recognition combining audio and transcripts[C]// Proceedings of the 19th Annual Conference of the International Speech Communication Association. [S.l.]: ISCA, 2018: 247-251. doi: 10.21437/interspeech.2018-2466
[20] ALDENEH Z, PROVOST E M. Using regional saliency for speech emotion recognition[C]// Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2017: 2741-2745. doi: 10.1109/icassp.2017.7952655
[21] WAN M T, McAULEY J. Item recommendation on monotonic behavior chains[C]// Proceedings of the 12th ACM Conference on Recommender Systems. New York: ACM, 2018: 86-94. doi: 10.1145/3240323.3240369
[22] XIA Q L, JIANG P, SUN F, et al. Modeling consumer buying decision for recommendation based on multi-task deep learning[C]// Proceedings of the 27th ACM International Conference on Information and Knowledge Management. New York: ACM, 2018: 1703-1706. doi: 10.1145/3269206.3269285
[23] CHERNYKH V, PRIKHODKO P. Emotion recognition from speech with recurrent neural networks[EB/OL]. (2018-07-05) [2021-01-25].
[24] TIAN L M, MOORE J D, LAI C. Emotion recognition in spontaneous and acted dialogues[C]// Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction. Piscataway: IEEE, 2015: 698-704. doi: 10.1109/acii.2015.7344645
[25] ROZGIĆ V, ANANTHAKRISHNAN S, SALEEM S, et al. Ensemble of SVM trees for multimodal emotion recognition[C]// Proceedings of the 2012 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Piscataway: IEEE, 2012: 1-4.