Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (6): 1869-1875. DOI: 10.11772/j.issn.1001-9081.2021040578
Special Issue: Artificial Intelligence
• Artificial intelligence •
Lei YANG1, Hongdong ZHAO1, Kuaikuai YU2
Received: 2021-04-14
Revised: 2021-07-19
Accepted: 2021-07-23
Online: 2022-06-22
Published: 2022-06-10
Contact: Hongdong ZHAO
About author: YANG Lei, born in 1978 in Dunhua, Jilin, is a Ph. D. candidate and CCF member. His research interests include intelligent information processing.
Lei YANG, Hongdong ZHAO, Kuaikuai YU. End-to-end speech emotion recognition based on multi-head attention[J]. Journal of Computer Applications, 2022, 42(6): 1869-1875.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2021040578
Tab. 1 IEMOCAP dataset description of different categories

| No. | Emotion category | Number |
| --- | --- | --- |
| 0 | Angry | 1 103 |
| 1 | Sad | 1 084 |
| 2 | Happy | 1 636 |
| 3 | Neutral | 1 708 |
|  | Total | 5 531 |
Tab. 2 Parameters of MHA

| Layer type | Output size | Kernel size / stride |
| --- | --- | --- |
| Input layer | 277, 160, 1 |  |
| Convolutional layer 1 | 277, 128, 32 | 1 × 33 / 1 |
| Convolutional layer 2 | 277, 128, 1 | 1 × 1 / 1 |
| Transformer layer | 277, 128 |  |
| Global average pooling layer | 128 |  |
| Fully connected layer | 64 |  |
| Dropout | 64 |  |
| Softmax layer | 4 |  |
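Table 2 lists only the layer types, output sizes and kernel/stride settings. The sketch below shows one way such a stack could be assembled in tf.keras; the feed-forward width, activation functions, normalization placement and dropout rate are illustrative assumptions not specified in the table.

```python
# Minimal sketch of the MHA network in Tab. 2, assuming tf.keras.
# Only the shapes and kernel/stride values come from the table; the feed-forward
# width, activations, normalization placement and dropout rate are assumptions.
import tensorflow as tf
from tensorflow.keras import layers


def build_mha_model(num_heads=8, num_layers=1, num_classes=4):
    inp = tf.keras.Input(shape=(277, 160, 1))                            # input layer: 277,160,1
    x = layers.Conv2D(32, (1, 33), strides=1, activation="relu")(inp)    # conv layer 1: 277,128,32
    x = layers.Conv2D(1, (1, 1), strides=1, activation="relu")(x)        # conv layer 2: 277,128,1
    x = layers.Reshape((277, 128))(x)                                    # 277 frames x 128 features
    for _ in range(num_layers):                                          # Transformer layer(s): 277,128
        attn = layers.MultiHeadAttention(num_heads=num_heads,
                                         key_dim=128 // num_heads)(x, x)
        x = layers.LayerNormalization()(x + attn)                        # residual + layer norm (assumed)
        ff = layers.Dense(128, activation="relu")(x)                     # feed-forward width assumed
        x = layers.LayerNormalization()(x + ff)
    x = layers.GlobalAveragePooling1D()(x)                               # global average pooling: 128
    x = layers.Dense(64, activation="relu")(x)                           # fully connected layer: 64
    x = layers.Dropout(0.5)(x)                                           # dropout rate assumed
    out = layers.Dense(num_classes, activation="softmax")(x)             # softmax layer: 4
    return tf.keras.Model(inp, out)
```

With 8 heads and key_dim = 16, the attention output keeps the 128-dimensional frame representation, which matches the Transformer layer output size given in the table.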
Tab. 3 Comparison of recognition accuracy of models under different numbers of heads and layers (accuracy/%)

| Heads | Layers | MHA | MHA-SVM | MHA-LR | MHA-KNN |
| --- | --- | --- | --- | --- | --- |
| 2 | 1 | 65.1 | 67.3 | 65.8 | 66.0 |
| 2 | 2 | 63.7 | 65.2 | 63.6 | 63.0 |
| 4 | 1 | 67.5 | 68.8 | 67.0 | 67.1 |
| 4 | 2 | 66.1 | 67.3 | 65.8 | 64.5 |
| 8 | 1 | 67.7 | 69.6 | 66.9 | 67.3 |
| 8 | 2 | 67.1 | 69.0 | 67.2 | 66.9 |
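Table 3 compares the end-to-end MHA model with three variants (MHA-SVM, MHA-LR, MHA-KNN) in which a conventional classifier replaces the softmax output. The sketch below shows one plausible way to reuse the model above as a feature extractor; the choice of the 64-dimensional fully connected layer as the feature point and the classifier hyperparameters are assumptions, not values taken from the paper.

```python
# Minimal sketch of the MHA-SVM / MHA-LR / MHA-KNN back-ends in Tab. 3, assuming
# the trained model from build_mha_model above is reused as a feature extractor.
# Feature layer choice and classifier hyperparameters are illustrative assumptions.
import tensorflow as tf
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier


def extract_features(trained_model, x):
    # Sub-model that stops at the 64-d fully connected layer
    # (layers[-3]: Dense(64) -> Dropout -> softmax in the sketch above).
    feature_model = tf.keras.Model(trained_model.input, trained_model.layers[-3].output)
    return feature_model.predict(x, verbose=0)


def fit_backend_classifiers(train_feats, train_labels):
    # Replace the softmax layer with conventional classifiers on the deep features.
    return {
        "SVM": SVC(kernel="rbf", C=1.0).fit(train_feats, train_labels),
        "LR": LogisticRegression(max_iter=1000).fit(train_feats, train_labels),
        "KNN": KNeighborsClassifier(n_neighbors=5).fit(train_feats, train_labels),
    }
```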
Tab. 4 Performance comparison of MHA-SVM for 4 emotional categories on IEMOCAP dataset

| Emotion category | Precision/% | Recall/% | F1 score/% |
| --- | --- | --- | --- |
| Angry | 77.3 | 68.2 | 72.4 |
| Sad | 67.2 | 84.5 | 74.9 |
| Happy | 71.3 | 69.6 | 70.4 |
| Neutral | 72.4 | 68.2 | 70.2 |
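The per-class precision, recall and F1 scores in Table 4 can be obtained directly from the predicted and true labels; a small illustration with scikit-learn follows, using toy placeholder arrays rather than data from the paper.

```python
# Illustration only: per-class precision/recall/F1 as reported in Tab. 4.
# The label arrays below are toy placeholders, not results from the paper.
from sklearn.metrics import classification_report

emotions = ["angry", "sad", "happy", "neutral"]
y_true = [0, 1, 2, 3, 1, 2, 3, 0]   # ground-truth class indices
y_pred = [0, 1, 2, 3, 2, 2, 3, 1]   # classifier predictions
print(classification_report(y_true, y_pred, target_names=emotions, digits=3))
```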
Tab. 5 Accuracy comparison of 7 models on IEMOCAP dataset

| Model | Input features | Accuracy/% |
| --- | --- | --- |
| SVM [ ] | LLDs | 57.5 |
| SVM tree [ ] | Raw audio sequence | 60.9 |
| CNN-BLSTM [ ] | Raw audio sequence | 61.0 |
| ACRNN [ ] | Mel-spectrogram | 64.7 |
| A-BLSTM [ ] | Mel-spectrogram | 66.5 |
| MHA | Raw audio sequence | 67.7 |
| MHA-SVM | Raw audio sequence | 69.6 |
1 | SALAMON J, BELLO J P. Deep convolutional neural networks and data augmentation for environmental sound classification[J]. IEEE Signal Processing Letters, 2017, 24(3): 279-283. 10.1109/lsp.2017.2657381 |
2 | LIM W, JANG D, LEE T. Speech emotion recognition using convolutional and recurrent neural networks[C]// Proceedings of the 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Piscataway: IEEE, 2016: 1-4. 10.1109/apsipa.2016.7820699 |
3 | HINTON G, DENG L, YU D, et al. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups[J]. IEEE Signal Processing Magazine, 2012, 29(6): 82-97. 10.1109/msp.2012.2205597 |
4 | SCHEDL M, GÓMEZ E, URBANO J. Music information retrieval: recent developments and applications[J]. Foundations and Trends in Information Retrieval, 2014, 8(2/3): 127-261. 10.1561/1500000042 |
5 | MAO Q R, DONG M, HUANG Z W, et al. Learning salient features for speech emotion recognition using convolutional neural networks[J]. IEEE Transactions on Multimedia, 2014, 16(8): 2203-2213. 10.1109/tmm.2014.2360798 |
6 | ISSA D, DEMIRCI M F, YAZICI A. Speech emotion recognition with deep convolutional neural networks[J]. Biomedical Signal Processing and Control, 2020, 59: No.101894. 10.1016/j.bspc.2020.101894 |
7 | XIE Y, LIANG R Y, LIANG Z L, et al. Speech emotion classification using attention-based LSTM[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(11): 1675-1685. 10.1109/taslp.2019.2925934 |
8 | LYU H L, HU W P. Research on speech emotion recognition based on end-to-end deep neural network[J]. Journal of Guangxi Normal University (Natural Science Edition), 2021, 39(3): 20-26. |
9 | LATIF S, RANA R, KHALIFA S, et al. Direct modelling of speech emotion from raw speech[EB/OL]. (2020-07-28) [2021-01-25]. 10.21437/interspeech.2019-3252 |
10 | CHEN M Y, HE X J, YANG J, et al. 3-D convolutional recurrent neural networks with attention model for speech emotion recognition[J]. IEEE Signal Processing Letters, 2018, 25(10): 1440-1444. 10.1109/lsp.2018.2860246 |
11 | ZHAO Z P, BAO Z T, ZHAO Y Q, et al. Exploring deep spectrum representations via attention-based recurrent and convolutional neural networks for speech emotion recognition[J]. IEEE Access, 2019, 7: 97515-97525. 10.1109/ACCESS.2019.2928625 |
12 | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 6000-6010. |
13 | RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training[EB/OL]. [2021-01-25]. |
14 | DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg, PA: Association for Computational Linguistics, 2019: 4171-4186. 10.18653/v1/n19-1423 |
15 | KARITA S, SOPLIN N E Y, WATANABE S, et al. Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration[C]// Proceedings of the 20th Annual Conference of the International Speech Communication Association. [S.l.]: ISCA, 2019: 1408-1412. 10.21437/interspeech.2019-1938 |
16 | CHEN C, RYAD C, XING Y, et al. Research on speech emotion recognition based on improved GWO optimized SVM[J]. Computer Engineering and Applications, 2018, 54(16): 113-118. 10.3778/j.issn.1002-8331.1704-0361 |
17 | YU H, YAN B C. Speech emotion recognition based on CTC-RNN[J]. Chinese Journal of Electron Devices, 2020, 43(4): 934-937. 10.3969/j.issn.1005-9490.2020.04.043 |
18 | YOON S, BYUN S, JUNG K. Multimodal speech emotion recognition using audio and text[C]// Proceedings of the 2018 IEEE Spoken Language Technology Workshop. Piscataway: IEEE, 2018: 112-118. 10.1109/slt.2018.8639583 |
19 | CHO J, PAPPAGARI R, KULKARNI P, et al. Deep neural networks for emotion recognition combining audio and transcripts[C]// Proceedings of the 19th Annual Conference of the International Speech Communication Association. [S.l.]: ISCA, 2018: 247-251. 10.21437/interspeech.2018-2466 |
20 | ALDENEH Z, PROVOST E M. Using regional saliency for speech emotion recognition[C]// Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2017: 2741-2745. 10.1109/icassp.2017.7952655 |
21 | WAN M T, McAULEY J. Item recommendation on monotonic behavior chains[C]// Proceedings of the 12th ACM Conference on Recommender Systems. New York: ACM, 2018: 86-94. 10.1145/3240323.3240369 |
22 | XIA Q L, JIANG P, SUN F, et al. Modeling consumer buying decision for recommendation based on multi-task deep learning[C]// Proceedings of the 27th ACM International Conference on Information and Knowledge Management. New York: ACM, 2018: 1703-1706. 10.1145/3269206.3269285 |
23 | CHERNYKH V, PRIKHODKO P. Emotion recognition from speech with recurrent neural networks[EB/OL]. (2018-07-05) [2021-01-25]. |
24 | TIAN L M, MOORE J D, LAI C. Emotion recognition in spontaneous and acted dialogues[C]// Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction. Piscataway: IEEE, 2015: 698-704. 10.1109/acii.2015.7344645 |
25 | ROZGIĆ V, ANANTHAKRISHNAN S, SALEEM S, et al. Ensemble of SVM trees for multimodal emotion recognition[C]// Proceedings of the 2012 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Piscataway: IEEE, 2012: 1-4. |