Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (6): 1869-1875. DOI: 10.11772/j.issn.1001-9081.2021040578
Special Issue: Artificial Intelligence
• Artificial intelligence •
Lei YANG1, Hongdong ZHAO1, Kuaikuai YU2
Received: 2021-04-14
Revised: 2021-07-19
Accepted: 2021-07-23
Online: 2022-06-22
Published: 2022-06-10
Contact: Hongdong ZHAO
About author: YANG Lei, born in 1978 in Dunhua, Jilin, is a Ph. D. candidate and CCF member. His research interests include intelligent information processing.
Lei YANG, Hongdong ZHAO, Kuaikuai YU. End-to-end speech emotion recognition based on multi-head attention[J]. Journal of Computer Applications, 2022, 42(6): 1869-1875.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2021040578
Tab. 1 IEMOCAP dataset description of different categories

| No. | Emotion category | Number |
| --- | --- | --- |
| 0 | Angry | 1 103 |
| 1 | Sad | 1 084 |
| 2 | Happy | 1 636 |
| 3 | Neutral | 1 708 |
|  | Total | 5 531 |
Tab. 2 Parameters of MHA

| Layer type | Output size | Kernel size / stride |
| --- | --- | --- |
| Input layer | 277, 160, 1 |  |
| Convolutional layer 1 | 277, 128, 32 | 1 × 33 / 1 |
| Convolutional layer 2 | 277, 128, 1 | 1 × 1 / 1 |
| Transformer layer | 277, 128 |  |
| Global average pooling layer | 128 |  |
| Fully connected layer | 64 |  |
| Dropout | 64 |  |
| Softmax layer | 4 |  |
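Table 2 lists only the layer types, output sizes and kernel/stride settings. The sketch below shows one way such a stack could be assembled in tf.keras; the feed-forward width, activation functions, normalization placement and dropout rate are illustrative assumptions not specified in the table.

```python
# Minimal sketch of the MHA network in Tab. 2, assuming tf.keras.
# Only the shapes and kernel/stride values come from the table; the feed-forward
# width, activations, normalization placement and dropout rate are assumptions.
import tensorflow as tf
from tensorflow.keras import layers


def build_mha_model(num_heads=8, num_layers=1, num_classes=4):
    inp = tf.keras.Input(shape=(277, 160, 1))                            # input layer: 277,160,1
    x = layers.Conv2D(32, (1, 33), strides=1, activation="relu")(inp)    # conv layer 1: 277,128,32
    x = layers.Conv2D(1, (1, 1), strides=1, activation="relu")(x)        # conv layer 2: 277,128,1
    x = layers.Reshape((277, 128))(x)                                    # 277 frames x 128 features
    for _ in range(num_layers):                                          # Transformer layer(s): 277,128
        attn = layers.MultiHeadAttention(num_heads=num_heads,
                                         key_dim=128 // num_heads)(x, x)
        x = layers.LayerNormalization()(x + attn)                        # residual + layer norm (assumed)
        ff = layers.Dense(128, activation="relu")(x)                     # feed-forward width assumed
        x = layers.LayerNormalization()(x + ff)
    x = layers.GlobalAveragePooling1D()(x)                               # global average pooling: 128
    x = layers.Dense(64, activation="relu")(x)                           # fully connected layer: 64
    x = layers.Dropout(0.5)(x)                                           # dropout rate assumed
    out = layers.Dense(num_classes, activation="softmax")(x)             # softmax layer: 4
    return tf.keras.Model(inp, out)
```

With 8 heads and key_dim = 16, the attention output keeps the 128-dimensional frame representation, which matches the Transformer layer output size given in the table.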
Tab. 3 Comparison of recognition accuracy of models under different numbers of heads and layers (accuracy/%)

| Heads | Layers | MHA | MHA-SVM | MHA-LR | MHA-KNN |
| --- | --- | --- | --- | --- | --- |
| 2 | 1 | 65.1 | 67.3 | 65.8 | 66.0 |
| 2 | 2 | 63.7 | 65.2 | 63.6 | 63.0 |
| 4 | 1 | 67.5 | 68.8 | 67.0 | 67.1 |
| 4 | 2 | 66.1 | 67.3 | 65.8 | 64.5 |
| 8 | 1 | 67.7 | 69.6 | 66.9 | 67.3 |
| 8 | 2 | 67.1 | 69.0 | 67.2 | 66.9 |
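Table 3 compares the end-to-end MHA model with three variants (MHA-SVM, MHA-LR, MHA-KNN) in which a conventional classifier replaces the softmax output. The sketch below shows one plausible way to reuse the model above as a feature extractor; the choice of the 64-dimensional fully connected layer as the feature point and the classifier hyperparameters are assumptions, not values taken from the paper.

```python
# Minimal sketch of the MHA-SVM / MHA-LR / MHA-KNN back-ends in Tab. 3, assuming
# the trained model from build_mha_model above is reused as a feature extractor.
# Feature layer choice and classifier hyperparameters are illustrative assumptions.
import tensorflow as tf
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier


def extract_features(trained_model, x):
    # Sub-model that stops at the 64-d fully connected layer
    # (layers[-3]: Dense(64) -> Dropout -> softmax in the sketch above).
    feature_model = tf.keras.Model(trained_model.input, trained_model.layers[-3].output)
    return feature_model.predict(x, verbose=0)


def fit_backend_classifiers(train_feats, train_labels):
    # Replace the softmax layer with conventional classifiers on the deep features.
    return {
        "SVM": SVC(kernel="rbf", C=1.0).fit(train_feats, train_labels),
        "LR": LogisticRegression(max_iter=1000).fit(train_feats, train_labels),
        "KNN": KNeighborsClassifier(n_neighbors=5).fit(train_feats, train_labels),
    }
```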
Tab. 4 Performance comparison of MHA-SVM for 4 emotional categories on IEMOCAP dataset

| Emotion category | Precision/% | Recall/% | F1 score/% |
| --- | --- | --- | --- |
| Angry | 77.3 | 68.2 | 72.4 |
| Sad | 67.2 | 84.5 | 74.9 |
| Happy | 71.3 | 69.6 | 70.4 |
| Neutral | 72.4 | 68.2 | 70.2 |
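The per-class precision, recall and F1 scores in Table 4 can be obtained directly from the predicted and true labels; a small illustration with scikit-learn follows, using toy placeholder arrays rather than data from the paper.

```python
# Illustration only: per-class precision/recall/F1 as reported in Tab. 4.
# The label arrays below are toy placeholders, not results from the paper.
from sklearn.metrics import classification_report

emotions = ["angry", "sad", "happy", "neutral"]
y_true = [0, 1, 2, 3, 1, 2, 3, 0]   # ground-truth class indices
y_pred = [0, 1, 2, 3, 2, 2, 3, 1]   # classifier predictions
print(classification_report(y_true, y_pred, target_names=emotions, digits=3))
```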
Tab. 5 Accuracy comparison of 7 models on IEMOCAP dataset

| Model | Input features | Accuracy/% |
| --- | --- | --- |
| SVM [ ] | LLDs | 57.5 |
| SVM tree [ ] | Raw audio sequence | 60.9 |
| CNN-BLSTM [ ] | Raw audio sequence | 61.0 |
| ACRNN [ ] | Mel-spectrogram | 64.7 |
| A-BLSTM [ ] | Mel-spectrogram | 66.5 |
| MHA | Raw audio sequence | 67.7 |
| MHA-SVM | Raw audio sequence | 69.6 |
1 | SALAMON J, BELLO J P. Deep convolutional neural networks and data augmentation for environmental sound classification[J]. IEEE Signal Processing Letters, 2017, 24(3): 279-283. 10.1109/lsp.2017.2657381 |
2 | LIM W, JANG D, LEE T. Speech emotion recognition using convolutional and recurrent neural networks[C]// Proceedings of the 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Piscataway: IEEE, 2016: 1-4. 10.1109/apsipa.2016.7820699 |
3 | HINTON G, DENG L, YU D, et al. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups[J]. IEEE Signal Processing Magazine, 2012, 29(6): 82-97. 10.1109/msp.2012.2205597 |
4 | SCHEDL M, GÓMEZ E, URBANO J. Music information retrieval: recent developments and applications[J]. Foundations and Trends in Information Retrieval, 2014, 8(2/3): 127-261. 10.1561/1500000042 |
5 | MAO Q R, DONG M, HUANG Z W, et al. Learning salient features for speech emotion recognition using convolutional neural networks[J]. IEEE Transactions on Multimedia, 2014, 16(8): 2203-2213. 10.1109/tmm.2014.2360798 |
6 | ISSA D, DEMIRCI M F, YAZICI A. Speech emotion recognition with deep convolutional neural networks[J]. Biomedical Signal Processing and Control, 2020, 59: No.101894. 10.1016/j.bspc.2020.101894 |
7 | XIE Y, LIANG R Y, LIANG Z L, et al. Speech emotion classification using attention-based LSTM[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(11): 1675-1685. 10.1109/taslp.2019.2925934 |
8 | LYU H L, HU W P. Research on speech emotion recognition based on end-to-end deep neural network[J]. Journal of Guangxi Normal University (Natural Science Edition), 2021, 39(3): 20-26. |
9 | LATIF S, RANA R, KHALIFA S, et al. Direct modelling of speech emotion from raw speech[EB/OL]. (2020-07-28) [2021-01-25]. 10.21437/interspeech.2019-3252 |
10 | CHEN M Y, HE X J, YANG J, et al. 3-D convolutional recurrent neural networks with attention model for speech emotion recognition[J]. IEEE Signal Processing Letters, 2018, 25(10): 1440-1444. 10.1109/lsp.2018.2860246 |
11 | ZHAO Z P, BAO Z T, ZHAO Y Q, et al. Exploring deep spectrum representations via attention-based recurrent and convolutional neural networks for speech emotion recognition[J]. IEEE Access, 2019, 7: 97515-97525. 10.1109/ACCESS.2019.2928625 |
12 | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 6000-6010. |
13 | RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training[EB/OL]. [2021-01-25]. |
14 | DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg, PA: Association for Computational Linguistics, 2019: 4171-4186. 10.18653/v1/n19-1423 |
15 | KARITA S, SOPLIN N E Y, WATANABE S, et al. Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration[C]// Proceedings of the 20th Annual Conference of the International Speech Communication Association. [S.l.]: ISCA, 2019: 1408-1412. 10.21437/interspeech.2019-1938 |
16 | CHEN C, RYAD C, XING Y, et al. Research on speech emotion recognition based on improved GWO optimized SVM[J]. Computer Engineering and Applications, 2018, 54(16): 113-118. 10.3778/j.issn.1002-8331.1704-0361 |
17 | YU H, YAN B C. Speech emotion recognition based on CTC-RNN[J]. Chinese Journal of Electron Devices, 2020, 43(4): 934-937. 10.3969/j.issn.1005-9490.2020.04.043 |
18 | YOON S, BYUN S, JUNG K. Multimodal speech emotion recognition using audio and text[C]// Proceedings of the 2018 IEEE Spoken Language Technology Workshop. Piscataway: IEEE, 2018: 112-118. 10.1109/slt.2018.8639583 |
19 | CHO J, PAPPAGARI R, KULKARNI P, et al. Deep neural networks for emotion recognition combining audio and transcripts[C]// Proceedings of the 19th Annual Conference of the International Speech Communication Association. [S.l.]: ISCA, 2018: 247-251. 10.21437/interspeech.2018-2466 |
20 | ALDENEH Z, PROVOST E M. Using regional saliency for speech emotion recognition[C]// Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2017: 2741-2745. 10.1109/icassp.2017.7952655 |
21 | WAN M T, McAULEY J. Item recommendation on monotonic behavior chains[C]// Proceedings of the 12th ACM Conference on Recommender Systems. New York: ACM, 2018: 86-94. 10.1145/3240323.3240369 |
22 | XIA Q L, JIANG P, SUN F, et al. Modeling consumer buying decision for recommendation based on multi-task deep learning[C]// Proceedings of the 27th ACM International Conference on Information and Knowledge Management. New York: ACM, 2018: 1703-1706. 10.1145/3269206.3269285 |
23 | CHERNYKH V, PRIKHODKO P. Emotion recognition from speech with recurrent neural networks[EB/OL]. (2018-07-05) [2021-01-25]. |
24 | TIAN L M, MOORE J D, LAI C. Emotion recognition in spontaneous and acted dialogues[C]// Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction. Piscataway: IEEE, 2015: 698-704. 10.1109/acii.2015.7344645 |
25 | ROZGIĆ V, ANANTHAKRISHNAN S, SALEEM S, et al. Ensemble of SVM trees for multimodal emotion recognition[C]// Proceedings of the 2012 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Piscataway: IEEE, 2012: 1-4. |