Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (6): 1869-1875. DOI: 10.11772/j.issn.1001-9081.2021040578
Special topic: Artificial Intelligence
End-to-end speech emotion recognition based on multi-head attention
Lei YANG, Hongdong ZHAO, Kuaikuai YU
Received:
2021-04-14
Revised:
2021-07-19
Accepted:
2021-07-23
Online:
2022-06-22
Published:
2022-06-10
Contact:
Hongdong ZHAO
About author:
YANG Lei, born in 1978 in Dunhua, Jilin, is a Ph.D. candidate and CCF member. His research interests include intelligent information processing.
Abstract:
Speech emotion datasets are small in scale yet high in dimensionality. Traditional recurrent neural networks (RNN) suffer from vanishing long-range dependencies, while convolutional neural networks (CNN) attend mainly to local information, so the latent relationships among frames of an input sequence are not fully exploited. To address these problems, a neural network based on multi-head attention (MHA) and a support vector machine (SVM), named MHA-SVM, was proposed for speech emotion recognition (SER). First, raw audio data were fed into the MHA network to train its parameters and obtain the MHA classification results. Then, the raw audio was passed again through the pretrained MHA to extract features. Finally, after a fully connected layer, the extracted features were classified by the SVM to produce the MHA-SVM results. After a thorough evaluation of how the number of heads and layers in the MHA module affects performance, MHA-SVM reached a best recognition accuracy of 69.6% on the IEMOCAP dataset. The experimental results show that, compared with RNN- and CNN-based models, the end-to-end model based on the MHA mechanism is better suited to the SER task.
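To illustrate the attention mechanism at the core of this pipeline, below is a minimal NumPy sketch of one multi-head self-attention layer. The sequence length (277 frames) and model width (128) echo Table 2, but the weight matrices here are random stand-ins, not the trained parameters of the paper's model.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """One multi-head self-attention layer over a (frames, d_model) sequence."""
    t, d = x.shape
    d_head = d // n_heads
    q, k, v = x @ w_q, x @ w_k, x @ w_v                    # (t, d) each
    # split the model dimension into heads: (n_heads, t, d_head)
    split = lambda m: m.reshape(t, n_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)    # (n_heads, t, t)
    out = softmax(scores) @ v                              # (n_heads, t, d_head)
    out = out.transpose(1, 0, 2).reshape(t, d)             # concatenate heads
    return out @ w_o

# toy dimensions echoing Table 2: 277 frames, d_model = 128, 8 heads
rng = np.random.default_rng(0)
t, d, h = 277, 128, 8
x = rng.standard_normal((t, d))
ws = [rng.standard_normal((d, d)) * 0.02 for _ in range(4)]
y = multi_head_attention(x, *ws, n_heads=h)
print(y.shape)  # (277, 128): attention preserves the sequence shape
```

Because every frame attends to every other frame in one step, the layer captures long-range relationships that an RNN must propagate through time and a CNN can only reach through stacked local kernels.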
Lei YANG, Hongdong ZHAO, Kuaikuai YU. End-to-end speech emotion recognition based on multi-head attention[J]. Journal of Computer Applications, 2022, 42(6): 1869-1875.
| No. | Emotion category | Count |
| --- | --- | --- |
| 0 | Angry | 1 103 |
| 1 | Sad | 1 084 |
| 2 | Happy | 1 636 |
| 3 | Neutral | 1 708 |
| | Total | 5 531 |
Tab. 1 IEMOCAP dataset description of different categories
| Layer type | Output size | Kernel size / stride |
| --- | --- | --- |
| Input | 277, 160, 1 | |
| Conv 1 | 277, 128, 32 | 1 × 33 / 1 |
| Conv 2 | 277, 128, 1 | 1 × 1 / 1 |
| Transformer | 277, 128 | |
| Global average pooling | 128 | |
| Fully connected | 64 | |
| Dropout | 64 | |
| Softmax | 4 | |
Tab. 2 Parameters of MHA
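The front-end sizes in Table 2 follow from the standard output-length formula for an unpadded ("valid") convolution; a quick check, with the 160-point frame width and 1 × 33 kernel taken from the table:

```python
def conv_out_len(length, kernel, stride=1):
    """Output length of a 1-D valid convolution: (L - k) // s + 1."""
    return (length - kernel) // stride + 1

# Conv 1: 1 x 33 kernel, stride 1, along the 160-point frame axis
print(conv_out_len(160, 33))  # 128, matching the 277, 128, 32 output in Table 2
# Conv 2: a 1 x 1 kernel changes only the channel count, not the length
print(conv_out_len(128, 1))   # 128
```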
| Heads | Layers | MHA | MHA-SVM | MHA-LR | MHA-KNN |
| --- | --- | --- | --- | --- | --- |
| 2 | 1 | 65.1 | 67.3 | 65.8 | 66.0 |
| 2 | 2 | 63.7 | 65.2 | 63.6 | 63.0 |
| 4 | 1 | 67.5 | 68.8 | 67.0 | 67.1 |
| 4 | 2 | 66.1 | 67.3 | 65.8 | 64.5 |
| 8 | 1 | 67.7 | 69.6 | 66.9 | 67.3 |
| 8 | 2 | 67.1 | 69.0 | 67.2 | 66.9 |
Tab. 3 Comparison of recognition accuracy of models under different numbers of heads and layers (unit: %)
| Emotion category | Precision | Recall | F1 score |
| --- | --- | --- | --- |
| Angry | 77.3 | 68.2 | 72.4 |
| Sad | 67.2 | 84.5 | 74.9 |
| Happy | 71.3 | 69.6 | 70.4 |
| Neutral | 72.4 | 68.2 | 70.2 |
Tab. 4 Performance comparison of MHA-SVM for 4 emotional categories on IEMOCAP dataset (unit: %)
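The F1 scores in Table 4 are the harmonic mean of precision and recall; a quick sanity check with the values copied from the table (small last-digit discrepancies are expected, since the published F1 was presumably computed from unrounded precision and recall):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (all values in %)."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs from Table 4
table4 = {"angry": (77.3, 68.2), "sad": (67.2, 84.5),
          "happy": (71.3, 69.6), "neutral": (72.4, 68.2)}
for emotion, (p, r) in table4.items():
    print(emotion, round(f1_score(p, r), 1))
```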
| Model | Input feature | Accuracy/% |
| --- | --- | --- |
| SVM [ | LLDs | 57.5 |
| SVM tree [ | Raw audio sequence | 60.9 |
| CNN-BLSTM [ | Raw audio sequence | 61.0 |
| ACRNN [ | Mel-spectrogram | 64.7 |
| A-BLSTM [ | Mel-spectrogram | 66.5 |
| MHA | Raw audio sequence | 67.7 |
| MHA-SVM | Raw audio sequence | 69.6 |
Tab. 5 Accuracy comparison of 7 models on IEMOCAP dataset
[1] SALAMON J, BELLO J P. Deep convolutional neural networks and data augmentation for environmental sound classification[J]. IEEE Signal Processing Letters, 2017, 24(3): 279-283. doi: 10.1109/lsp.2017.2657381
[2] LIM W, JANG D, LEE T. Speech emotion recognition using convolutional and recurrent neural networks[C]// Proceedings of the 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Piscataway: IEEE, 2016: 1-4. doi: 10.1109/apsipa.2016.7820699
[3] HINTON G, DENG L, YU D, et al. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups[J]. IEEE Signal Processing Magazine, 2012, 29(6): 82-97. doi: 10.1109/msp.2012.2205597
[4] SCHEDL M, GÓMEZ E, URBANO J. Music information retrieval: recent developments and applications[J]. Foundations and Trends in Information Retrieval, 2014, 8(2/3): 127-261. doi: 10.1561/1500000042
[5] MAO Q R, DONG M, HUANG Z W, et al. Learning salient features for speech emotion recognition using convolutional neural networks[J]. IEEE Transactions on Multimedia, 2014, 16(8): 2203-2213. doi: 10.1109/tmm.2014.2360798
[6] ISSA D, DEMIRCI M F, YAZICI A. Speech emotion recognition with deep convolutional neural networks[J]. Biomedical Signal Processing and Control, 2020, 59: No.101894. doi: 10.1016/j.bspc.2020.101894
[7] XIE Y, LIANG R Y, LIANG Z L, et al. Speech emotion classification using attention-based LSTM[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(11): 1675-1685. doi: 10.1109/taslp.2019.2925934
[8] LYU H L, HU W P. Research on speech emotion recognition based on end-to-end deep neural network[J]. Journal of Guangxi Normal University (Natural Science Edition), 2021, 39(3): 20-26.
[9] LATIF S, RANA R, KHALIFA S, et al. Direct modelling of speech emotion from raw speech[EB/OL]. (2020-07-28) [2021-01-25]. doi: 10.21437/interspeech.2019-3252
[10] CHEN M Y, HE X J, YANG J, et al. 3-D convolutional recurrent neural networks with attention model for speech emotion recognition[J]. IEEE Signal Processing Letters, 2018, 25(10): 1440-1444. doi: 10.1109/lsp.2018.2860246
[11] ZHAO Z P, BAO Z T, ZHAO Y Q, et al. Exploring deep spectrum representations via attention-based recurrent and convolutional neural networks for speech emotion recognition[J]. IEEE Access, 2019, 7: 97515-97525. doi: 10.1109/ACCESS.2019.2928625
[12] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 6000-6010.
[13] RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training[EB/OL]. [2021-01-25].
[14] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg, PA: Association for Computational Linguistics, 2019: 4171-4186. doi: 10.18653/v1/n19-1423
[15] KARITA S, SOPLIN N E Y, WATANABE S, et al. Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration[C]// Proceedings of the 20th Annual Conference of the International Speech Communication Association. [S.l.]: ISCA, 2019: 1408-1412. doi: 10.21437/interspeech.2019-1938
[16] CHEN C, RYAD C, XING Y, et al. Research on speech emotion recognition based on improved GWO optimized SVM[J]. Computer Engineering and Applications, 2018, 54(16): 113-118. doi: 10.3778/j.issn.1002-8331.1704-0361
[17] YU H, YAN B C. Speech emotion recognition based on CTC-RNN[J]. Chinese Journal of Electron Devices, 2020, 43(4): 934-937. doi: 10.3969/j.issn.1005-9490.2020.04.043
[18] YOON S, BYUN S, JUNG K. Multimodal speech emotion recognition using audio and text[C]// Proceedings of the 2018 IEEE Spoken Language Technology Workshop. Piscataway: IEEE, 2018: 112-118. doi: 10.1109/slt.2018.8639583
[19] CHO J, PAPPAGARI R, KULKARNI P, et al. Deep neural networks for emotion recognition combining audio and transcripts[C]// Proceedings of the 19th Annual Conference of the International Speech Communication Association. [S.l.]: ISCA, 2018: 247-251. doi: 10.21437/interspeech.2018-2466
[20] ALDENEH Z, PROVOST E M. Using regional saliency for speech emotion recognition[C]// Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2017: 2741-2745. doi: 10.1109/icassp.2017.7952655
[21] WAN M T, McAULEY J. Item recommendation on monotonic behavior chains[C]// Proceedings of the 12th ACM Conference on Recommender Systems. New York: ACM, 2018: 86-94. doi: 10.1145/3240323.3240369
[22] XIA Q L, JIANG P, SUN F, et al. Modeling consumer buying decision for recommendation based on multi-task deep learning[C]// Proceedings of the 27th ACM International Conference on Information and Knowledge Management. New York: ACM, 2018: 1703-1706. doi: 10.1145/3269206.3269285
[23] CHERNYKH V, PRIKHODKO P. Emotion recognition from speech with recurrent neural networks[EB/OL]. (2018-07-05) [2021-01-25].
[24] TIAN L M, MOORE J D, LAI C. Emotion recognition in spontaneous and acted dialogues[C]// Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction. Piscataway: IEEE, 2015: 698-704. doi: 10.1109/acii.2015.7344645
[25] ROZGIĆ V, ANANTHAKRISHNAN S, SALEEM S, et al. Ensemble of SVM trees for multimodal emotion recognition[C]// Proceedings of the 2012 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Piscataway: IEEE, 2012: 1-4.