Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (8): 2515-2521. DOI: 10.11772/j.issn.1001-9081.2024081142
• Artificial Intelligence •
Peng PENG, Ziting CAI, Wenling LIU, Caihua CHEN, Wei ZENG, Baolai HUANG
Received:
2024-08-16
Revised:
2024-11-04
Accepted:
2024-11-12
Online:
2024-11-19
Published:
2025-08-10
Contact:
Wei ZENG
About author:
PENG Peng, born in 1987 in Weinan, Shaanxi, Ph. D., associate professor. His research interests include natural language processing and artificial intelligence.
Abstract:
To address the low accuracy and poor generalization ability of existing Speech Emotion Recognition (SER) models, a Siamese Multi-scale CNN-BiGRU network was proposed. The network builds a Siamese architecture by introducing a Multi-Scale Feature Extractor (MSFE) and a Multi-Dimensional Attention (MDA) module, and enlarges the effective amount of training data by learning from sample pairs, which improves recognition accuracy and helps the model adapt to complex real-world application scenarios. Experimental results on two public datasets, IEMOCAP and EMO-DB, show that the proposed model improves recognition precision over CNN-BiGRU by 8.28 and 7.79 percentage points, respectively. In addition, a Customer Service Speech Emotion Dataset (CSSED) was built from recordings of real customer-service conversations; on this dataset the proposed model achieves a recognition precision of 87.85%, demonstrating its good generalization ability.
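To make the described architecture concrete, the following is a minimal PyTorch sketch of one possible interpretation: an MSFE built from parallel convolutions with different kernel sizes, a BiGRU over the pooled feature maps, a simple temporal attention standing in for the MDA module, and a weight-sharing Siamese wrapper trained with a contrastive loss on sample pairs. The input shape, layer sizes, kernel set, and attention design are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of one interpretation of the architecture, assuming
# log-Mel spectrogram inputs of shape (batch, 1, n_mels, frames).
# Layer sizes, the MSFE kernel set, and the simple temporal attention
# standing in for MDA are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSFE(nn.Module):
    """Multi-scale feature extractor: parallel convolutions with different
    kernel sizes whose outputs are concatenated along the channel axis."""
    def __init__(self, in_ch, out_ch, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):
        return torch.cat([F.relu(b(x)) for b in self.branches], dim=1)

class Branch(nn.Module):
    """One Siamese branch: MSFE -> pooling -> BiGRU over time -> attention
    pooling -> fixed-size utterance embedding."""
    def __init__(self, n_mels=64, hidden=128, embed=128):
        super().__init__()
        self.msfe = MSFE(1, 16)                    # 3 scales -> 48 channels
        self.pool = nn.MaxPool2d(2)
        self.gru = nn.GRU(48 * (n_mels // 2), hidden,
                          batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)       # attention over frames
        self.proj = nn.Linear(2 * hidden, embed)

    def forward(self, x):                          # x: (B, 1, n_mels, T)
        h = self.pool(self.msfe(x))                # (B, 48, n_mels/2, T/2)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)
        h, _ = self.gru(h)                         # (B, T/2, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)     # frame weights
        return self.proj((w * h).sum(dim=1))       # (B, embed)

class SiameseSER(nn.Module):
    """Weight-sharing pair model: both inputs pass through the same branch."""
    def __init__(self):
        super().__init__()
        self.branch = Branch()

    def forward(self, xa, xb):
        return self.branch(xa), self.branch(xb)

def contrastive_loss(ea, eb, same, margin=1.0):
    """same = 1.0 if the pair shares an emotion label, else 0.0."""
    d = F.pairwise_distance(ea, eb)
    return (same * d.pow(2) + (1.0 - same) * F.relu(margin - d).pow(2)).mean()
```

Under this objective, same-emotion utterances are pulled together and different-emotion utterances are pushed at least a margin apart in the embedding space; a lightweight classifier over the learned embeddings would then assign the final emotion label.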
CLC Number:
Peng PENG, Ziting CAI, Wenling LIU, Caihua CHEN, Wei ZENG, Baolai HUANG. Speech emotion recognition method based on hybrid Siamese network with CNN and bidirectional GRU[J]. Journal of Computer Applications, 2025, 45(8): 2515-2521.
Tab. 1 IEMOCAP English speech emotion database after data expansion

| Emotion category | Training samples | Validation samples | Test samples | Total samples |
| --- | --- | --- | --- | --- |
| Neutral | 5 124 | 808 | 900 | 6 832 |
| Sad | 3 321 | 545 | 562 | 4 428 |
| Angry | 3 362 | 524 | 526 | 4 412 |
| Happy | 5 021 | 814 | 709 | 6 544 |
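The pair-based training that the abstract credits for the enlarged training volume can draw pairs from the expanded samples in Tab. 1: two utterances form a positive pair when they share an emotion class and a negative pair otherwise, so N single samples can yield up to N(N-1)/2 distinct pairs. A hypothetical sketch of this step (the function name and the random sampling strategy are assumptions, not the paper's procedure):

```python
# Hypothetical construction of labeled training pairs for the Siamese model:
# label 1 for same-emotion pairs, 0 for different-emotion pairs.
import random

def make_pairs(features, labels, n_pairs, seed=0):
    rng = random.Random(seed)
    idx = range(len(features))
    pairs = []
    for _ in range(n_pairs):
        i, j = rng.sample(idx, 2)  # two distinct utterances
        pairs.append((features[i], features[j], int(labels[i] == labels[j])))
    return pairs
```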
Tab. 2 Comparison of recognition performance of different models on the noisy IEMOCAP dataset (%)

| Model | Accuracy | Precision | Recall | F1 score |
| --- | --- | --- | --- | --- |
| CNN-BiGRU | 78.91 | 78.85 | 78.04 | 78.44 |
| Multi-scale CNN-BiGRU | 80.82 | 80.76 | 81.03 | 80.89 |
| Multi-scale CNN-BiGRU + MDA | 83.62 | 83.58 | 83.65 | 83.61 |
| VQ-MAE-S | 84.11 | 84.09 | 84.02 | 84.43 |
| Siamese Multi-scale CNN-BiGRU | 87.38 | 87.13 | 86.82 | 86.97 |
Tab. 3 Comparison of recognition performance of different models on the noisy EMO-DB dataset (%)

| Model | Accuracy | Precision | Recall | F1 score |
| --- | --- | --- | --- | --- |
| CNN-BiGRU | 76.25 | 76.13 | 76.42 | 76.39 |
| Multi-scale CNN-BiGRU | 79.02 | 79.25 | 79.12 | 79.19 |
| Multi-scale CNN-BiGRU + MDA | 81.29 | 81.19 | 81.23 | 81.21 |
| VQ-MAE-S | 84.70 | 84.46 | 84.29 | 84.83 |
| Siamese Multi-scale CNN-BiGRU | 84.08 | 83.92 | 83.82 | 83.85 |
Tab. 4 Comparison of recognition performance of different models on CSSED (%)

| Model | Accuracy | Precision | Recall | F1 score |
| --- | --- | --- | --- | --- |
| CNN-BiGRU | 79.98 | 79.92 | 78.31 | 78.69 |
| Multi-scale CNN-BiGRU | 82.73 | 82.69 | 82.14 | 82.49 |
| Multi-scale CNN-BiGRU + MDA | 85.33 | 85.21 | 85.30 | 85.27 |
| Multi-scale CNN-BiGRU + MSFE | 86.08 | 85.76 | 85.92 | 85.86 |
| Siamese Multi-scale CNN-BiGRU | 87.81 | 87.85 | 87.81 | 87.83 |
Tab. 5 Accuracy of different models on CSSED without speech enhancement (%)

| Model | Positive | Neutral | Negative |
| --- | --- | --- | --- |
| ARCNN-GAP | 76.25 | 79.20 | 77.43 |
| CNN+Bi-LSTM | 77.63 | 78.03 | 80.17 |
| MLANet | 77.32 | 79.89 | 80.11 |
| Siamese LSTM | 82.73 | 81.39 | 82.80 |
| Siamese Multi-scale CNN-BiGRU | 86.27 | 84.63 | 87.60 |
Tab. 6 Accuracy of different models on CSSED with speech enhancement (%)

| Model | Positive | Neutral | Negative |
| --- | --- | --- | --- |
| ARCNN-GAP | 78.76 | 82.03 | 79.74 |
| CNN+Bi-LSTM | 80.10 | 80.13 | 80.92 |
| MLANet | 79.89 | 81.42 | 81.26 |
| Siamese LSTM | 85.26 | 82.81 | 87.15 |
| Siamese Multi-scale CNN-BiGRU | 88.75 | 85.09 | 89.12 |
[1] ZHAO S, JIA G, YANG J, et al. Emotion recognition from multiple modalities: fundamentals and methodologies[J]. IEEE Signal Processing Magazine, 2021, 38(6): 59-73.
[2] SHEN S, GAO Y, LIU F, et al. Emotion neural transducer for fine-grained speech emotion recognition[C]// Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2024: 10111-10115.
[3] BHANGALE K, KOTHANDARAMAN M. Speech emotion recognition based on multiple acoustic features and deep convolutional neural network[J]. Electronics, 2023, 12(4): No.839.
[4] ULGEN I R, DU Z, BUSSO C, et al. Revealing emotional clusters in speaker embeddings: a contrastive learning strategy for speech emotion recognition[C]// Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2024: 12081-12085.
[5] YI L, MAK M W. Improving speech emotion recognition with adversarial data augmentation network[J]. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(1): 172-184.
[6] WANG B, WANG D. Plant leaves classification: a few-shot learning method based on Siamese network[J]. IEEE Access, 2019, 7: 151754-151763.
[7] NIU Z, ZHONG G, YU H. A review on the attention mechanism of deep learning[J]. Neurocomputing, 2021, 452: 48-62.
[8] LI D, LIU J, YANG Z, et al. Speech emotion recognition using recurrent neural networks with directional self-attention[J]. Expert Systems with Applications, 2021, 173: No.114683.
[9] XU H, ZHANG H, HAN K, et al. Learning alignment for multimodal emotion recognition from speech[C]// Proceedings of INTERSPEECH 2019. [S.l.]: International Speech Communication Association, 2019: 3569-3573.
[10] SIRIWARDHANA S, KALUARACHCHI T, BILLINGHURST M, et al. Multimodal emotion recognition with Transformer-based self supervised feature fusion[J]. IEEE Access, 2020, 8: 176274-176285.
[11] HO N H, YANG H J, KIM S H, et al. Multimodal approach of speech emotion recognition using multi-level multi-head fusion attention-based recurrent neural network[J]. IEEE Access, 2020, 8: 61672-61686.
[12] LIU K, WANG D, WU D, et al. Speech emotion recognition via multi-level attention network[J]. IEEE Signal Processing Letters, 2022, 29: 2278-2282.
[13] YANG L, ZHAO H D, YU K K. End-to-end speech emotion recognition based on multi-head attention[J]. Journal of Computer Applications, 2022, 42(6): 1869-1875.
[14] DE LOPE J, GRAÑA M. An ongoing review of speech emotion recognition[J]. Neurocomputing, 2023, 528: 1-11.
[15] LIU Z T, WU B H, LI D Y, et al. Speech emotion recognition based on selective interpolation synthetic minority over-sampling technique in small sample environment[J]. Sensors, 2020, 20(8): No.2297.
[16] CHEN S, WANG J, WANG J, et al. MDAM: multi-dimensional attention module for anomalous sound detection[C]// Proceedings of the 2023 International Conference on Neural Information Processing, CCIS 1967. Singapore: Springer, 2024: 48-60.
[17] LIU R. Convolutional Siamese network-based few-shot learning for monkeypox detection under data scarcity[C]// Proceedings of the SPIE 12611, 2nd International Conference on Biological Engineering and Medical Science. Bellingham, WA: SPIE, 2023: No.126115O.
[18] FENG K, CHASPARI T. Few-shot learning in emotion recognition of spontaneous speech using a Siamese neural network with adaptive sample pair formation[J]. IEEE Transactions on Affective Computing, 2023, 14(2): 1627-1633.
[19] TORRES L, MONTEIRO N, OLIVEIRA J, et al. Exploring a Siamese neural network architecture for one-shot drug discovery[C]// Proceedings of the IEEE 20th International Conference on Bioinformatics and Bioengineering. Piscataway: IEEE, 2020: 168-175.
[20] XU L, MA H, GUAN Y, et al. A Siamese network with node convolution for individualized predictions based on connectivity maps extracted from resting-state fMRI data[J]. IEEE Journal of Biomedical and Health Informatics, 2023, 27(11): 5418-5429.
[21] JIANG J J, LIU D W, LIU Y F, et al. Few-shot object detection algorithm based on Siamese network[J]. Journal of Computer Applications, 2023, 43(8): 2325-2329.
[22] SPERBER M, NIEHUES J, NEUBIG G, et al. Self-attentional acoustic models[C]// Proceedings of INTERSPEECH 2018. [S.l.]: International Speech Communication Association, 2018: 3723-3727.
[23] CHAN W, JAITLY N, LE Q, et al. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition[C]// Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2016: 4960-4964.
[24] WU H L. Design and implementation of speech emotion recognition algorithm based on deep learning[D]. Harbin: Heilongjiang University, 2021.
[25] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010.
[26] GAO Z, LI Z, LUO J, et al. Short text aspect-based sentiment analysis based on CNN+BiGRU[J]. Applied Sciences, 2022, 12(5): No.2707.
[27] ZHU G, FAN Y, LI F, et al. GSRNet, an adversarial training-based deep framework with multi-scale CNN and BiGRU for predicting genomic signals and regions[J]. Expert Systems with Applications, 2023, 229(Pt A): No.120439.
[28] BUSSO C, BULUT M, LEE C C, et al. IEMOCAP: interactive emotional dyadic motion capture database[J]. Language Resources and Evaluation, 2008, 42(4): 335-359.
[29] WANG Y, YUAN Y B, GUO Y, et al. Sentiment boosting model for emotion recognition in conversation text[J]. Journal of Computer Applications, 2023, 43(3): 706-712.
[30] SADOK S, LEGLAIVE S, SÉGUIER R. A vector quantized masked autoencoder for speech emotion recognition[EB/OL]. [2024-10-17].
[31] QIAN J Q, HUANG H M, ZHANG H Y. Speech emotion recognition based on ARCNN-GAP network[J]. Computer and Modernization, 2021(12): 91-95.
[32] MURUGAIYAN S, UYYALA S R. Aspect-based sentiment analysis of customer speech data using deep convolutional neural network and BiLSTM[J]. Cognitive Computation, 2023, 15(3): 914-931.
[33] DUTT A, GADER P. Wavelet multiresolution analysis based speech emotion recognition system using 1D CNN LSTM networks[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 2043-2054.
[34] NIZAMIDIN T, ZHAO L, LIANG R, et al. Siamese attention-based LSTM for speech emotion recognition[J]. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2020, E103-A(7): 937-941.