Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (8): 2515-2521.DOI: 10.11772/j.issn.1001-9081.2024081142
• Artificial intelligence •
Peng PENG, Ziting CAI, Wenling LIU, Caihua CHEN, Wei ZENG, Baolai HUANG
Received:
2024-08-16
Revised:
2024-11-04
Accepted:
2024-11-12
Online:
2024-11-19
Published:
2025-08-10
Contact:
Wei ZENG
About author:
PENG Peng, born in 1987 in Weinan, Shaanxi, Ph. D., associate professor. His research interests include natural language processing and artificial intelligence.
Peng PENG, Ziting CAI, Wenling LIU, Caihua CHEN, Wei ZENG, Baolai HUANG. Speech emotion recognition method based on hybrid Siamese network with CNN and bidirectional GRU[J]. Journal of Computer Applications, 2025, 45(8): 2515-2521.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2024081142
| Emotion category | Training set | Validation set | Test set | Total |
| --- | --- | --- | --- | --- |
| Neutral | 5 124 | 808 | 900 | 6 832 |
| Sad | 3 321 | 545 | 562 | 4 428 |
| Angry | 3 362 | 524 | 526 | 4 412 |
| Happy | 5 021 | 814 | 709 | 6 544 |

Tab. 1 Data-expanded IEMOCAP English speech emotion database
| Model | Accuracy/% | Precision/% | Recall/% | F1/% |
| --- | --- | --- | --- | --- |
| CNN-BiGRU | 78.91 | 78.85 | 78.04 | 78.44 |
| Multi-scale CNN-BiGRU | 80.82 | 80.76 | 81.03 | 80.89 |
| Multi-scale CNN-BiGRU+MDA | 83.62 | 83.58 | 83.65 | 83.61 |
| VQ-MAE-S | 84.11 | 84.09 | 84.02 | 84.43 |
| Siamese Multi-scale CNN-BiGRU | 87.38 | 87.13 | 86.82 | 86.97 |

Tab. 2 Comparison of recognition effects of different models on noisy IEMOCAP dataset
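The tables above report accuracy, precision, recall, and F1 as percentages. As a point of reference, these metrics can be computed from per-class counts as in the following minimal sketch (plain Python; macro averaging across emotion classes is an assumption here, since the paper does not restate its aggregation scheme):

```python
def macro_metrics(y_true, y_pred, labels):
    """Accuracy plus macro-averaged precision, recall and F1, all in %."""
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precs, recs, f1s = [], [], []
    for c in labels:
        # Per-class counts: true positives, false positives, false negatives
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec)
        recs.append(rec)
        f1s.append(f1)
    n = len(labels)
    return (100 * acc, 100 * sum(precs) / n, 100 * sum(recs) / n, 100 * sum(f1s) / n)
```

Equivalent results are available from `sklearn.metrics.precision_recall_fscore_support` with `average="macro"`.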
| Model | Accuracy/% | Precision/% | Recall/% | F1/% |
| --- | --- | --- | --- | --- |
| CNN-BiGRU | 76.25 | 76.13 | 76.42 | 76.39 |
| Multi-scale CNN-BiGRU | 79.02 | 79.25 | 79.12 | 79.19 |
| Multi-scale CNN-BiGRU+MDA | 81.29 | 81.19 | 81.23 | 81.21 |
| VQ-MAE-S | 84.70 | 84.46 | 84.29 | 84.83 |
| Siamese Multi-scale CNN-BiGRU | 84.08 | 83.92 | 83.82 | 83.85 |

Tab. 3 Comparison of recognition effects of different models on noisy EMO-DB dataset
| Model | Accuracy/% | Precision/% | Recall/% | F1/% |
| --- | --- | --- | --- | --- |
| CNN-BiGRU | 79.98 | 79.92 | 78.31 | 78.69 |
| Multi-scale CNN-BiGRU | 82.73 | 82.69 | 82.14 | 82.49 |
| Multi-scale CNN-BiGRU+MDA | 85.33 | 85.21 | 85.30 | 85.27 |
| Multi-scale CNN-BiGRU+MSFE | 86.08 | 85.76 | 85.92 | 85.86 |
| Siamese Multi-scale CNN-BiGRU | 87.81 | 87.85 | 87.81 | 87.83 |

Tab. 4 Comparison of recognition effects of different models on CSSED
| Model | Positive/% | Neutral/% | Negative/% |
| --- | --- | --- | --- |
| ARCNN-GAP | 76.25 | 79.20 | 77.43 |
| CNN+Bi-LSTM | 77.63 | 78.03 | 80.17 |
| MLANet | 77.32 | 79.89 | 80.11 |
| Siamese LSTM | 82.73 | 81.39 | 82.80 |
| Siamese Multi-scale CNN-BiGRU | 86.27 | 84.63 | 87.60 |

Tab. 5 Accuracy of different models on CSSED (without speech enhancement)
| Model | Positive/% | Neutral/% | Negative/% |
| --- | --- | --- | --- |
| ARCNN-GAP | 78.76 | 82.03 | 79.74 |
| CNN+Bi-LSTM | 80.10 | 80.13 | 80.92 |
| MLANet | 79.89 | 81.42 | 81.26 |
| Siamese LSTM | 85.26 | 82.81 | 87.15 |
| Siamese Multi-scale CNN-BiGRU | 88.75 | 85.09 | 89.12 |

Tab. 6 Accuracy of different models on CSSED (with speech enhancement)
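The Siamese models compared above are trained on pairs of samples, pulling same-emotion pairs together and pushing different-emotion pairs apart in the embedding space. A minimal NumPy sketch of the standard contrastive objective commonly used for such setups follows (the Hadsell-style formulation and the `margin` value are generic assumptions, not necessarily the exact loss used in this paper):

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Contrastive loss over a batch of embedding pairs.

    emb_a, emb_b: (batch, dim) embeddings from the two Siamese branches.
    same: (batch,) labels, 1 for same-emotion pairs, 0 for different pairs.
    Same pairs are penalized by squared distance; different pairs are
    pushed apart until their distance exceeds `margin`.
    """
    d = np.linalg.norm(emb_a - emb_b, axis=1)  # Euclidean distance per pair
    loss = same * d**2 + (1 - same) * np.maximum(0.0, margin - d)**2
    return loss.mean()
```

In a full pipeline, both branches would share one multi-scale CNN-BiGRU encoder, and this loss would be combined with a classification head over the emotion labels.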
[1] | ZHAO S, JIA G, YANG J, et al. Emotion recognition from multiple modalities: fundamentals and methodologies[J]. IEEE Signal Processing Magazine, 2021, 38(6): 59-73. |
[2] | SHEN S, GAO Y, LIU F, et al. Emotion neural transducer for fine-grained speech emotion recognition[C]// Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2024: 10111-10115. |
[3] | BHANGALE K, KOTHANDARAMAN M. Speech emotion recognition based on multiple acoustic features and deep convolutional neural network[J]. Electronics, 2023, 12(4): No.839. |
[4] | ULGEN I R, DU Z, BUSSO C, et al. Revealing emotional clusters in speaker embeddings: a contrastive learning strategy for speech emotion recognition[C]// Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2024: 12081-12085. |
[5] | YI L, MAK M W. Improving speech emotion recognition with adversarial data augmentation network[J]. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(1): 172-184. |
[6] | WANG B, WANG D. Plant leaves classification: a few-shot learning method based on Siamese network[J]. IEEE Access, 2019, 7: 151754-151763. |
[7] | NIU Z, ZHONG G, YU H. A review on the attention mechanism of deep learning[J]. Neurocomputing, 2021, 452: 48-62. |
[8] | LI D, LIU J, YANG Z, et al. Speech emotion recognition using recurrent neural networks with directional self-attention[J]. Expert Systems with Applications, 2021, 173: No.114683. |
[9] | XU H, ZHANG H, HAN K, et al. Learning alignment for multimodal emotion recognition from speech[C]// Proceedings of the INTERSPEECH 2019. [S.l.]: International Speech Communication Association, 2019: 3569-3573. |
[10] | SIRIWARDHANA S, KALUARACHCHI T, BILLINGHURST M, et al. Multimodal emotion recognition with Transformer-based self supervised feature fusion[J]. IEEE Access, 2020, 8: 176274-176285. |
[11] | HO N H, YANG H J, KIM S H, et al. Multimodal approach of speech emotion recognition using multi-level multi-head fusion attention-based recurrent neural network[J]. IEEE Access, 2020, 8: 61672-61686. |
[12] | LIU K, WANG D, WU D, et al. Speech emotion recognition via multi-level attention network[J]. IEEE Signal Processing Letters, 2022, 29: 2278-2282. |
[13] | YANG L, ZHAO H D, YU K K. End-to-end speech emotion recognition based on multi-head attention[J]. Journal of Computer Applications, 2022, 42(6): 1869-1875. |
[14] | DE LOPE J, GRAÑA M. An ongoing review of speech emotion recognition[J]. Neurocomputing, 2023, 528: 1-11. |
[15] | LIU Z T, WU B H, LI D Y, et al. Speech emotion recognition based on selective interpolation synthetic minority over-sampling technique in small sample environment[J]. Sensors, 2020, 20(8): No.2297. |
[16] | CHEN S, WANG J, WANG J, et al. MDAM: multi-dimensional attention module for anomalous sound detection[C]// Proceedings of the 2023 International Conference on Neural Information Processing, CCIS 1967. Singapore: Springer, 2024: 48-60. |
[17] | LIU R. Convolutional Siamese network-based few-shot learning for monkeypox detection under data scarcity[C]// Proceedings of the SPIE 12611, 2nd International Conference on Biological Engineering and Medical Science. Bellingham, WA: SPIE, 2023: No.126115O. |
[18] | FENG K, CHASPARI T. Few-shot learning in emotion recognition of spontaneous speech using a Siamese neural network with adaptive sample pair formation[J]. IEEE Transactions on Affective Computing, 2023, 14(2): 1627-1633. |
[19] | TORRES L, MONTEIRO N, OLIVEIRA J, et al. Exploring a Siamese neural network architecture for one-shot drug discovery[C]// Proceedings of the IEEE 20th International Conference on Bioinformatics and Bioengineering. Piscataway: IEEE, 2020: 168-175. |
[20] | XU L, MA H, GUAN Y, et al. A Siamese network with node convolution for individualized predictions based on connectivity Maps extracted from resting-state fMRI data[J]. IEEE Journal of Biomedical and Health Informatics, 2023, 27(11): 5418-5429. |
[21] | JIANG J J, LIU D W, LIU Y F, et al. Few-shot object detection algorithm based on Siamese network[J]. Journal of Computer Applications, 2023, 43(8): 2325-2329. |
[22] | SPERBER M, NIEHUES J, NEUBIG G, et al. Self-attentional acoustic models[C]// Proceedings of the INTERSPEECH 2018. [S.l.]: International Speech Communication Association, 2018: 3723-3727. |
[23] | CHAN W, JAITLY N, LE Q, et al. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition[C]// Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2016: 4960-4964. |
[24] | WU H L. Design and implementation of speech emotion recognition algorithm based on deep learning[D]. Harbin: Heilongjiang University, 2021. |
[25] | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010. |
[26] | GAO Z, LI Z, LUO J, et al. Short text aspect-based sentiment analysis based on CNN+BiGRU[J]. Applied Sciences, 2022, 12(5): No.2707. |
[27] | ZHU G, FAN Y, LI F, et al. GSRNet, an adversarial training-based deep framework with multi-scale CNN and BiGRU for predicting genomic signals and regions[J]. Expert Systems with Applications, 2023, 229(Pt A): No.120439. |
[28] | BUSSO C, BULUT M, LEE C C, et al. IEMOCAP: interactive emotional dyadic motion capture database[J]. Language Resources and Evaluation, 2008, 42(4): 335-359. |
[29] | WANG Y, YUAN Y B, GUO Y, et al. Sentiment boosting model for emotion recognition in conversation text[J]. Journal of Computer Applications, 2023, 43(3): 706-712. |
[30] | SADOK S, LEGLAIVE S, SÉGUIER R. A vector quantized masked autoencoder for speech emotion recognition[EB/OL]. [2024-10-17]. |
[31] | QIAN J Q, HUANG H M, ZHANG H Y. Speech emotion recognition based on ARCNN-GAP network[J]. Computer and Modernization, 2021(12): 91-95. |
[32] | MURUGAIYAN S, UYYALA S R. Aspect-based sentiment analysis of customer speech data using deep convolutional neural network and BiLSTM[J]. Cognitive Computation, 2023, 15(3): 914-931. |
[33] | DUTT A, GADER P. Wavelet multiresolution analysis based speech emotion recognition system using 1D CNN LSTM networks[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 2043-2054. |
[34] | NIZAMIDIN T, ZHAO L, LIANG R, et al. Siamese attention-based LSTM for speech emotion recognition[J]. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2020, E103-A(7): 937-941. |