Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (8): 2515-2521.DOI: 10.11772/j.issn.1001-9081.2024081142
• Artificial intelligence •
Peng PENG, Ziting CAI, Wenling LIU, Caihua CHEN, Wei ZENG, Baolai HUANG
Received:
2024-08-16
Revised:
2024-11-04
Accepted:
2024-11-12
Online:
2024-11-19
Published:
2025-08-10
Contact:
Wei ZENG
About author:
PENG Peng, born in 1987 in Weinan, Shaanxi, Ph. D., associate professor. His research interests include natural language processing and artificial intelligence.
Peng PENG, Ziting CAI, Wenling LIU, Caihua CHEN, Wei ZENG, Baolai HUANG. Speech emotion recognition method based on hybrid Siamese network with CNN and bidirectional GRU[J]. Journal of Computer Applications, 2025, 45(8): 2515-2521.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2024081142
| Emotion category | Training set | Validation set | Test set | Total |
| --- | --- | --- | --- | --- |
| Neutral | 5 124 | 808 | 900 | 6 832 |
| Sad | 3 321 | 545 | 562 | 4 428 |
| Angry | 3 362 | 524 | 526 | 4 412 |
| Happy | 5 021 | 814 | 709 | 6 544 |

Tab. 1 Data-expanded IEMOCAP English speech emotion database
| Model | Accuracy/% | Precision/% | Recall/% | F1/% |
| --- | --- | --- | --- | --- |
| CNN-BiGRU | 78.91 | 78.85 | 78.04 | 78.44 |
| Multi-scale CNN-BiGRU | 80.82 | 80.76 | 81.03 | 80.89 |
| Multi-scale CNN-BiGRU+MDA | 83.62 | 83.58 | 83.65 | 83.61 |
| VQ-MAE-S | 84.11 | 84.09 | 84.02 | 84.43 |
| Siamese Multi-scale CNN-BiGRU | 87.38 | 87.13 | 86.82 | 86.97 |

Tab. 2 Comparison of recognition effects of different models on noisy IEMOCAP dataset
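The tables above report accuracy, precision, recall, and F1 as percentages. As a point of reference, these metrics can be computed from per-class counts as in the following minimal sketch (plain Python; macro averaging across emotion classes is an assumption here, since the paper does not restate its aggregation scheme):

```python
def macro_metrics(y_true, y_pred, labels):
    """Accuracy plus macro-averaged precision, recall and F1, all in %."""
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precs, recs, f1s = [], [], []
    for c in labels:
        # Per-class counts: true positives, false positives, false negatives
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec)
        recs.append(rec)
        f1s.append(f1)
    n = len(labels)
    return (100 * acc, 100 * sum(precs) / n, 100 * sum(recs) / n, 100 * sum(f1s) / n)
```

Equivalent results are available from `sklearn.metrics.precision_recall_fscore_support` with `average="macro"`.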
| Model | Accuracy/% | Precision/% | Recall/% | F1/% |
| --- | --- | --- | --- | --- |
| CNN-BiGRU | 76.25 | 76.13 | 76.42 | 76.39 |
| Multi-scale CNN-BiGRU | 79.02 | 79.25 | 79.12 | 79.19 |
| Multi-scale CNN-BiGRU+MDA | 81.29 | 81.19 | 81.23 | 81.21 |
| VQ-MAE-S | 84.70 | 84.46 | 84.29 | 84.83 |
| Siamese Multi-scale CNN-BiGRU | 84.08 | 83.92 | 83.82 | 83.85 |

Tab. 3 Comparison of recognition effects of different models on noisy EMO-DB dataset
| Model | Accuracy/% | Precision/% | Recall/% | F1/% |
| --- | --- | --- | --- | --- |
| CNN-BiGRU | 79.98 | 79.92 | 78.31 | 78.69 |
| Multi-scale CNN-BiGRU | 82.73 | 82.69 | 82.14 | 82.49 |
| Multi-scale CNN-BiGRU+MDA | 85.33 | 85.21 | 85.30 | 85.27 |
| Multi-scale CNN-BiGRU+MSFE | 86.08 | 85.76 | 85.92 | 85.86 |
| Siamese Multi-scale CNN-BiGRU | 87.81 | 87.85 | 87.81 | 87.83 |

Tab. 4 Comparison of recognition effects of different models on CSSED
| Model | Positive/% | Neutral/% | Negative/% |
| --- | --- | --- | --- |
| ARCNN-GAP | 76.25 | 79.20 | 77.43 |
| CNN+Bi-LSTM | 77.63 | 78.03 | 80.17 |
| MLANet | 77.32 | 79.89 | 80.11 |
| Siamese LSTM | 82.73 | 81.39 | 82.80 |
| Siamese Multi-scale CNN-BiGRU | 86.27 | 84.63 | 87.60 |

Tab. 5 Accuracy of different models on CSSED (without speech enhancement)
| Model | Positive/% | Neutral/% | Negative/% |
| --- | --- | --- | --- |
| ARCNN-GAP | 78.76 | 82.03 | 79.74 |
| CNN+Bi-LSTM | 80.10 | 80.13 | 80.92 |
| MLANet | 79.89 | 81.42 | 81.26 |
| Siamese LSTM | 85.26 | 82.81 | 87.15 |
| Siamese Multi-scale CNN-BiGRU | 88.75 | 85.09 | 89.12 |

Tab. 6 Accuracy of different models on CSSED (with speech enhancement)
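The Siamese models compared above are trained on pairs of samples, pulling same-emotion pairs together and pushing different-emotion pairs apart in the embedding space. A minimal NumPy sketch of the standard contrastive objective commonly used for such setups follows (the Hadsell-style formulation and the `margin` value are generic assumptions, not necessarily the exact loss used in this paper):

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Contrastive loss over a batch of embedding pairs.

    emb_a, emb_b: (batch, dim) embeddings from the two Siamese branches.
    same: (batch,) labels, 1 for same-emotion pairs, 0 for different pairs.
    Same pairs are penalized by squared distance; different pairs are
    pushed apart until their distance exceeds `margin`.
    """
    d = np.linalg.norm(emb_a - emb_b, axis=1)  # Euclidean distance per pair
    loss = same * d**2 + (1 - same) * np.maximum(0.0, margin - d)**2
    return loss.mean()
```

In a full pipeline, both branches would share one multi-scale CNN-BiGRU encoder, and this loss would be combined with a classification head over the emotion labels.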
[1] | ZHAO S, JIA G, YANG J, et al. Emotion recognition from multiple modalities: fundamentals and methodologies[J]. IEEE Signal Processing Magazine, 2021, 38(6): 59-73. |
[2] | SHEN S, GAO Y, LIU F, et al. Emotion neural transducer for fine-grained speech emotion recognition[C]// Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2024: 10111-10115. |
[3] | BHANGALE K, KOTHANDARAMAN M. Speech emotion recognition based on multiple acoustic features and deep convolutional neural network[J]. Electronics, 2023, 12(4): No.839. |
[4] | ULGEN I R, DU Z, BUSSO C, et al. Revealing emotional clusters in speaker embeddings: a contrastive learning strategy for speech emotion recognition[C]// Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2024: 12081-12085. |
[5] | YI L, MAK M W. Improving speech emotion recognition with adversarial data augmentation network[J]. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(1): 172-184. |
[6] | WANG B, WANG D. Plant leaves classification: a few-shot learning method based on Siamese network[J]. IEEE Access, 2019, 7: 151754-151763. |
[7] | NIU Z, ZHONG G, YU H. A review on the attention mechanism of deep learning[J]. Neurocomputing, 2021, 452: 48-62. |
[8] | LI D, LIU J, YANG Z, et al. Speech emotion recognition using recurrent neural networks with directional self-attention[J]. Expert Systems with Applications, 2021, 173: No.114683. |
[9] | XU H, ZHANG H, HAN K, et al. Learning alignment for multimodal emotion recognition from speech[C]// Proceedings of the INTERSPEECH 2019. [S.l.]: International Speech Communication Association, 2019: 3569-3573. |
[10] | SIRIWARDHANA S, KALUARACHCHI T, BILLINGHURST M, et al. Multimodal emotion recognition with Transformer-based self supervised feature fusion[J]. IEEE Access, 2020, 8: 176274-176285. |
[11] | HO N H, YANG H J, KIM S H, et al. Multimodal approach of speech emotion recognition using multi-level multi-head fusion attention-based recurrent neural network[J]. IEEE Access, 2020, 8: 61672-61686. |
[12] | LIU K, WANG D, WU D, et al. Speech emotion recognition via multi-level attention network[J]. IEEE Signal Processing Letters, 2022, 29: 2278-2282. |
[13] | YANG L, ZHAO H D, YU K K. End-to-end speech emotion recognition based on multi-head attention[J]. Journal of Computer Applications, 2022, 42(6): 1869-1875. |
[14] | DE LOPE J, GRAÑA M. An ongoing review of speech emotion recognition[J]. Neurocomputing, 2023, 528: 1-11. |
[15] | LIU Z T, WU B H, LI D Y, et al. Speech emotion recognition based on selective interpolation synthetic minority over-sampling technique in small sample environment[J]. Sensors, 2020, 20(8): No.2297. |
[16] | CHEN S, WANG J, WANG J, et al. MDAM: multi-dimensional attention module for anomalous sound detection[C]// Proceedings of the 2023 International Conference on Neural Information Processing, CCIS 1967. Singapore: Springer, 2024: 48-60. |
[17] | LIU R. Convolutional Siamese network-based few-shot learning for monkeypox detection under data scarcity[C]// Proceedings of the SPIE 12611, 2nd International Conference on Biological Engineering and Medical Science. Bellingham, WA: SPIE, 2023: No.126115O. |
[18] | FENG K, CHASPARI T. Few-shot learning in emotion recognition of spontaneous speech using a Siamese neural network with adaptive sample pair formation[J]. IEEE Transactions on Affective Computing, 2023, 14(2): 1627-1633. |
[19] | TORRES L, MONTEIRO N, OLIVEIRA J, et al. Exploring a Siamese neural network architecture for one-shot drug discovery[C]// Proceedings of the IEEE 20th International Conference on Bioinformatics and Bioengineering. Piscataway: IEEE, 2020: 168-175. |
[20] | XU L, MA H, GUAN Y, et al. A Siamese network with node convolution for individualized predictions based on connectivity Maps extracted from resting-state fMRI data[J]. IEEE Journal of Biomedical and Health Informatics, 2023, 27(11): 5418-5429. |
[21] | JIANG J J, LIU D W, LIU Y F, et al. Few-shot object detection algorithm based on Siamese network[J]. Journal of Computer Applications, 2023, 43(8): 2325-2329. |
[22] | SPERBER M, NIEHUES J, NEUBIG G, et al. Self-attentional acoustic models[C]// Proceedings of the INTERSPEECH 2018. [S.l.]: International Speech Communication Association, 2018: 3723-3727. |
[23] | CHAN W, JAITLY N, LE Q, et al. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition[C]// Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2016: 4960-4964. |
[24] | WU H L. Design and implementation of speech emotion recognition algorithm based on deep learning[D]. Harbin: Heilongjiang University, 2021. |
[25] | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010. |
[26] | GAO Z, LI Z, LUO J, et al. Short text aspect-based sentiment analysis based on CNN+BiGRU[J]. Applied Sciences, 2022, 12(5): No.2707. |
[27] | ZHU G, FAN Y, LI F, et al. GSRNet, an adversarial training-based deep framework with multi-scale CNN and BiGRU for predicting genomic signals and regions[J]. Expert Systems with Applications, 2023, 229(Pt A): No.120439. |
[28] | BUSSO C, BULUT M, LEE C C, et al. IEMOCAP: interactive emotional dyadic motion capture database[J]. Language Resources and Evaluation, 2008, 42(4): 335-359. |
[29] | WANG Y, YUAN Y B, GUO Y, et al. Sentiment boosting model for emotion recognition in conversation text[J]. Journal of Computer Applications, 2023, 43(3): 706-712. |
[30] | SADOK S, LEGLAIVE S, SÉGUIER R. A vector quantized masked autoencoder for speech emotion recognition[EB/OL]. [2024-10-17]. |
[31] | QIAN J Q, HUANG H M, ZHANG H Y. Speech emotion recognition based on ARCNN-GAP network[J]. Computer and Modernization, 2021(12): 91-95. |
[32] | MURUGAIYAN S, UYYALA S R. Aspect-based sentiment analysis of customer speech data using deep convolutional neural network and BiLSTM[J]. Cognitive Computation, 2023, 15(3): 914-931. |
[33] | DUTT A, GADER P. Wavelet multiresolution analysis based speech emotion recognition system using 1D CNN LSTM networks[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 2043-2054. |
[34] | NIZAMIDIN T, ZHAO L, LIANG R, et al. Siamese attention-based LSTM for speech emotion recognition[J]. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2020, E103-A(7): 937-941. |