Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (8): 2515-2521. DOI: 10.11772/j.issn.1001-9081.2024081142
• Artificial Intelligence •
Peng PENG, Ziting CAI, Wenling LIU, Caihua CHEN, Wei ZENG, Baolai HUANG
Received:
2024-08-16
Revised:
2024-11-04
Accepted:
2024-11-12
Online:
2024-11-19
Published:
2025-08-10
Contact:
Wei ZENG
About author:
PENG Peng, born in 1987 in Weinan, Shaanxi, Ph. D., associate professor. His research interests include natural language processing and artificial intelligence.
Abstract:
To address the low accuracy and poor generalization ability of existing Speech Emotion Recognition (SER) models, a Siamese Multi-scale CNN-BiGRU network was proposed. The network builds a Siamese architecture by introducing a Multi-Scale Feature Extractor (MSFE) and a Multi-Dimensional Attention (MDA) module, and enlarges the effective amount of training data by learning from sample pairs, which improves recognition accuracy and helps the model adapt to complex real-world application scenarios. Experimental results on two public datasets, IEMOCAP and EMO-DB, show that the proposed model improves recognition precision over CNN-BiGRU by 8.28 and 7.79 percentage points, respectively. In addition, a Customer Service Speech Emotion Dataset (CSSED) was built from recordings of real customer-service conversations; on this dataset the proposed model achieves a recognition precision of 87.85%, demonstrating its good generalization ability.
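To make the described architecture concrete, the following is a minimal PyTorch sketch of one possible interpretation: an MSFE built from parallel convolutions with different kernel sizes, a BiGRU over the pooled feature maps, a simple temporal attention standing in for the MDA module, and a weight-sharing Siamese wrapper trained with a contrastive loss on sample pairs. The input shape, layer sizes, kernel set, and attention design are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of one interpretation of the architecture, assuming
# log-Mel spectrogram inputs of shape (batch, 1, n_mels, frames).
# Layer sizes, the MSFE kernel set, and the simple temporal attention
# standing in for MDA are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSFE(nn.Module):
    """Multi-scale feature extractor: parallel convolutions with different
    kernel sizes whose outputs are concatenated along the channel axis."""
    def __init__(self, in_ch, out_ch, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):
        return torch.cat([F.relu(b(x)) for b in self.branches], dim=1)

class Branch(nn.Module):
    """One Siamese branch: MSFE -> pooling -> BiGRU over time -> attention
    pooling -> fixed-size utterance embedding."""
    def __init__(self, n_mels=64, hidden=128, embed=128):
        super().__init__()
        self.msfe = MSFE(1, 16)                    # 3 scales -> 48 channels
        self.pool = nn.MaxPool2d(2)
        self.gru = nn.GRU(48 * (n_mels // 2), hidden,
                          batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)       # attention over frames
        self.proj = nn.Linear(2 * hidden, embed)

    def forward(self, x):                          # x: (B, 1, n_mels, T)
        h = self.pool(self.msfe(x))                # (B, 48, n_mels/2, T/2)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)
        h, _ = self.gru(h)                         # (B, T/2, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)     # frame weights
        return self.proj((w * h).sum(dim=1))       # (B, embed)

class SiameseSER(nn.Module):
    """Weight-sharing pair model: both inputs pass through the same branch."""
    def __init__(self):
        super().__init__()
        self.branch = Branch()

    def forward(self, xa, xb):
        return self.branch(xa), self.branch(xb)

def contrastive_loss(ea, eb, same, margin=1.0):
    """same = 1.0 if the pair shares an emotion label, else 0.0."""
    d = F.pairwise_distance(ea, eb)
    return (same * d.pow(2) + (1.0 - same) * F.relu(margin - d).pow(2)).mean()
```

Under this objective, same-emotion utterances are pulled together and different-emotion utterances are pushed at least a margin apart in the embedding space; a lightweight classifier over the learned embeddings would then assign the final emotion label.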
CLC Number:
Peng PENG, Ziting CAI, Wenling LIU, Caihua CHEN, Wei ZENG, Baolai HUANG. Speech emotion recognition method based on hybrid Siamese network with CNN and bidirectional GRU[J]. Journal of Computer Applications, 2025, 45(8): 2515-2521.
Tab. 1 IEMOCAP English speech emotion database after data expansion

| Emotion category | Training samples | Validation samples | Test samples | Total samples |
| --- | --- | --- | --- | --- |
| Neutral | 5 124 | 808 | 900 | 6 832 |
| Sad | 3 321 | 545 | 562 | 4 428 |
| Angry | 3 362 | 524 | 526 | 4 412 |
| Happy | 5 021 | 814 | 709 | 6 544 |
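The pair-based training that the abstract credits for the enlarged training volume can draw pairs from the expanded samples in Tab. 1: two utterances form a positive pair when they share an emotion class and a negative pair otherwise, so N single samples can yield up to N(N-1)/2 distinct pairs. A hypothetical sketch of this step (the function name and the random sampling strategy are assumptions, not the paper's procedure):

```python
# Hypothetical construction of labeled training pairs for the Siamese model:
# label 1 for same-emotion pairs, 0 for different-emotion pairs.
import random

def make_pairs(features, labels, n_pairs, seed=0):
    rng = random.Random(seed)
    idx = range(len(features))
    pairs = []
    for _ in range(n_pairs):
        i, j = rng.sample(idx, 2)  # two distinct utterances
        pairs.append((features[i], features[j], int(labels[i] == labels[j])))
    return pairs
```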
Tab. 2 Comparison of recognition performance of different models on the noisy IEMOCAP dataset (%)

| Model | Accuracy | Precision | Recall | F1 score |
| --- | --- | --- | --- | --- |
| CNN-BiGRU | 78.91 | 78.85 | 78.04 | 78.44 |
| Multi-scale CNN-BiGRU | 80.82 | 80.76 | 81.03 | 80.89 |
| Multi-scale CNN-BiGRU + MDA | 83.62 | 83.58 | 83.65 | 83.61 |
| VQ-MAE-S | 84.11 | 84.09 | 84.02 | 84.43 |
| Siamese Multi-scale CNN-BiGRU | 87.38 | 87.13 | 86.82 | 86.97 |
Tab. 3 Comparison of recognition performance of different models on the noisy EMO-DB dataset (%)

| Model | Accuracy | Precision | Recall | F1 score |
| --- | --- | --- | --- | --- |
| CNN-BiGRU | 76.25 | 76.13 | 76.42 | 76.39 |
| Multi-scale CNN-BiGRU | 79.02 | 79.25 | 79.12 | 79.19 |
| Multi-scale CNN-BiGRU + MDA | 81.29 | 81.19 | 81.23 | 81.21 |
| VQ-MAE-S | 84.70 | 84.46 | 84.29 | 84.83 |
| Siamese Multi-scale CNN-BiGRU | 84.08 | 83.92 | 83.82 | 83.85 |
Tab. 4 Comparison of recognition performance of different models on CSSED (%)

| Model | Accuracy | Precision | Recall | F1 score |
| --- | --- | --- | --- | --- |
| CNN-BiGRU | 79.98 | 79.92 | 78.31 | 78.69 |
| Multi-scale CNN-BiGRU | 82.73 | 82.69 | 82.14 | 82.49 |
| Multi-scale CNN-BiGRU + MDA | 85.33 | 85.21 | 85.30 | 85.27 |
| Multi-scale CNN-BiGRU + MSFE | 86.08 | 85.76 | 85.92 | 85.86 |
| Siamese Multi-scale CNN-BiGRU | 87.81 | 87.85 | 87.81 | 87.83 |
Tab. 5 Accuracy of different models on CSSED without speech enhancement (%)

| Model | Positive | Neutral | Negative |
| --- | --- | --- | --- |
| ARCNN-GAP | 76.25 | 79.20 | 77.43 |
| CNN+Bi-LSTM | 77.63 | 78.03 | 80.17 |
| MLANet | 77.32 | 79.89 | 80.11 |
| Siamese LSTM | 82.73 | 81.39 | 82.80 |
| Siamese Multi-scale CNN-BiGRU | 86.27 | 84.63 | 87.60 |
Tab. 6 Accuracy of different models on CSSED with speech enhancement (%)

| Model | Positive | Neutral | Negative |
| --- | --- | --- | --- |
| ARCNN-GAP | 78.76 | 82.03 | 79.74 |
| CNN+Bi-LSTM | 80.10 | 80.13 | 80.92 |
| MLANet | 79.89 | 81.42 | 81.26 |
| Siamese LSTM | 85.26 | 82.81 | 87.15 |
| Siamese Multi-scale CNN-BiGRU | 88.75 | 85.09 | 89.12 |
[1] ZHAO S, JIA G, YANG J, et al. Emotion recognition from multiple modalities: fundamentals and methodologies[J]. IEEE Signal Processing Magazine, 2021, 38(6): 59-73.
[2] SHEN S, GAO Y, LIU F, et al. Emotion neural transducer for fine-grained speech emotion recognition[C]// Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2024: 10111-10115.
[3] BHANGALE K, KOTHANDARAMAN M. Speech emotion recognition based on multiple acoustic features and deep convolutional neural network[J]. Electronics, 2023, 12(4): No.839.
[4] ULGEN I R, DU Z, BUSSO C, et al. Revealing emotional clusters in speaker embeddings: a contrastive learning strategy for speech emotion recognition[C]// Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2024: 12081-12085.
[5] YI L, MAK M W. Improving speech emotion recognition with adversarial data augmentation network[J]. IEEE Transactions on Neural Networks and Learning Systems, 2022, 33(1): 172-184.
[6] WANG B, WANG D. Plant leaves classification: a few-shot learning method based on Siamese network[J]. IEEE Access, 2019, 7: 151754-151763.
[7] NIU Z, ZHONG G, YU H. A review on the attention mechanism of deep learning[J]. Neurocomputing, 2021, 452: 48-62.
[8] LI D, LIU J, YANG Z, et al. Speech emotion recognition using recurrent neural networks with directional self-attention[J]. Expert Systems with Applications, 2021, 173: No.114683.
[9] XU H, ZHANG H, HAN K, et al. Learning alignment for multimodal emotion recognition from speech[C]// Proceedings of INTERSPEECH 2019. [S.l.]: International Speech Communication Association, 2019: 3569-3573.
[10] SIRIWARDHANA S, KALUARACHCHI T, BILLINGHURST M, et al. Multimodal emotion recognition with Transformer-based self supervised feature fusion[J]. IEEE Access, 2020, 8: 176274-176285.
[11] HO N H, YANG H J, KIM S H, et al. Multimodal approach of speech emotion recognition using multi-level multi-head fusion attention-based recurrent neural network[J]. IEEE Access, 2020, 8: 61672-61686.
[12] LIU K, WANG D, WU D, et al. Speech emotion recognition via multi-level attention network[J]. IEEE Signal Processing Letters, 2022, 29: 2278-2282.
[13] YANG L, ZHAO H D, YU K K. End-to-end speech emotion recognition based on multi-head attention[J]. Journal of Computer Applications, 2022, 42(6): 1869-1875.
[14] DE LOPE J, GRAÑA M. An ongoing review of speech emotion recognition[J]. Neurocomputing, 2023, 528: 1-11.
[15] LIU Z T, WU B H, LI D Y, et al. Speech emotion recognition based on selective interpolation synthetic minority over-sampling technique in small sample environment[J]. Sensors, 2020, 20(8): No.2297.
[16] CHEN S, WANG J, WANG J, et al. MDAM: multi-dimensional attention module for anomalous sound detection[C]// Proceedings of the 2023 International Conference on Neural Information Processing, CCIS 1967. Singapore: Springer, 2024: 48-60.
[17] LIU R. Convolutional Siamese network-based few-shot learning for monkeypox detection under data scarcity[C]// Proceedings of the SPIE 12611, 2nd International Conference on Biological Engineering and Medical Science. Bellingham, WA: SPIE, 2023: No.126115O.
[18] FENG K, CHASPARI T. Few-shot learning in emotion recognition of spontaneous speech using a Siamese neural network with adaptive sample pair formation[J]. IEEE Transactions on Affective Computing, 2023, 14(2): 1627-1633.
[19] TORRES L, MONTEIRO N, OLIVEIRA J, et al. Exploring a Siamese neural network architecture for one-shot drug discovery[C]// Proceedings of the IEEE 20th International Conference on Bioinformatics and Bioengineering. Piscataway: IEEE, 2020: 168-175.
[20] XU L, MA H, GUAN Y, et al. A Siamese network with node convolution for individualized predictions based on connectivity maps extracted from resting-state fMRI data[J]. IEEE Journal of Biomedical and Health Informatics, 2023, 27(11): 5418-5429.
[21] JIANG J J, LIU D W, LIU Y F, et al. Few-shot object detection algorithm based on Siamese network[J]. Journal of Computer Applications, 2023, 43(8): 2325-2329.
[22] SPERBER M, NIEHUES J, NEUBIG G, et al. Self-attentional acoustic models[C]// Proceedings of INTERSPEECH 2018. [S.l.]: International Speech Communication Association, 2018: 3723-3727.
[23] CHAN W, JAITLY N, LE Q, et al. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition[C]// Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2016: 4960-4964.
[24] WU H L. Design and implementation of speech emotion recognition algorithm based on deep learning[D]. Harbin: Heilongjiang University, 2021.
[25] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010.
[26] GAO Z, LI Z, LUO J, et al. Short text aspect-based sentiment analysis based on CNN+BiGRU[J]. Applied Sciences, 2022, 12(5): No.2707.
[27] ZHU G, FAN Y, LI F, et al. GSRNet, an adversarial training-based deep framework with multi-scale CNN and BiGRU for predicting genomic signals and regions[J]. Expert Systems with Applications, 2023, 229(Pt A): No.120439.
[28] BUSSO C, BULUT M, LEE C C, et al. IEMOCAP: interactive emotional dyadic motion capture database[J]. Language Resources and Evaluation, 2008, 42(4): 335-359.
[29] WANG Y, YUAN Y B, GUO Y, et al. Sentiment boosting model for emotion recognition in conversation text[J]. Journal of Computer Applications, 2023, 43(3): 706-712.
[30] SADOK S, LEGLAIVE S, SÉGUIER R. A vector quantized masked autoencoder for speech emotion recognition[EB/OL]. [2024-10-17].
[31] QIAN J Q, HUANG H M, ZHANG H Y. Speech emotion recognition based on ARCNN-GAP network[J]. Computer and Modernization, 2021(12): 91-95.
[32] MURUGAIYAN S, UYYALA S R. Aspect-based sentiment analysis of customer speech data using deep convolutional neural network and BiLSTM[J]. Cognitive Computation, 2023, 15(3): 914-931.
[33] DUTT A, GADER P. Wavelet multiresolution analysis based speech emotion recognition system using 1D CNN LSTM networks[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 2043-2054.
[34] NIZAMIDIN T, ZHAO L, LIANG R, et al. Siamese attention-based LSTM for speech emotion recognition[J]. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2020, E103-A(7): 937-941.