Journal of Computer Applications, 2025, Vol. 45, Issue 8: 2515-2521. DOI: 10.11772/j.issn.1001-9081.2024081142

• Artificial intelligence •

Speech emotion recognition method based on hybrid Siamese network with CNN and bidirectional GRU

Peng PENG, Ziting CAI, Wenling LIU, Caihua CHEN, Wei ZENG, Baolai HUANG

  1. College of Computer and Network Security (Model Software College), Chengdu University of Technology, Chengdu, Sichuan 610059, China
  • Received: 2024-08-16 Revised: 2024-11-04 Accepted: 2024-11-12 Online: 2024-11-19 Published: 2025-08-10
  • Contact: Wei ZENG
  • About author: PENG Peng, born in 1987, Ph.D., associate professor. His research interests include natural language processing and artificial intelligence.
    CAI Ziting, born in 2000, M.S. candidate. Her research interests include emotion recognition and artificial intelligence.
    LIU Wenling, born in 2001, M.S. candidate. Her research interests include emotion recognition and artificial intelligence.
    CHEN Caihua, born in 1989, Ph.D., research fellow. His research interests include artificial intelligence.
    HUANG Baolai, born in 2001, M.S. candidate. His research interests include emotion recognition and artificial intelligence.
  • Supported by:
    Sichuan Province Science and Technology Program (2023YFN0053)

基于CNN和双向GRU混合孪生网络的语音情感识别方法

彭鹏, 蔡子婷, 刘雯玲, 陈才华, 曾维, 黄宝来

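  1. College of Computer and Network Security (Model Software College), Chengdu University of Technology, Chengdu 610059
  • Corresponding author: Wei ZENG (曾维)
  • About author: PENG Peng (彭鹏), born in 1987, male, from Weinan, Shaanxi; associate professor, Ph.D. Research interests: natural language processing, artificial intelligence.
    CAI Ziting (蔡子婷), born in 2000, female, from Shangrao, Jiangxi; M.S. candidate. Research interests: emotion recognition, artificial intelligence.
    LIU Wenling (刘雯玲), born in 2001, female, from Pixian, Sichuan; M.S. candidate. Research interests: emotion recognition, artificial intelligence.
    CHEN Caihua (陈才华), born in 1989, male, from Deyang, Sichuan; research fellow, Ph.D. Research interests: artificial intelligence.
    HUANG Baolai (黄宝来), born in 2001, male, from Shangrao, Jiangxi; M.S. candidate. Research interests: emotion recognition, artificial intelligence.
  • Supported by:
    Sichuan Province Science and Technology Program (2023YFN0053)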

Abstract:

To address the low accuracy and poor generalization ability of existing Speech Emotion Recognition (SER) models, a hybrid Siamese Multi-scale CNN-BiGRU network was proposed. In this network, a Multi-Scale Feature Extractor (MSFE) and a Multi-Dimensional Attention (MDA) module were introduced to construct a Siamese network, and the model was trained on sample pairs, which increased the amount of training data, improved recognition accuracy, and enabled the model to better adapt to complex real-world application scenarios. Experimental results on the public IEMOCAP and EMO-DB datasets show that the recognition accuracy of the proposed model is 8.28 and 7.79 percentage points higher, respectively, than that of the CNN-BiGRU model. Furthermore, a customer service speech emotion dataset was constructed from recordings of real customer service conversations. On this dataset, the proposed model reaches a recognition accuracy of 87.85%, indicating that it has good generalization ability.

Key words: Speech Emotion Recognition (SER), Convolutional Neural Network (CNN), Bidirectional Gated Recurrent Unit (BiGRU), hybrid Siamese network, deep learning
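
The abstract states that training on sample pairs increases the amount of training data. The abstract does not say how pairs are formed, so the following is only a minimal sketch under assumptions: every unordered pair of utterances is used, and a pair is labeled 1 when both utterances carry the same emotion and 0 otherwise. The function name `make_pairs`, the utterance IDs, and the toy labels are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def make_pairs(samples):
    """Build Siamese-style training pairs (illustrative assumption, not the paper's procedure).

    samples: list of (utterance_id, emotion_label) tuples.
    Returns one (id_a, id_b, same_emotion) triple per unordered pair,
    where same_emotion is 1 if the two labels match and 0 otherwise.
    """
    return [
        (id_a, id_b, int(label_a == label_b))
        for (id_a, label_a), (id_b, label_b) in combinations(samples, 2)
    ]


# Hypothetical toy data: four labeled utterances.
samples = [
    ("utt1", "angry"),
    ("utt2", "happy"),
    ("utt3", "angry"),
    ("utt4", "neutral"),
]

pairs = make_pairs(samples)
print(f"{len(samples)} utterances -> {len(pairs)} pairs")
for pair in pairs:
    print(pair)
```

With n labeled utterances, this pairing yields n(n-1)/2 training examples, which is one way the pair formulation can enlarge the effective training set; a practical pipeline would likely subsample or balance matching and non-matching pairs.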

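Abstract:

To address the low accuracy and poor generalization ability of existing speech emotion recognition (SER) models, a Siamese Multi-scale CNN-BiGRU network is proposed. The network builds a Siamese architecture by introducing a Multi-Scale Feature Extractor (MSFE) and a Multi-Dimensional Attention (MDA) module, and uses sample pairs to increase the amount of training, which improves recognition accuracy and lets the model adapt better to complex real-world application scenarios. Experimental results on the two public datasets IEMOCAP and EMO-DB show that the proposed model improves recognition accuracy over CNN-BiGRU by 8.28 and 7.79 percentage points, respectively. In addition, a customer service speech emotion dataset was built from recordings of real customer service conversations; on this dataset the proposed model reaches a recognition accuracy of 87.85%, showing that it generalizes well.

Key words: speech emotion recognition, convolutional neural network, bidirectional GRU, hybrid Siamese network, deep learning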
