Journal of Computer Applications, 2026, Vol. 46, Issue (1): 69-76. DOI: 10.11772/j.issn.1001-9081.2025010042

• Artificial intelligence •

Multi-feature fusion speech emotion recognition method based on SAA-CNN-BiLSTM network

Zhihui ZAN 1, Yajing WANG 1, Ke LI 1, Zhixiang YANG 2, Guangyu YANG 2

  1. School of Computer Science and Technology, Shandong University of Technology, Zibo, Shandong 255049, China
  2. School of Electrical and Electronic Engineering, Shandong University of Technology, Zibo, Shandong 255000, China
  • Received: 2025-01-13 Revised: 2025-03-25 Accepted: 2025-03-26 Online: 2026-01-10 Published: 2026-01-10
  • Contact: Yajing WANG
  • About author: ZAN Zhihui, born in 2001 in Cangzhou, Hebei, M. S. candidate. His research interests include speech emotion recognition.
    LI Ke, born in 1999 in Xinxiang, Henan, M. S. candidate. Her research interests include speech enhancement.
    YANG Zhixiang, born in 1998 in Dongying, Shandong, M. S. candidate. His research interests include optoelectronic precision testing.
    YANG Guangyu, born in 2002 in Dezhou, Shandong, M. S. candidate. His research interests include weak signal detection.
  • Supported by:
    Natural Science Foundation of Shandong Province(ZR2024MD031)


Abstract:

Aiming at the problems that a single speech emotion feature represents speech information incompletely and that models make low utilization of speech features, a multi-feature fusion speech emotion recognition method based on the SAA-CNN-BiLSTM network was proposed. In the proposed method, the data were augmented by introducing noise, volume, and audio-rate augmenters, enabling the model to learn diverse data features, and multiple features, such as fundamental-frequency, time-domain, and frequency-domain features, were fused to represent emotional information comprehensively from different perspectives. Besides, on the basis of the Bidirectional Long Short-Term Memory (BiLSTM) network, a Convolutional Neural Network (CNN) was introduced to capture the spatial correlation of the input data and extract more representative features. At the same time, a Simplified Additive Attention (SAA) mechanism was constructed to simplify away the explicit query keys and query vectors, so that the calculation of the attention weights did not depend on specific query information. Based on these attention weights, features of different dimensions were correlated with and influenced one another, so that information was exchanged and fused among the features, thereby improving the effective utilization of the features. Experimental results show that the proposed method achieves weighted precisions of 87.02%, 82.59%, and 73.13% on the EMO-DB, CASIA, and SAVEE datasets, respectively; compared with baseline methods such as Incremental Convolution (IncConv), Novel Heterogeneous Parallel Convolution BiLSTM (NHPC-BiLSTM), and Dynamic Convolutional Recurrent Neural Network (DCRNN), these results represent improvements of 0.52-9.80, 2.92-23.09, and 3.13-16.63 percentage points, respectively.
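The abstract names three augmenters and a query-free attention mechanism; the sketches below illustrate both ideas under stated assumptions, since the paper's exact settings are not given here. First, a minimal data-augmentation sketch, assuming librosa is available; the perturbation factors (0.005, 1.5, 1.2) are illustrative assumptions, not the paper's values:

```python
import numpy as np
import librosa

def augment(y: np.ndarray, sr: int) -> list[np.ndarray]:
    """Sketch of the three augmenters named in the abstract: additive
    noise, volume scaling, and audio-rate (time-stretch) change.
    All factors here are illustrative, not the paper's settings."""
    noisy = y + 0.005 * np.random.randn(len(y))           # noise injection
    louder = 1.5 * y                                       # volume scaling
    faster = librosa.effects.time_stretch(y, rate=1.2)    # audio-rate change
    return [noisy, louder, faster]
```

Second, one plausible reading of the SAA mechanism: standard additive attention with the explicit query dropped, so each time step is scored from its hidden state alone and the weights do not depend on query information. The layer names and sizes are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class SimplifiedAdditiveAttention(nn.Module):
    """Query-free additive attention sketch:
    e_t = v^T tanh(W h_t + b), alpha = softmax(e), pooled = sum_t alpha_t h_t."""
    def __init__(self, hidden_dim: int, attn_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, attn_dim)       # W h_t + b
        self.score = nn.Linear(attn_dim, 1, bias=False)   # v^T tanh(.)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, hidden_dim), e.g. BiLSTM outputs
        e = self.score(torch.tanh(self.proj(h)))   # (batch, time, 1)
        alpha = torch.softmax(e, dim=1)             # weights over time steps
        return (alpha * h).sum(dim=1)               # (batch, hidden_dim)

# Toy usage on BiLSTM-shaped outputs (all shapes are assumptions):
pooled = SimplifiedAdditiveAttention(hidden_dim=256)(torch.randn(8, 120, 256))
print(pooled.shape)  # torch.Size([8, 256])
```

Because the weights are computed from the hidden states themselves rather than from a query-key match, the features can be re-weighted and fused without task-specific query information, which is consistent with the abstract's description of the SAA design.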

Key words: speech emotion recognition, deep learning, multi-feature fusion, data augmentation, Long Short-Term Memory (LSTM) network, Simplified Additive Attention (SAA) mechanism

