Journal of Computer Applications


Multi-feature fusion speech emotion recognition method based on SAA-CNN-BiLSTM network

昝志辉, 王雅静, 李珂, 杨智翔, 杨光宇

  1. Shandong University of Technology
  • Received: 2025-01-13  Revised: 2025-03-21  Online: 2025-04-27  Published: 2025-04-27
  • Corresponding author: 昝志辉

Abstract: To address the problems that a single speech emotion feature cannot fully represent speech information and that models underutilize speech features, a multi-feature fusion speech emotion recognition method based on an SAA-CNN-BiLSTM network was proposed. Noise, volume, and speed augmentation were introduced so that the model could learn diverse data characteristics, and fundamental frequency, time-domain, and frequency-domain features were fused to express emotional information comprehensively from different perspectives. On top of the BiLSTM network, a CNN was introduced to capture the spatial correlations of the input data and extract more representative features. A Simplified Additive Attention (SAA) mechanism was constructed in which the explicit query keys and query vectors were simplified away, so that the attention weights were computed without depending on specific query information; features of different dimensions could then be correlated and influence one another through these weights, allowing information to be exchanged and fused across features and improving their effective utilization. Experimental results show that the proposed method achieves accuracies of 87.02%, 82.59%, and 73.13% on the EMO-DB, CASIA, and SAVEE datasets, respectively, improving on methods such as IncConv, NHPC-BiLSTM, and DCRNN by 0.52-9.80, 2.92-23.09, and 3.13-16.63 percentage points on the three datasets, respectively.
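
For readers who want a concrete picture of the architecture described above, the sketch below (in PyTorch) shows one way a CNN front end, a BiLSTM, and a query-free "simplified" additive attention layer could be combined. All layer sizes, the 7-class output head, and other hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative sketch only: CNN + BiLSTM + simplified (query-free) additive attention.
# Layer widths, kernel size, and the 7-class head are assumptions for demonstration.
import torch
import torch.nn as nn

class SimplifiedAdditiveAttention(nn.Module):
    """Additive attention without an explicit query: weights come from the features alone."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1, bias=False)

    def forward(self, h):                              # h: (batch, time, dim)
        e = self.score(torch.tanh(self.proj(h)))       # per-frame scores, (batch, time, 1)
        alpha = torch.softmax(e, dim=1)                # attention weights over time
        return (alpha * h).sum(dim=1)                  # pooled utterance vector, (batch, dim)

class CNNBiLSTMSAA(nn.Module):
    def __init__(self, n_features=16, n_classes=7, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(                      # capture local spatial correlations
            nn.Conv1d(n_features, 64, kernel_size=5, padding=2),
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.bilstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.attn = SimplifiedAdditiveAttention(2 * hidden)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                              # x: (batch, time, n_features)
        z = self.cnn(x.transpose(1, 2)).transpose(1, 2)   # (batch, time//2, 64)
        h, _ = self.bilstm(z)                              # (batch, time//2, 2*hidden)
        return self.fc(self.attn(h))                      # class logits

logits = CNNBiLSTMSAA()(torch.randn(4, 200, 16))
print(logits.shape)                                    # torch.Size([4, 7])
```

The attention layer scores each time step from the BiLSTM states alone, so no separate query vector is needed; the softmax weights then pool the frame sequence into a single utterance-level representation for classification.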

Key words: speech emotion recognition, deep learning, multi-feature fusion, data augmentation, long short-term memory neural network, simplified additive attention mechanism
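
As a companion sketch, the data augmentation and multi-feature extraction named in the keywords (noise/volume/speed augmentation; fundamental frequency, time-domain, and frequency-domain features) could look roughly like the following with librosa. The noise level, gain, stretch rate, and the particular feature set (zero-crossing rate, RMS energy, MFCCs) are assumptions for illustration, not the authors' pipeline.

```python
# Illustrative sketch only: augmentation and fused pitch / time-domain / frequency-domain
# features; parameter values are assumptions, not the authors' settings.
import numpy as np
import librosa

def augment(y, sr, noise_level=0.005, gain=1.2, rate=1.1):
    """Return noise-, volume-, and speed-augmented copies of a waveform."""
    noisy  = y + noise_level * np.random.randn(len(y))    # additive noise
    louder = gain * y                                     # volume change
    faster = librosa.effects.time_stretch(y, rate=rate)   # speed change
    return noisy, louder, faster

def extract_features(y, sr, n_mfcc=13):
    """Fuse fundamental frequency, time-domain, and frequency-domain features per frame."""
    f0   = librosa.yin(y, fmin=65.0, fmax=2093.0, sr=sr)      # fundamental frequency
    zcr  = librosa.feature.zero_crossing_rate(y)[0]           # time domain
    rms  = librosa.feature.rms(y=y)[0]                        # time domain
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # frequency domain
    n = min(len(f0), len(zcr), len(rms), mfcc.shape[1])       # align frame counts
    return np.vstack([f0[:n], zcr[:n], rms[:n], mfcc[:, :n]]).T  # (frames, 3 + n_mfcc)

sr = 16000
y = np.random.randn(2 * sr)            # stand-in waveform; replace with a loaded utterance
feats = extract_features(y, sr)
print(feats.shape)                      # (frames, 16), matching the model sketch above
```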

CLC number: