《计算机应用》 (Journal of Computer Applications) — official website


CnnPRL: a progressive representation learning method for speech emotion recognition

FAN Yonghong, HUANG Heming

  1. Qinghai Normal University
  • Received: 2024-12-06  Revised: 2025-03-08  Accepted: 2025-03-18  Online: 2025-03-21  Published: 2025-03-21
  • Corresponding author: HUANG Heming
  • Supported by:
    Natural Science Foundation of Qinghai Province; National Natural Science Foundation of China; the "111" Project

CnnPRL: a progressive representation learning method for speech emotion recognition




Abstract: Speech emotion recognition aims to equip computers with the ability to accurately identify the emotional states conveyed in speech signals, and how to efficiently represent emotional information in speech has long been a key research focus. Currently, most research is dedicated to leveraging deep learning methods to learn optimal features directly from raw speech or spectrograms. These approaches can extract more comprehensive feature information, but they may overlook the learning of more refined information within specific features and cannot guarantee feature interpretability. To address these problems, CnnPRL, a Convolutional Neural Network (CNN)-based progressive representation learning method, is proposed for speech emotion recognition; starting from acoustic features of speech, it uses CNNs to progressively extract interpretable, refined emotional features. Firstly, interpretable shallow features are hand-crafted and the optimal feature set is selected. Secondly, a cascaded convolutional network with a dynamic fusion structure is proposed to refine the shallow features and learn deep emotional representations. Finally, a parallel heterogeneous convolutional network is constructed to extract complementary features at different scales, and a fusion module performs multi-feature fusion, capturing multi-granularity features and integrating deep emotional information from different feature scales. With comparable time complexity, experimental results on the IEMOCAP, CASIA, and EMODB datasets show that CnnPRL improves the WAR (Weighted Average Recall) metric by 1.63%, 2.92%, and 2.82% over BiGRU-Focal, TLFMRF, and TIM-Net, respectively, demonstrating the effectiveness of the method. Ablation experiments show that each module of CnnPRL contributes positively to the overall performance of the model.
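The three-stage pipeline the abstract describes (refining hand-crafted shallow features with cascaded convolutions, then extracting complementary features through parallel branches at different temporal scales, then fusing them dynamically) can be sketched as follows. This is a minimal NumPy illustration only: the kernel values, branch scales, gate parameters, and all function names are illustrative assumptions, not the paper's actual CnnPRL architecture.

```python
import numpy as np

def conv1d(x, kernel):
    """Valid 1-D convolution along the time axis, applied per feature channel.
    x: (T, F) acoustic feature matrix; kernel: (k,) shared temporal filter."""
    k = len(kernel)
    T, F = x.shape
    out = np.zeros((T - k + 1, F))
    for t in range(T - k + 1):
        out[t] = kernel @ x[t:t + k]  # weighted sum over a k-frame window
    return out

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cnnprl_sketch(x, gates=np.array([0.0, 0.0])):
    """Toy progressive pipeline: x is (T, F), e.g. T frames of F
    hand-crafted acoustic features (MFCC-like); returns an (F,) embedding."""
    # Stage 1: cascaded convolutions refine the shallow features
    smooth = np.array([0.25, 0.5, 0.25])
    h = np.maximum(conv1d(x, smooth), 0)   # ReLU after each stage
    h = np.maximum(conv1d(h, smooth), 0)
    # Stage 2: parallel heterogeneous branches at two temporal scales
    b_short = np.maximum(conv1d(h, np.ones(3) / 3), 0).mean(axis=0)
    b_long = np.maximum(conv1d(h, np.ones(7) / 7), 0).mean(axis=0)
    # Stage 3: dynamic fusion via softmax-normalized gate weights
    w = softmax(gates)
    return w[0] * b_short + w[1] * b_long
```

With equal gate logits the two scales are weighted equally; in a trained model the gates would be learned, letting the network shift weight toward whichever temporal scale is more informative for a given utterance.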

Key words: speech emotion recognition, progressive emotional representation learning, convolutional neural network, dynamic fusion, multi-scale fusion
