Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (12): 3804-3812. DOI: 10.11772/j.issn.1001-9081.2024111704

• Artificial Intelligence •

CnnPRL: progressive representation learning method for speech emotion recognition

Yonghong FAN 1,2,3, Heming HUANG 1,2,3

  1. College of Computer, Qinghai Normal University, Xining, Qinghai 810008, China
    2. The State Key Laboratory of Tibetan Intelligent Information Processing and Application (Qinghai Normal University), Xining, Qinghai 810008, China
    3. Key Laboratory of Tibetan Information Processing, Ministry of Education (Qinghai Normal University), Xining, Qinghai 810008, China
  • Received: 2024-12-06  Revised: 2025-03-08  Accepted: 2025-03-18  Online: 2025-03-21  Published: 2025-12-10
  • Contact: Heming HUANG
  • About author: FAN Yonghong, born in 1997 in Wuzhong, Ningxia, Ph.D. candidate, CCF member. Her research interests include pattern recognition, intelligent systems, and speech emotion recognition.
    HUANG Heming, born in 1969 in Haidong, Qinghai, Ph.D., professor, CCF member. His research interests include pattern recognition, intelligent systems, speech emotion recognition, speech enhancement, and image processing.
  • Supported by:
    National Natural Science Foundation of China (62066039); Natural Science Foundation of Qinghai Province (2022-ZJ-925); “111” Project (D20035)

Abstract:

Speech Emotion Recognition (SER) aims to equip computers with the ability to identify the emotional states in speech signals accurately, and how to represent the emotional features in speech efficiently has long been a research focus of SER. Currently, most studies are devoted to learning optimal features directly from raw speech or spectrograms with deep learning methods. Such a learning paradigm can extract more complete feature information, but it overlooks the learning of deeper and more refined information within specific features and cannot guarantee the interpretability of the features. To address these problems, a Progressive Representation Learning method for SER based on Convolutional neural network (CnnPRL) was proposed, in which a Convolutional Neural Network (CNN) progressively extracts interpretable and refined emotional features on the basis of acoustic features of speech. Firstly, interpretable shallow features were extracted manually, and the optimal feature set was selected. Secondly, a cascaded CNN with a dynamic fusion structure was proposed to refine the shallow features and learn deep emotional representations. Finally, a parallel heterogeneous CNN was constructed to extract complementary features at different scales, and a fusion module was used to realize multi-feature fusion, capture multi-granularity features, and integrate deep emotional information from different feature scales. Experimental results on the IEMOCAP (Interactive EMOtional dyadic motion CAPture database), CASIA (Institute of Automation, Chinese Academy of Sciences), and EMODB (Berlin EMOtional DataBase) datasets show that, while keeping the time complexity under control, CnnPRL improves the Weighted Average Recall (WAR) by at least 0.86, 2.92, and 1.46 percentage points, respectively, compared with methods such as SpeechFormer++, TLFMRF (Two-Layer Fuzzy Multiple Random Forest), and TIM-Net (Temporal-aware bI-direction Multi-scale Network), which verifies the effectiveness of CnnPRL. Ablation results further show that each module of CnnPRL contributes to the overall performance of the model.
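To make the three-stage pipeline described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch. The abstract does not specify layer sizes, kernel widths, the dynamic-fusion rule, or the feature dimensionality, so every concrete choice here (the module names CascadedCNN, ParallelHeterogeneousCNN, and CnnPRLSketch, the sigmoid gate, the kernel sizes 3/5/7, a 39-dimensional input feature set, and 4 emotion classes) is an illustrative assumption, not the authors' implementation.

# Hypothetical sketch of the CnnPRL pipeline; all structural details assumed.
import torch
import torch.nn as nn

class CascadedCNN(nn.Module):
    """Stage 2 (assumed form): cascaded conv blocks that refine the
    hand-crafted shallow features, with a learned, input-dependent
    weight that dynamically fuses the two stages."""
    def __init__(self, in_ch: int, ch: int = 64):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv1d(in_ch, ch, 3, padding=1),
                                    nn.BatchNorm1d(ch), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv1d(ch, ch, 3, padding=1),
                                    nn.BatchNorm1d(ch), nn.ReLU())
        self.gate = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                  nn.Linear(ch, 1), nn.Sigmoid())

    def forward(self, x):                        # x: (B, in_ch, T)
        h1 = self.block1(x)
        h2 = self.block2(h1)
        w = self.gate(h2).unsqueeze(-1)          # dynamic fusion weight in (0, 1)
        return w * h2 + (1 - w) * h1             # (B, ch, T)

class ParallelHeterogeneousCNN(nn.Module):
    """Stage 3 (assumed form): parallel branches with different kernel
    sizes extract complementary multi-scale features; a 1x1 conv acts
    as the fusion module."""
    def __init__(self, ch: int = 64, scales=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(ch, ch, k, padding=k // 2) for k in scales)
        self.fuse = nn.Conv1d(ch * len(scales), ch, 1)

    def forward(self, x):                        # x: (B, ch, T)
        multi = torch.cat([b(x) for b in self.branches], dim=1)
        return self.fuse(multi)

class CnnPRLSketch(nn.Module):
    def __init__(self, n_feats: int = 39, n_emotions: int = 4):
        super().__init__()
        self.refine = CascadedCNN(n_feats)
        self.multiscale = ParallelHeterogeneousCNN()
        self.head = nn.Linear(64, n_emotions)

    def forward(self, feats):                    # feats: (B, n_feats, T)
        h = self.multiscale(self.refine(feats))
        return self.head(h.mean(dim=-1))         # pool over time, classify

# Stage 1 (manual extraction and selection of shallow acoustic features)
# happens before the network; a dummy 39-dim feature sequence stands in here.
logits = CnnPRLSketch()(torch.randn(2, 39, 300))
print(logits.shape)                              # torch.Size([2, 4])

One design point worth noting: the sigmoid gate makes the fusion weight input-dependent, which is one common way to realize "dynamic fusion"; the paper's actual mechanism may differ.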

Key words: Speech Emotion Recognition (SER), progressive emotional representation learning, Convolutional Neural Network (CNN), dynamic fusion, multi-scale fusion
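For readers unfamiliar with the reported metric: WAR, as commonly used in SER, is per-class recall weighted by class frequency, which is equivalent to overall accuracy. A minimal NumPy sketch (the function name is illustrative, not from the paper):

import numpy as np

def weighted_average_recall(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    war = 0.0
    for c in np.unique(y_true):
        mask = y_true == c
        recall_c = (y_pred[mask] == c).mean()    # recall of class c
        war += mask.mean() * recall_c            # weight by class frequency
    return war                                   # equals (y_pred == y_true).mean()

print(weighted_average_recall(np.array([0, 0, 1, 1]),
                              np.array([0, 1, 1, 1])))  # 0.75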

CLC number: