
BigData2023-P00186 Monaural Speech Enhancement Based on Multi-Channel Information Aggregation and Collaborative Decoding

MO Shangbin1, WANG Wenjun2, DONG Ling2, GAO Shengxiang1, YU Zhengtao1

  1. Kunming University of Science and Technology
  2. Faculty of Information Engineering and Automation, Kunming University of Science and Technology
  • Received: 2023-08-25  Revised: 2023-09-05  Online: 2023-12-18
  • Corresponding author: GAO Shengxiang
  • Supported by:
    Yunnan High-Tech Industry Development Project; Yunnan Provincial Key Research and Development Plan; National Natural Science Foundation of China; Yunnan Basic Research Project; Talents and Platform Program of Science and Technology of Yunnan

Abstract: To address insufficient acoustic feature extraction and severe loss of decoding features in single-channel speech enhancement networks built on convolutional encoder-decoder architectures, this paper proposes a monaural speech enhancement network based on Multi-Channel Information Aggregation and Collaborative Decoding (MIACD). A dual-channel encoder fully extracts magnitude-spectrum and complex-spectrum features enriched with self-supervised learning (SSL) representations, and four Conformer layers model the extracted features along the time and frequency dimensions, respectively. Through residual connections, the magnitude and complex features extracted by the dual-channel encoder are introduced into a three-channel information aggregation decoder, where the proposed channel-time-frequency attention mechanism (CTF-Attention) adjusts the aggregated information according to the distribution of speech energy, effectively alleviating the severe loss of usable acoustic information during decoding. Experimental results on the public Voice Bank DEMAND dataset show that, compared with GaGNet, a collaborative learning framework for single-channel speech enhancement, MIACD improves the objective metric WB-PESQ by 5.1% and reaches an STOI of 96.7, validating that the proposed method fully exploits speech information for signal reconstruction, effectively suppresses noise, and improves speech intelligibility.
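
The abstract specifies what CTF-Attention does (re-weighting the decoder's aggregated features according to the speech energy distribution along the channel, time, and frequency axes) but not how it is built. The PyTorch sketch below is one minimal reading of such a mechanism; the class name, reduction ratio, kernel sizes, and the squeeze-and-excitation-style channel gate are illustrative assumptions, not the authors' published design.

```python
# A minimal sketch of a channel-time-frequency attention (CTF-Attention)
# block, assuming decoder features shaped (batch, channels, time, freq).
# All layer choices here are assumptions for illustration only.
import torch
import torch.nn as nn


class CTFAttention(nn.Module):
    """Re-weights features along the channel, time, and frequency axes."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # Channel gate: global pooling followed by a small bottleneck MLP.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
        # Time/frequency gates: 1-D convolutions over pooled energy profiles.
        self.time_conv = nn.Conv1d(1, 1, kernel_size=7, padding=3)
        self.freq_conv = nn.Conv1d(1, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, f = x.shape
        # Channel attention from the globally pooled energy of each channel.
        chan = self.channel_mlp(x.mean(dim=(2, 3))).view(b, c, 1, 1)
        x = x * chan
        # Energy profile over time (mean over channels and frequency).
        t_energy = x.mean(dim=(1, 3)).unsqueeze(1)          # (b, 1, t)
        t_gate = torch.sigmoid(self.time_conv(t_energy))    # (b, 1, t)
        x = x * t_gate.unsqueeze(-1)                        # broadcast over freq
        # Energy profile over frequency (mean over channels and time).
        f_energy = x.mean(dim=(1, 2)).unsqueeze(1)          # (b, 1, f)
        f_gate = torch.sigmoid(self.freq_conv(f_energy))    # (b, 1, f)
        return x * f_gate.unsqueeze(2)                      # broadcast over time


# Usage: gate a batch of aggregated decoder features (shapes illustrative).
attn = CTFAttention(channels=64)
out = attn(torch.randn(2, 64, 100, 257))  # (batch, channels, frames, bins)
```

Gating each axis from a pooled energy profile is the most direct way to realize "adjusting the aggregated information according to the speech energy distribution"; the published CTF-Attention may combine the three gates differently or derive them from learned projections rather than raw means.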

Key words: acoustic features, multi-channel information aggregation, dual-channel encoder, three-channel information aggregation decoder, channel-time-frequency attention mechanism

CLC number: