《计算机应用》唯一官方网站 ›› 2024, Vol. 44 ›› Issue (8): 2611-2617.DOI: 10.11772/j.issn.1001-9081.2023081141

• 多媒体计算与计算机仿真 •

基于多路信息聚合协同解码的单通道语音增强

莫尚斌1,2, 王文君1,2, 董凌1,2,3, 高盛祥1,2,3(), 余正涛1,2,3   

1. 1.昆明理工大学 信息工程与自动化学院,昆明 650500
    2.云南省人工智能重点实验室(昆明理工大学),昆明 650500
    3.云南省媒体融合重点实验室,昆明 650228
  • 收稿日期:2023-08-25 修回日期:2023-09-20 接受日期:2023-10-08 发布日期:2024-08-22 出版日期:2024-08-10
  • 通讯作者: 高盛祥
  • 作者简介:莫尚斌(1996—),男,四川西昌人,硕士研究生,主要研究方向:语音增强、语音识别
    王文君(1988—),男,云南昆明人,博士研究生,主要研究方向:语音识别、自然语言处理
    董凌(1984—),男,云南大理人,讲师,博士研究生,主要研究方向:语音识别、自然语言处理
    高盛祥(1977—),女,云南洱源人,教授,博士,CCF会员,主要研究方向:自然语言处理、机器翻译、语音识别、语音合成 gaoshengxiang.yn@foxmail.com
    余正涛(1970—),男,云南曲靖人,教授,博士,CCF会员,主要研究方向:自然语言处理、机器翻译、信息检索。
  • 基金资助:
    国家自然科学基金资助项目(61972186);云南高新技术产业发展项目(201606);云南省重大科技专项计划项目(202103AA080015);云南省基础研究计划项目(202001AS070014);云南省科技人才与平台计划项目(202105AC160018);云南省媒体融合重点实验室开放课题(220225702)

Single-channel speech enhancement based on multi-channel information aggregation and collaborative decoding

Shangbin MO1,2, Wenjun WANG1,2, Ling DONG1,2,3, Shengxiang GAO1,2,3(), Zhengtao YU1,2,3   

1. 1.Faculty of Information Engineering and Automation,Kunming University of Science and Technology,Kunming Yunnan 650500,China
    2.Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology),Kunming Yunnan 650500,China
    3.Yunnan Provincial Key Laboratory of Media Integration,Kunming Yunnan 650228,China
  • Received:2023-08-25 Revised:2023-09-20 Accepted:2023-10-08 Online:2024-08-22 Published:2024-08-10
  • Contact: Shengxiang GAO
  • About author:MO Shangbin, born in 1996, M. S. candidate. His research interests include speech enhancement, speech recognition.
    WANG Wenjun, born in 1988, Ph. D. candidate. His research interests include speech recognition, natural language processing.
    DONG Ling, born in 1984, Ph. D. candidate, lecturer. His research interests include speech recognition, natural language processing.
    GAO Shengxiang, born in 1977, Ph. D., professor, CCF member. Her research interests include natural language processing, machine translation, speech recognition, speech synthesis.
    YU Zhengtao, born in 1970, Ph. D., professor, CCF member. His research interests include natural language processing, machine translation, information retrieval.
  • Supported by:
    National Natural Science Foundation of China(61972186);Yunnan High-tech Industry Development Project(201606);Major Science and Technology Special Program of Yunnan Province(202103AA080015);Basic Research Program of Yunnan Province(202001AS070014);Yunnan Science and Technology Talents and Platform Program(202105AC160018);Open Project of Yunnan Provincial Key Laboratory of Media Integration(220225702)

摘要:

为了改善基于卷积编解码架构的单通道语音增强网络对语音声学特征提取不充分、解码特征丢失严重的问题,提出一种基于多路信息聚合协同解码的单通道语音增强网络MIACD,通过双路编码器充分提取融入了语音自监督学习(SSL)表征的幅度谱和复数谱特征,由4层Conformer分别从时间和频率维度对提取特征建模,采用残差连接将双路编码器提取的语音幅度、复数特征引入三路信息聚合解码器,并利用所提通道-时频注意力(CTF-Attention)机制根据语音能量分布情况调节解码器中聚合信息,有效缓解解码时可用声学信息缺失严重的问题。在公开数据集Voice Bank DEMAND上的实验结果表明,与用于单通道语音增强的协作学习框架(GaGNet)相比,MIACD在客观评价指标宽带感知评估语音质量(WB-PESQ)上提升了5.1%,短时客观可懂度(STOI)达到96.7%,验证所提方法可充分利用语音信息重构信号,有效抑制噪声并提升语音可理解性。

关键词: 声学特征, 多路信息聚合, 双路编码器, 三路信息聚合解码器, 通道-时频注意力机制

Abstract:

To address insufficient acoustic feature extraction and severe loss of decoding features in single-channel speech enhancement networks based on the convolutional encoder-decoder architecture, a single-channel speech enhancement network based on Multi-channel Information Aggregation and Collaborative Decoding (MIACD) was proposed. A dual-channel encoder was used to fully extract the speech magnitude-spectrum and complex-spectrum features, enriched with Self-Supervised Learning (SSL) representations, and four Conformer layers modeled the extracted features along the time and frequency dimensions respectively. Through residual connections, the speech magnitude and complex features extracted by the dual-channel encoder were introduced into a three-channel information aggregation decoder, and a proposed Channel-Time-Frequency Attention (CTF-Attention) mechanism adjusted the aggregated information in the decoder according to the distribution of speech energy, effectively alleviating the severe loss of available acoustic information during decoding. Experimental results on the public dataset Voice Bank DEMAND show that, compared with GaGNet, the Glance-and-Gaze collaborative learning framework for single-channel speech enhancement, MIACD improves the objective metric Wide-Band Perceptual Evaluation of Speech Quality (WB-PESQ) by 5.1% and reaches 96.7% Short-Time Objective Intelligibility (STOI), validating that the proposed method makes full use of speech information to reconstruct the signal, effectively suppressing noise and improving speech intelligibility.
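The abstract only names the CTF-Attention mechanism without specifying it. As a rough illustrative sketch of the general idea of weighting a (channel, time, frequency) feature map along each of its three axes (the `ctf_attention` function, its pooling choices, and all shapes are assumptions for illustration, not the authors' design), energy-based gating could be written as:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ctf_attention(feat):
    """Illustrative channel-time-frequency attention over a (C, T, F) map.

    Channel weights come from global average pooling; time and frequency
    weights come from the energy pooled over the remaining axes, loosely
    mirroring the idea of adjusting aggregated decoder information by the
    speech energy distribution. Every detail here is an assumption.
    """
    C, T, F = feat.shape
    # Channel attention: pool over time and frequency, then gate channels.
    ch = softmax(feat.mean(axis=(1, 2)))           # (C,)
    out = feat * ch[:, None, None] * C             # rescale so weights average to 1
    # Time attention: pool energy over channels and frequency.
    t = softmax((out ** 2).mean(axis=(0, 2)))      # (T,)
    out = out * t[None, :, None] * T
    # Frequency attention: pool energy over channels and time.
    f = softmax((out ** 2).mean(axis=(0, 1)))      # (F,)
    return out * f[None, None, :] * F

x = np.random.default_rng(0).standard_normal((4, 8, 16))
y = ctf_attention(x)
print(y.shape)  # (4, 8, 16)
```

The map keeps its shape, so such a block can sit inside a decoder path without changing tensor dimensions; a learned version would replace the fixed pooling with trainable projections.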

Key words: acoustic feature, multi-channel information aggregation, dual-channel encoder, three-channel information aggregation decoder, channel-time-frequency attention mechanism
