Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (11): 3607-3615. DOI: 10.11772/j.issn.1001-9081.2022111734

• Multimedia computing and computer simulation •

Speech enhancement algorithm based on multi-scale ladder-type time-frequency Conformer GAN

JIN Yutang1, WANG Yisong1, WANG Lihui1, ZHAO Pengli2

  1. State Key Laboratory of Public Big Data (Guizhou University), Guiyang 550025
    2. Xuchang Electric Vocational College, Xuchang, Henan 461000
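  • Received: 2022-11-22 Revised: 2023-02-27 Accepted: 2023-02-28 Online: 2023-03-20 Published: 2023-11-10
  • Corresponding author: WANG Yisong
  • About the authors: JIN Yutang (born 1999), male, from Anshun, Guizhou, M.S. candidate. Research interests: digital signal processing, speech enhancement, signal denoising.
    WANG Yisong (born 1975), male, from Sinan, Guizhou, professor, Ph.D., CCF member. Research interests: knowledge representation and reasoning, answer set programming, artificial intelligence, machine learning. yswang@gzu.edu.cn
    WANG Lihui (born 1982), female, from Harbin, Heilongjiang, professor, Ph.D. Research interests: deep learning, machine learning, medical imaging, medical image processing, computer vision.
    ZHAO Pengli (born 1992), female, from Xuchang, Henan, teaching assistant, M.S. Research interests: databases, software engineering.
  • Funding:
    National Natural Science Foundation of China (U1836205)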

Speech enhancement algorithm based on multi-scale ladder-type time-frequency Conformer GAN

Yutang JIN1, Yisong WANG1, Lihui WANG1, Pengli ZHAO2

  1. State Key Laboratory of Public Big Data (Guizhou University), Guiyang Guizhou 550025, China
    2. Xuchang Electric Vocational College, Xuchang Henan 461000, China
  • Received: 2022-11-22 Revised: 2023-02-27 Accepted: 2023-02-28 Online: 2023-03-20 Published: 2023-11-10
  • Contact: Yisong WANG
  • About author: JIN Yutang, born in 1999, M. S. candidate. His research interests include digital signal processing, speech enhancement, and signal denoising.
    WANG Yisong, born in 1975, Ph. D., professor. His research interests include knowledge representation and reasoning, answer set programming, artificial intelligence, and machine learning.
    WANG Lihui, born in 1982, Ph. D., professor. Her research interests include deep learning, machine learning, medical imaging, medical image processing, and computer vision.
    ZHAO Pengli, born in 1992, M. S., teaching assistant. Her research interests include databases and software engineering.
  • Supported by:
    National Natural Science Foundation of China (U1836205)

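Abstract:

To address the artifacts that phase disorder introduces in frequency-domain speech enhancement algorithms, which limit denoising performance and degrade speech quality, a speech enhancement algorithm based on a multi-scale ladder-type time-frequency Conformer generative adversarial network (MSLTF-CMGAN) is proposed. Taking the real part, imaginary part, and magnitude spectrum of the speech spectrogram as input, the generator first uses time-frequency Conformers at multiple scales to learn global and local feature dependencies in the time and frequency domains. Next, a Mask Decoder branch learns a magnitude mask while a Complex Decoder branch directly learns the clean spectrogram; fusing the outputs of the two decoder branches yields the reconstructed speech. Finally, a metric discriminator judges the evaluation-metric scores of the speech, and minimax training drives the generator to produce high-quality speech. The algorithm was compared with a range of speech enhancement models on the public VoiceBank+Demand dataset using the subjective Mean Opinion Score (MOS) measures and objective evaluation metrics. The results show that its MOS signal distortion (CSIG) and MOS background-noise distortion (CBAK) scores are 0.04 and 0.07 higher, respectively, than those of CMGAN (a Conformer-based metric generative adversarial network for speech enhancement), the current state-of-the-art method. Although its MOS overall speech quality (COVL) and Perceptual Evaluation of Speech Quality (PESQ) scores are slightly lower than those of CMGAN, it leads the other compared models on several subjective and objective speech quality measures.

Keywords: speech enhancement, multi-scale, Conformer, generative adversarial network, metric discriminator, deep learning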

Abstract:

To address the artifacts caused by phase disorder in frequency-domain speech enhancement algorithms, which limit denoising performance and degrade speech quality, a speech enhancement algorithm based on Multi-Scale Ladder-type Time-Frequency Conformer Generative Adversarial Network (MSLTF-CMGAN) was proposed. Taking the real part, imaginary part and magnitude spectrum of the speech spectrogram as input, the generator first learned the global and local feature dependencies in the time and frequency domains by using time-frequency Conformers at multiple scales. Secondly, the Mask Decoder branch was used to learn the magnitude mask, the Complex Decoder branch was used to learn the clean spectrogram directly, and the outputs of the two decoder branches were fused to obtain the reconstructed speech. Finally, the metric discriminator was used to judge the scores of speech evaluation metrics, and the generator was driven to produce high-quality speech through minimax training. Comparison experiments with various types of speech enhancement models were conducted on the public dataset VoiceBank+Demand by subjective evaluation Mean Opinion Score (MOS) and objective evaluation metrics. Experimental results show that, compared with the current state-of-the-art speech enhancement method CMGAN (Conformer-based MetricGAN), MSLTF-CMGAN improves the MOS prediction of signal distortion (CSIG) and the MOS prediction of the intrusiveness of background noise (CBAK) by 0.04 and 0.07 respectively. Although its Perceptual Evaluation of Speech Quality (PESQ) and MOS prediction of the overall effect (COVL) are slightly lower than those of CMGAN, it still outperforms the other comparison models in several subjective and objective speech evaluation metrics.

Key words: speech enhancement, multi-scale, Conformer, Generative Adversarial Network (GAN), metric discriminator, deep learning
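
The data flow summarized in the abstract (three-channel spectrogram input, then fusion of the Mask Decoder and Complex Decoder outputs) can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration: the function names `stft`, `generator_input` and `fuse`, the window and hop sizes, and in particular the fusion rule (masked magnitude recombined with the noisy phase, plus the complex branch output), which follows a common CMGAN-style formulation and is not confirmed by the abstract. The two decoder outputs are stand-in arrays, not a trained network.

```python
import numpy as np

def stft(signal, n_fft=400, hop=100):
    """Naive Hann-windowed STFT; returns a (frames, n_fft // 2 + 1) complex array."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)

def generator_input(spec):
    """Stack magnitude, real part and imaginary part as three input channels."""
    return np.stack([np.abs(spec), spec.real, spec.imag], axis=0)

def fuse(noisy_spec, mask, complex_out):
    """Assumed fusion rule (not taken from the abstract):
    masked magnitude with the noisy phase, plus the complex branch output."""
    phase = np.angle(noisy_spec)
    masked = mask * np.abs(noisy_spec) * np.exp(1j * phase)
    return masked + complex_out

# Stand-in data: random "noisy speech" and trivial decoder outputs.
rng = np.random.default_rng(0)
noisy = rng.standard_normal(1600)
spec = stft(noisy)
features = generator_input(spec)                    # channels x frames x bins
mask = np.ones(spec.shape)                          # hypothetical Mask Decoder output
complex_out = np.zeros(spec.shape, dtype=complex)   # hypothetical Complex Decoder output
enhanced = fuse(spec, mask, complex_out)
```

With an all-ones mask and a zero complex output, the fused spectrogram equals the noisy input, which is a convenient sanity check before plugging in real decoder outputs.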

CLC number: