Speech enhancement algorithm based on multi-scale ladder-type time-frequency Conformer GAN

doi:10.11772/j.issn.1001-9081.2022111734

Journal of Computer Applications, 2023, Vol. 43, Issue 11: 3607-3615.

• Multimedia computing and computer simulation •


Yutang JIN¹, Yisong WANG¹, Lihui WANG¹, Pengli ZHAO²

1. State Key Laboratory of Public Big Data (Guizhou University), Guiyang, Guizhou 550025, China
2. Xuchang Electric Vocational College, Xuchang, Henan 461000, China

Received: 2022-11-22 Revised: 2023-02-27 Accepted: 2023-02-28 Online: 2023-03-20 Published: 2023-11-10
Contact: Yisong WANG (yswang@gzu.edu.cn)
About the authors:
JIN Yutang, born in 1999 in Anshun, Guizhou, M. S. candidate. His research interests include digital signal processing, speech enhancement and signal denoising.
WANG Yisong, born in 1975 in Sinan, Guizhou, Ph. D., professor, CCF member. His research interests include knowledge representation and reasoning, answer set programming, artificial intelligence and machine learning.
WANG Lihui, born in 1982 in Harbin, Heilongjiang, Ph. D., professor. Her research interests include deep learning, machine learning, medical imaging, medical image processing and computer vision.
ZHAO Pengli, born in 1992 in Xuchang, Henan, M. S., teaching assistant. Her research interests include databases and software engineering.
Supported by:
National Natural Science Foundation of China (U1836205)

Abstract:

Frequency-domain speech enhancement algorithms suffer from artificial artifacts caused by phase disorder, which limit denoising performance and degrade speech quality. To address this problem, a speech enhancement algorithm based on a Multi-Scale Ladder-type Time-Frequency Conformer Generative Adversarial Network (MSLTF-CMGAN) was proposed. Taking the real part, imaginary part and magnitude spectrum of the speech spectrogram as input, the generator first used time-frequency Conformers at multiple scales to learn global and local feature dependencies in the time and frequency domains. Secondly, the Mask Decoder branch learned a magnitude mask while the Complex Decoder branch directly learned the clean spectrogram, and the outputs of the two decoder branches were fused to obtain the reconstructed speech. Finally, a metric discriminator was used to predict the evaluation metric scores of the speech, and minimax training drove the generator to produce high-quality speech. The proposed algorithm was compared with various speech enhancement models on the public dataset VoiceBank+Demand using subjective evaluation Mean Opinion Score (MOS) measures and objective evaluation metrics. Experimental results show that, compared with the current state-of-the-art method CMGAN (Conformer-based MetricGAN), MSLTF-CMGAN improves the MOS prediction of signal distortion (CSIG) and the MOS prediction of background noise intrusiveness (CBAK) by 0.04 and 0.07 respectively. Although its Perceptual Evaluation of Speech Quality (PESQ) and MOS prediction of the overall effect (COVL) are slightly lower than those of CMGAN, it still leads the other comparison models on several subjective and objective speech quality metrics.

Key words: speech enhancement, multi-scale, Conformer, Generative Adversarial Network (GAN), metric discriminator, deep learning
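
The input construction, branch fusion and adversarial objective summarized in the abstract can be illustrated with a minimal sketch. The abstract does not state the fusion rule or the loss formulas, so the code assumes a CMGAN-style fusion (the mask scales the noisy magnitude, the noisy phase is reused, and the Complex Decoder output is added) and MetricGAN-style squared-error losses. The names `stack_input`, `fuse_branches`, `discriminator_loss` and `generator_adv_loss` are hypothetical, and the decoder outputs are hand-written numbers rather than network predictions.

```python
import cmath


def stack_input(spec):
    """Split a complex spectrogram (frames x bins) into real, imaginary and magnitude channels."""
    real = [[z.real for z in frame] for frame in spec]
    imag = [[z.imag for z in frame] for frame in spec]
    mag = [[abs(z) for z in frame] for frame in spec]
    return real, imag, mag


def fuse_branches(noisy_spec, mask, complex_out):
    """Assumed fusion rule: mask * |Y| with the noisy phase, plus the complex-branch output."""
    fused = []
    for frame, mask_row, complex_row in zip(noisy_spec, mask, complex_out):
        row = []
        for y, m, c in zip(frame, mask_row, complex_row):
            masked = cmath.rect(m * abs(y), cmath.phase(y))  # masked magnitude, noisy phase
            row.append(masked + c)
        fused.append(row)
    return fused


def discriminator_loss(d_clean, d_enhanced, metric_score):
    """Assumed MetricGAN-style loss: D should output 1 for clean speech and the
    normalized metric score (in [0, 1]) for enhanced speech."""
    return (d_clean - 1.0) ** 2 + (d_enhanced - metric_score) ** 2


def generator_adv_loss(d_enhanced):
    """Assumed generator term: push the predicted score of enhanced speech toward 1."""
    return (d_enhanced - 1.0) ** 2


# Toy example: one frame with two frequency bins.
noisy = [[3 + 4j, 1j]]
real, imag, mag = stack_input(noisy)
fused = fuse_branches(noisy, mask=[[0.5, 1.0]], complex_out=[[0j, 1 + 0j]])
```

Under this assumed rule the mask only rescales the noisy magnitude, while the complex branch is free to correct the phase, which is the problem the abstract attributes to frequency-domain methods.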

CLC number: TP391.9

Yutang JIN, Yisong WANG, Lihui WANG, Pengli ZHAO. Speech enhancement algorithm based on multi-scale ladder-type time-frequency Conformer GAN[J]. Journal of Computer Applications, 2023, 43(11): 3607-3615.

Figures and tables (8)

Fig. 1 Network structure of generator

Fig. 2 Network structures of MSLTFC and TFC modules

Fig. 3 Network structure of Conformer module

Fig. 4 Network structure of metric discriminator

Tab. 1 Performance evaluation of different algorithms on VoiceBank+Demand dataset

Algorithm	CSIG	CBAK	COVL	PESQ	STOI
Noisy	3.35	2.44	2.63	1.97	0.92
Wiener	3.23	2.68	2.67	2.22	—
SEGAN	3.48	2.94	2.80	2.16	0.92
HiFiGAN	4.18	2.55	3.51	2.84	0.94
MetricGAN	3.99	3.18	3.42	2.86	—
PHASEN	4.21	3.55	3.62	2.99	—
DVUGAN	4.24	3.53	3.57	2.92	0.95
TSTNN	4.10	3.77	3.52	2.96	0.95
MetricGAN+	4.14	3.16	3.64	3.15	—
DB-AIAT	4.61	3.75	3.96	3.31	0.96
DPT-FSNet	4.58	3.72	4.00	3.33	0.96
CMGAN	4.63	3.79	4.05	3.41	0.96
MSLTF-CMGAN	4.67	3.86	4.03	3.35	0.96
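
The gains over CMGAN quoted in the abstract can be checked directly against the last two rows of Tab. 1. The snippet below is only an arithmetic sketch; the scores are copied from the table, and the helper name `deltas` is hypothetical.

```python
# Scores copied from Tab. 1, in the order CSIG, CBAK, COVL, PESQ, STOI.
METRICS = ("CSIG", "CBAK", "COVL", "PESQ", "STOI")
CMGAN = (4.63, 3.79, 4.05, 3.41, 0.96)
MSLTF_CMGAN = (4.67, 3.86, 4.03, 3.35, 0.96)


def deltas(proposed, baseline):
    """Per-metric difference (proposed minus baseline), rounded to two decimals."""
    return {name: round(p - b, 2) for name, p, b in zip(METRICS, proposed, baseline)}


diff = deltas(MSLTF_CMGAN, CMGAN)
for name in METRICS:
    print(f"{name}: {diff[name]:+.2f}")
```

The differences are +0.04 (CSIG), +0.07 (CBAK), -0.02 (COVL), -0.06 (PESQ) and 0.00 (STOI), which matches the abstract's summary.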

Fig. 5 Visualization of spectrograms of enhanced speech signals obtained by different algorithms

Tab. 2 Ablation study results

Method	CSIG	CBAK	COVL	PESQ	STOI
Parallel-Conformer	4.53	3.77	3.92	3.29	0.96
Mask Decoder	4.45	3.72	3.87	3.23	0.96
Complex Decoder	4.62	3.79	4.01	3.27	0.96
Without Downsample	4.36	3.53	3.78	3.17	0.95
Without Discriminator	4.41	3.81	3.93	3.24	0.96
MSLTF-CMGAN	4.67	3.86	4.03	3.35	0.96
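
One way to read Tab. 2 is to ask which ablated variant falls furthest below the full model on each metric. The sketch below does that using only the table values; the helper name `largest_drop` is hypothetical, and the variant labels are taken verbatim from the table.

```python
# Scores copied from Tab. 2, in the order CSIG, CBAK, COVL, PESQ, STOI.
METRICS = ("CSIG", "CBAK", "COVL", "PESQ", "STOI")
ABLATION = {
    "Parallel-Conformer": (4.53, 3.77, 3.92, 3.29, 0.96),
    "Mask Decoder": (4.45, 3.72, 3.87, 3.23, 0.96),
    "Complex Decoder": (4.62, 3.79, 4.01, 3.27, 0.96),
    "Without Downsample": (4.36, 3.53, 3.78, 3.17, 0.95),
    "Without Discriminator": (4.41, 3.81, 3.93, 3.24, 0.96),
}
FULL_MODEL = (4.67, 3.86, 4.03, 3.35, 0.96)  # MSLTF-CMGAN row


def largest_drop(metric_index):
    """Return the variant with the lowest score on one metric and its gap to the full model."""
    name = min(ABLATION, key=lambda variant: ABLATION[variant][metric_index])
    gap = round(FULL_MODEL[metric_index] - ABLATION[name][metric_index], 2)
    return name, gap


worst = {metric: largest_drop(i) for i, metric in enumerate(METRICS)}
for metric, (name, gap) in worst.items():
    print(f"{metric}: {name} (-{gap:.2f})")
```

In these numbers the "Without Downsample" variant shows the largest drop on every metric, from 0.01 on STOI up to 0.33 on CBAK.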

Fig. 6 Visualization of spectrograms in the ablation study

