Speech enhancement algorithm based on multi-scale ladder-type time-frequency Conformer GAN

doi:10.11772/j.issn.1001-9081.2022111734

Journal of Computer Applications, 2023, Vol. 43, Issue 11: 3607-3615

Special Issue: Multimedia computing and computer simulation

Yutang JIN¹, Yisong WANG¹, Lihui WANG¹, Pengli ZHAO²

¹ State Key Laboratory of Public Big Data (Guizhou University), Guiyang, Guizhou 550025, China
² Xuchang Electric Vocational College, Xuchang, Henan 461000, China

Received: 2022-11-22; Revised: 2023-02-27; Accepted: 2023-02-28; Online: 2023-03-20; Published: 2023-11-10
Contact: Yisong WANG (yswang@gzu.edu.cn)
About the authors: JIN Yutang, born in 1999, from Anshun, Guizhou, M. S. candidate. His research interests include digital signal processing, speech enhancement, and signal denoising.
WANG Yisong, born in 1975, from Sinan, Guizhou, Ph. D., professor, CCF member. His research interests include knowledge representation and reasoning, answer set programming, artificial intelligence, and machine learning.
WANG Lihui, born in 1982, from Harbin, Heilongjiang, Ph. D., professor. Her research interests include deep learning, machine learning, medical imaging, medical image processing, and computer vision.
ZHAO Pengli, born in 1992, from Xuchang, Henan, M. S., teaching assistant. Her research interests include databases and software engineering.
Supported by: National Natural Science Foundation of China (U1836205)

Abstract

Frequency-domain speech enhancement algorithms suffer from artifacts caused by phase disorder, which limit denoising performance and degrade speech quality. To address this problem, a speech enhancement algorithm based on Multi-Scale Ladder-type Time-Frequency Conformer Generative Adversarial Network (MSLTF-CMGAN) was proposed. Taking the real part, imaginary part and magnitude spectrum of the speech spectrogram as input, the generator first learned the local and global feature dependencies in the time and frequency domains by using time-frequency Conformers at multiple scales. Secondly, the Mask Decoder branch was used to learn the magnitude mask, the Complex Decoder branch was used to learn the clean spectrogram directly, and the outputs of the two decoder branches were fused to obtain the reconstructed speech. Finally, the metric discriminator was used to estimate the scores of speech evaluation metrics, and the generator was driven to produce high-quality speech through minimax training. Comparison experiments with various types of speech enhancement models were conducted on the public dataset VoiceBank+Demand using the subjective evaluation Mean Opinion Score (MOS) and objective evaluation metrics. Experimental results show that, compared with the current state-of-the-art speech enhancement method CMGAN (Conformer-based MetricGAN), MSLTF-CMGAN improves the MOS prediction of signal distortion (CSIG) and the MOS prediction of background noise intrusiveness (CBAK) by 0.04 and 0.07 respectively. Although its Perceptual Evaluation of Speech Quality (PESQ) and MOS prediction of overall effect (COVL) are slightly lower than those of CMGAN, it still outperforms the other comparison models on several subjective and objective speech evaluation metrics.
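
The generator input and the two-branch fusion described in the abstract can be illustrated with a small standard-library sketch. The abstract does not state the fusion rule, so the code assumes one common choice: the mask scales the noisy magnitude, the noisy phase is reused, and the complex-branch output is added as a correction. The function names, the toy 2x2 spectrogram and the stand-in decoder outputs are all hypothetical and are not taken from the paper.

```python
import cmath

def make_generator_input(spec):
    """Split a complex spectrogram into the three input planes named in the
    abstract: real part, imaginary part and magnitude.

    spec: list of frequency-bin rows, each a list of complex time frames.
    """
    real = [[c.real for c in row] for row in spec]
    imag = [[c.imag for c in row] for row in spec]
    mag = [[abs(c) for c in row] for row in spec]
    return real, imag, mag

def fuse_branches(noisy_spec, mask, residual):
    """Assumed fusion rule (not specified in the abstract).

    mask:     real-valued magnitude mask, same shape as noisy_spec.
    residual: complex output of the complex branch, same shape.
    """
    fused = []
    for noisy_row, mask_row, res_row in zip(noisy_spec, mask, residual):
        fused_row = []
        for c, m, r in zip(noisy_row, mask_row, res_row):
            # Scale the noisy magnitude and keep the noisy phase.
            masked = cmath.rect(m * abs(c), cmath.phase(c))
            # Add the complex-branch estimate as a correction term.
            fused_row.append(masked + r)
        fused.append(fused_row)
    return fused

# Toy 2x2 "spectrogram" with stand-in decoder outputs (hypothetical values).
noisy = [[3 + 4j, 1 - 1j], [2j, -2 + 0j]]
mask = [[0.5, 0.5], [0.5, 0.5]]
residual = [[0j, 0j], [0j, 0j]]

real, imag, mag = make_generator_input(noisy)
enhanced = fuse_branches(noisy, mask, residual)
print(mag[0][0], enhanced[0][0])
```

With a constant mask of 0.5 and a zero residual, every bin of `enhanced` is half the corresponding noisy bin (up to floating-point rounding), which is a quick sanity check that the phase is preserved by the masking path.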

Key words: speech enhancement, multi-scale, Conformer, Generative Adversarial Network (GAN), metric discriminator, deep learning

CLC Number: TP391.9

Yutang JIN, Yisong WANG, Lihui WANG, Pengli ZHAO. Speech enhancement algorithm based on multi-scale ladder-type time-frequency Conformer GAN[J]. Journal of Computer Applications, 2023, 43(11): 3607-3615.

Algorithm	CSIG	CBAK	COVL	PESQ	STOI
Noisy	3.35	2.44	2.63	1.97	0.92
Wiener	3.23	2.68	2.67	2.22	—
SEGAN	3.48	2.94	2.80	2.16	0.92
HiFiGAN	4.18	2.55	3.51	2.84	0.94
MetricGAN	3.99	3.18	3.42	2.86	—
PHASEN	4.21	3.55	3.62	2.99	—
DVUGAN	4.24	3.53	3.57	2.92	0.95
TSTNN	4.10	3.77	3.52	2.96	0.95
MetricGAN+	4.14	3.16	3.64	3.15	—
DB-AIAT	4.61	3.75	3.96	3.31	0.96
DPT-FSNet	4.58	3.72	4.00	3.33	0.96
CMGAN	4.63	3.79	4.05	3.41	0.96
MSLTF-CMGAN	4.67	3.86	4.03	3.35	0.96
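
The gains over CMGAN quoted in the abstract can be checked against the last two rows of the table above. The sketch below only redoes that subtraction. It uses `Decimal` so the two-decimal scores are not distorted by binary floating point, and the variable names are illustrative.

```python
from decimal import Decimal

METRICS = ("CSIG", "CBAK", "COVL", "PESQ", "STOI")

# Scores copied from the CMGAN and MSLTF-CMGAN rows of the comparison table.
cmgan = dict(zip(METRICS, map(Decimal, ("4.63", "3.79", "4.05", "3.41", "0.96"))))
msltf = dict(zip(METRICS, map(Decimal, ("4.67", "3.86", "4.03", "3.35", "0.96"))))

# Positive values mean MSLTF-CMGAN scores higher than CMGAN.
deltas = {m: msltf[m] - cmgan[m] for m in METRICS}

for m in METRICS:
    print(f"{m}: {deltas[m]:+}")
```

The differences are +0.04 for CSIG and +0.07 for CBAK, matching the abstract, while COVL and PESQ are 0.02 and 0.06 lower and STOI is unchanged.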

Method	CSIG	CBAK	COVL	PESQ	STOI
Parallel-Conformer	4.53	3.77	3.92	3.29	0.96
Mask Decoder	4.45	3.72	3.87	3.23	0.96
Complex Decoder	4.62	3.79	4.01	3.27	0.96
Without Downsample	4.36	3.53	3.78	3.17	0.95
Without Discriminator	4.41	3.81	3.93	3.24	0.96
MSLTF-CMGAN	4.67	3.86	4.03	3.35	0.96
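
One way to read the ablation table above is to rank the variants by how far their PESQ falls below the full model. The sketch below does only that arithmetic on the tabulated values. It makes no claim about what each variant changes internally, and the variable names are illustrative.

```python
from decimal import Decimal

# PESQ column of the ablation table; the full model is the last row.
full_pesq = Decimal("3.35")
ablation_pesq = {
    "Parallel-Conformer": "3.29",
    "Mask Decoder": "3.23",
    "Complex Decoder": "3.27",
    "Without Downsample": "3.17",
    "Without Discriminator": "3.24",
}

# PESQ lost by each variant relative to the full MSLTF-CMGAN.
drops = {name: full_pesq - Decimal(score) for name, score in ablation_pesq.items()}

# Largest drop first.
ranked = sorted(drops.items(), key=lambda item: item[1], reverse=True)

for name, drop in ranked:
    print(f"{name}: -{drop}")
```

On these numbers the variant without downsampling loses the most PESQ (0.18) and the parallel-Conformer variant the least (0.06).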
