Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (11): 3607-3615. DOI: 10.11772/j.issn.1001-9081.2022111734
• Multimedia computing and computer simulation •
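Received: 2022-11-22
Revised: 2023-02-27
Accepted: 2023-02-28
Online: 2023-03-20
Published: 2023-11-10
Corresponding author: Yisong WANG
About the author: JIN Yutang (born 1999), male, from Anshun, Guizhou, M.S. candidate. His research interests include digital signal processing, speech enhancement, and signal denoising.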
Yutang JIN, Yisong WANG, Lihui WANG, Pengli ZHAO
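Abstract:
Frequency-domain speech enhancement algorithms produce artificial artifacts caused by phase disorder, which limits denoising performance and degrades speech quality. To address this problem, a speech enhancement algorithm based on a Multi-Scale Ladder-type Time-Frequency Conformer Generative Adversarial Network (MSLTF-CMGAN) is proposed. Taking the real part, imaginary part, and magnitude spectrum of the speech spectrogram as input, the generator first uses time-frequency Conformers at multiple scales to learn global and local feature dependencies in the time and frequency domains. Second, the Mask Decoder branch learns a magnitude mask, while the Complex Decoder branch directly learns the clean spectrogram; the outputs of the two decoder branches are fused to obtain the reconstructed speech. Finally, a metric discriminator estimates the evaluation metric score of the speech, and minimax training drives the generator to produce high-quality speech. The proposed algorithm was compared with a range of speech enhancement models on the public VoiceBank+Demand dataset, using subjective Mean Opinion Score (MOS) measures and objective metrics. The results show that its MOS for signal distortion (CSIG) and MOS for background noise distortion (CBAK) are 0.04 and 0.07 higher, respectively, than those of CMGAN (Conformer-based Metric Generative Adversarial Network), the current state-of-the-art method. Although its MOS for overall speech quality (COVL) and Perceptual Evaluation of Speech Quality (PESQ) score are slightly lower than those of CMGAN, it leads the other compared models on several subjective and objective speech quality measures.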
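The input representation and the two-branch fusion described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the fusion rule, so the code assumes that the masked magnitude is recombined with the noisy phase and the Complex Decoder output is then added. The function names `input_features` and `fuse_branches` are hypothetical.

```python
import cmath

def input_features(spec):
    """Split complex time-frequency bins into the three input channels
    named in the abstract: real part, imaginary part, magnitude."""
    real = [z.real for z in spec]
    imag = [z.imag for z in spec]
    mag = [abs(z) for z in spec]
    return real, imag, mag

def fuse_branches(noisy_spec, mask, complex_out):
    """Fuse the two decoder outputs bin by bin.

    ASSUMPTION (not stated in the abstract): the magnitude mask scales the
    noisy magnitude, the noisy phase is reused, and the complex-branch
    output is added to the result.
    """
    fused = []
    for y, m, c in zip(noisy_spec, mask, complex_out):
        masked = cmath.rect(m * abs(y), cmath.phase(y))  # masked magnitude, noisy phase
        fused.append(masked + c)                         # add complex-branch estimate
    return fused

# Toy spectrogram with two bins (made-up values, for illustration only).
noisy = [3 + 4j, 1 - 1j]
print(input_features(noisy))
print(fuse_branches(noisy, [0.5, 1.0], [0j, 0.1 + 0.1j]))
```

In a real system the lists would be 2-D tensors over time and frequency, and the fused spectrogram would be passed through an inverse short-time Fourier transform to recover the waveform.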
Yutang JIN, Yisong WANG, Lihui WANG, Pengli ZHAO. Speech enhancement algorithm based on multi-scale ladder-type time-frequency Conformer GAN[J]. Journal of Computer Applications, 2023, 43(11): 3607-3615.
Algorithm | CSIG | CBAK | COVL | PESQ | STOI |
---|---|---|---|---|---|
Noisy | 3.35 | 2.44 | 2.63 | 1.97 | 0.92 |
Wiener | 3.23 | 2.68 | 2.67 | 2.22 | — |
SEGAN | 3.48 | 2.94 | 2.80 | 2.16 | 0.92 |
HiFiGAN | 4.18 | 2.55 | 3.51 | 2.84 | 0.94 |
MetricGAN | 3.99 | 3.18 | 3.42 | 2.86 | — |
PHASEN | 4.21 | 3.55 | 3.62 | 2.99 | — |
DVUGAN | 4.24 | 3.53 | 3.57 | 2.92 | 0.95 |
TSTNN | 4.10 | 3.77 | 3.52 | 2.96 | 0.95 |
MetricGAN+ | 4.14 | 3.16 | 3.64 | 3.15 | — |
DB-AIAT | 4.61 | 3.75 | 3.96 | 3.31 | 0.96 |
DPT-FSNet | 4.58 | 3.72 | 4.00 | 3.33 | 0.96 |
CMGAN | 4.63 | 3.79 | 4.05 | 3.41 | 0.96 |
MSLTF-CMGAN | 4.67 | 3.86 | 4.03 | 3.35 | 0.96 |
Tab. 1 Performance evaluation of different algorithms on VoiceBank+Demand dataset
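The gains and losses relative to CMGAN can be checked directly from the table. The short script below is only an arithmetic check on the two relevant rows; the helper name `metric_deltas` is hypothetical.

```python
# Scores copied from Tab. 1 (VoiceBank+Demand).
METRICS = ("CSIG", "CBAK", "COVL", "PESQ", "STOI")
CMGAN = (4.63, 3.79, 4.05, 3.41, 0.96)
MSLTF_CMGAN = (4.67, 3.86, 4.03, 3.35, 0.96)

def metric_deltas(proposed, baseline):
    """Per-metric difference (proposed minus baseline), rounded to 2 decimals."""
    return {name: round(p - b, 2) for name, p, b in zip(METRICS, proposed, baseline)}

deltas = metric_deltas(MSLTF_CMGAN, CMGAN)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```

The CSIG and CBAK differences come out to +0.04 and +0.07, while COVL and PESQ are lower by 0.02 and 0.06 and STOI is unchanged.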
Method | CSIG | CBAK | COVL | PESQ | STOI |
---|---|---|---|---|---|
Parallel-Conformer | 4.53 | 3.77 | 3.92 | 3.29 | 0.96 |
Mask Decoder | 4.45 | 3.72 | 3.87 | 3.23 | 0.96 |
Complex Decoder | 4.62 | 3.79 | 4.01 | 3.27 | 0.96 |
Without Downsample | 4.36 | 3.53 | 3.78 | 3.17 | 0.95 |
Without Discriminator | 4.41 | 3.81 | 3.93 | 3.24 | 0.96 |
MSLTF-CMGAN | 4.67 | 3.86 | 4.03 | 3.35 | 0.96 |
Tab. 2 Ablation study results
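One way to read the ablation table is to compute how far each variant falls below the full model on a single metric. The sketch below does this for PESQ, using the row labels exactly as they appear in the table; the helper name `pesq_drops` is hypothetical.

```python
# PESQ scores copied from Tab. 2.
ABLATION_PESQ = {
    "Parallel-Conformer": 3.29,
    "Mask Decoder": 3.23,
    "Complex Decoder": 3.27,
    "Without Downsample": 3.17,
    "Without Discriminator": 3.24,
}
FULL_PESQ = 3.35  # MSLTF-CMGAN row

def pesq_drops(variants, full):
    """PESQ lost by each variant relative to the full model, rounded to 2 decimals."""
    return {name: round(full - score, 2) for name, score in variants.items()}

drops = pesq_drops(ABLATION_PESQ, FULL_PESQ)
worst = max(drops, key=drops.get)  # variant with the largest PESQ drop
print(drops)
print("Largest drop:", worst)
```

On these numbers the "Without Downsample" variant loses the most PESQ (0.18), and it is also the only row whose STOI falls to 0.95.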