Journal of Computer Applications, 2023, Vol. 43, Issue (11): 3607-3615. DOI: 10.11772/j.issn.1001-9081.2022111734
Special Issue: Multimedia computing and computer simulation
Yutang JIN¹, Yisong WANG¹, Lihui WANG¹, Pengli ZHAO²
Received: 2022-11-22
Revised: 2023-02-27
Accepted: 2023-02-28
Online: 2023-03-20
Published: 2023-11-10
Contact: Yisong WANG
About author: JIN Yutang, born in 1999, from Anshun, Guizhou, M. S. candidate. His research interests include digital signal processing, speech enhancement, and signal denoising.
Yutang JIN, Yisong WANG, Lihui WANG, Pengli ZHAO. Speech enhancement algorithm based on multi-scale ladder-type time-frequency Conformer GAN[J]. Journal of Computer Applications, 2023, 43(11): 3607-3615.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2022111734
Algorithm | CSIG | CBAK | COVL | PESQ | STOI |
---|---|---|---|---|---|
Noisy speech | 3.35 | 2.44 | 2.63 | 1.97 | 0.92 |
Wiener | 3.23 | 2.68 | 2.67 | 2.22 | — |
SEGAN | 3.48 | 2.94 | 2.80 | 2.16 | 0.92 |
HiFiGAN | 4.18 | 2.55 | 3.51 | 2.84 | 0.94 |
MetricGAN | 3.99 | 3.18 | 3.42 | 2.86 | — |
PHASEN | 4.21 | 3.55 | 3.62 | 2.99 | — |
DVUGAN | 4.24 | 3.53 | 3.57 | 2.92 | 0.95 |
TSTNN | 4.10 | 3.77 | 3.52 | 2.96 | 0.95 |
MetricGAN+ | 4.14 | 3.16 | 3.64 | 3.15 | — |
DB-AIAT | 4.61 | 3.75 | 3.96 | 3.31 | 0.96 |
DPT-FSNet | 4.58 | 3.72 | 4.00 | 3.33 | 0.96 |
CMGAN | 4.63 | 3.79 | 4.05 | 3.41 | 0.96 |
MSLTF-CMGAN | 4.67 | 3.86 | 4.03 | 3.35 | 0.96 |
Tab. 1 Performance evaluation of different algorithms on the VoiceBank+DEMAND dataset
Method | CSIG | CBAK | COVL | PESQ | STOI |
---|---|---|---|---|---|
Parallel-Conformer | 4.53 | 3.77 | 3.92 | 3.29 | 0.96 |
Mask Decoder | 4.45 | 3.72 | 3.87 | 3.23 | 0.96 |
Complex Decoder | 4.62 | 3.79 | 4.01 | 3.27 | 0.96 |
Without Downsample | 4.36 | 3.53 | 3.78 | 3.17 | 0.95 |
Without Discriminator | 4.41 | 3.81 | 3.93 | 3.24 | 0.96 |
MSLTF-CMGAN | 4.67 | 3.86 | 4.03 | 3.35 | 0.96 |
Tab. 2 Ablation study results