Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (8): 2611-2617.DOI: 10.11772/j.issn.1001-9081.2023081141
• Multimedia computing and computer simulation •
Shangbin MO 1,2, Wenjun WANG 1,2, Ling DONG 1,2,3, Shengxiang GAO 1,2,3, Zhengtao YU 1,2,3
Received: 2023-08-25
Revised: 2023-09-20
Accepted: 2023-10-08
Online: 2024-08-22
Published: 2024-08-10
Contact: Shengxiang GAO
About author: MO Shangbin, born in 1996 in Xichang, Sichuan, M. S. candidate. His research interests include speech enhancement and speech recognition.
Shangbin MO, Wenjun WANG, Ling DONG, Shengxiang GAO, Zhengtao YU. Single-channel speech enhancement based on multi-channel information aggregation and collaborative decoding[J]. Journal of Computer Applications, 2024, 44(8): 2611-2617.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2023081141
| Model | Feature type | WB-PESQ | STOI/% | CSIG | CBAK | COVL |
| --- | --- | --- | --- | --- | --- | --- |
| Noisy | - | 1.97 | 92.1 | 3.35 | 2.44 | 2.63 |
| Wave U-Net | Waveform | 2.40 | - | 3.52 | 3.24 | 2.96 |
| TSTNN | Waveform | 2.96 | 95.0 | 4.10 | 3.77 | 3.52 |
| DEMUCS | Waveform | 3.07 | 95.0 | 4.31 | 3.40 | 3.63 |
| MetricGAN | Magnitude spectrum | 2.86 | - | 3.99 | 3.18 | 3.42 |
| CRGAN | Magnitude spectrum | 2.92 | 94.0 | 4.16 | 3.24 | 3.54 |
| DCCRN | Complex spectrum | 2.68 | 93.7 | 3.88 | 3.18 | 3.27 |
| DCCARN | Complex spectrum | 2.83 | - | 3.91 | 3.60 | 3.43 |
| GaGNet | Magnitude + complex spectrum | 2.94 | 94.0 | 4.26 | 3.45 | 3.59 |
| MIACD | Magnitude + complex spectrum | 3.09 | 96.7 | 4.48 | 3.65 | 3.79 |
Tab. 1 Evaluation scores of different models for speech enhancement on Voice Bank-DEMAND dataset
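The scores in Tab. 1 are reference-based objective metrics (WB-PESQ, STOI, CSIG, CBAK, COVL), computed by comparing each enhanced signal against its clean reference with standard evaluation tools. As a minimal illustration of how reference-based scoring works in principle — SNR is not one of the metrics reported above, and this sketch is only an analogy — the same clean-vs-degraded comparison can be written in plain Python:

```python
import math

def snr_db(clean, degraded):
    """Signal-to-noise ratio in dB of a degraded signal against a clean reference."""
    signal_power = sum(s * s for s in clean)
    noise_power = sum((d - s) ** 2 for s, d in zip(clean, degraded))
    return 10 * math.log10(signal_power / noise_power)

# Toy example: a 440 Hz tone sampled at 16 kHz, with a small constant
# offset standing in for "noise" (44 full periods, so signal power is exact).
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
degraded = [s + 0.1 for s in clean]
print(f"{snr_db(clean, degraded):.1f} dB")  # → 17.0 dB
```

Perceptual metrics like WB-PESQ and STOI replace the raw sample-wise error with auditory-model and intelligibility-weighted comparisons, which is why they track listening quality far better than SNR does.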
| Model | WB-PESQ | STOI/% | CSIG | CBAK | COVL |
| --- | --- | --- | --- | --- | --- |
| NO SSL | 2.91 | 96.0 | 4.38 | 3.30 | 3.68 |
| SSL+EF | 2.99 | 96.0 | 4.36 | 2.66 | 3.70 |
| SSL+PF | 2.98 | 95.9 | 4.37 | 2.67 | 3.71 |
| SSL | 3.09 | 96.7 | 4.48 | 3.65 | 3.79 |
Tab. 2 Influence of incorporating speech self-supervised learning representations on overall network performance
| Model | WB-PESQ | STOI/% | CSIG | CBAK | COVL |
| --- | --- | --- | --- | --- | --- |
| MIACD (no central layer) | 2.87 | 95.9 | 4.29 | 3.29 | 3.61 |
| MIACD+LSTM | 2.98 | 96.3 | 4.35 | 3.44 | 3.70 |
| MIACD+Transformer | 2.99 | 96.3 | 4.33 | 3.49 | 3.69 |
| MIACD+Conformer (proposed) | 3.09 | 96.7 | 4.48 | 3.43 | 3.79 |
Tab. 3 Influence of different models acting as intermediate layers on overall network performance on Voice Bank-DEMAND dataset
| Experiment No. | WB-PESQ | STOI/% | CSIG | CBAK | COVL |
| --- | --- | --- | --- | --- | --- |
| 1 | 2.43 | 92.3 | 3.96 | 2.56 | 3.23 |
| 2 | 2.63 | 93.9 | 4.26 | 2.87 | 3.54 |
| 3 | 2.86 | 95.9 | 4.27 | 2.69 | 3.58 |
| 4 | 2.96 | 96.3 | 4.33 | 3.38 | 3.68 |
| 5 | 2.77 | 95.8 | 4.24 | 3.07 | 3.52 |
| 6 | 2.86 | 95.9 | 4.24 | 3.71 | 3.58 |
| 7 | 2.93 | 96.1 | 4.34 | 3.32 | 3.66 |
| 8 | 2.97 | 96.4 | 4.36 | 3.63 | 3.70 |
| 9 | 3.03 | 96.6 | 4.41 | 3.64 | 3.73 |
| 10 | 3.09 | 96.7 | 4.48 | 3.65 | 3.79 |
Tab. 4 Ablation experiment results on Voice Bank-DEMAND dataset
[1] GAO C F, CHENG G F, ZHANG P Y. Consistency self-supervised learning method for robust automatic speech recognition[J]. Acta Acustica, 2023, 48(3): 578-587.
[2] ZHONG X, DAI Y, DAI Y, et al. Study on processing of wavelet speech denoising in speech recognition system[J]. International Journal of Speech Technology, 2018, 21: 563-569.
[3] PENG R, TAN Z-H, LI X, et al. A perceptually motivated LP residual estimator in noisy and reverberant environments[J]. Speech Communication, 2018, 96: 129-141.
[4] HU Y, LOIZOU P C. A generalized subspace approach for enhancing speech corrupted by colored noise[J]. IEEE Transactions on Speech and Audio Processing, 2003, 11(4): 334-341.
[5] LAN T, PENG C, LI S, et al. An overview of monaural speech denoising and dereverberation research[J]. Journal of Computer Research and Development, 2020, 57(5): 928-953.
[6] LUO Y, MESGARANI N. TasNet: time-domain audio separation network for real-time, single-channel speech separation[C]// Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2018: 696-700.
[7] GAO T, DU J, DAI L-R, et al. Densely connected progressive learning for LSTM-based speech enhancement[C]// Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2018: 5054-5058.
[8] ROUTRAY S, MAO Q. Phase sensitive masking-based single channel speech enhancement using conditional generative adversarial network[J]. Computer Speech & Language, 2022, 71: 101270.
[9] YU W, ZHOU J, WANG H B, et al. SETransformer: speech enhancement Transformer[J]. Cognitive Computation, 2022, 14(3): 1152-1158.
[10] FU S-W, LIAO C-F, TSAO Y, et al. MetricGAN: generative adversarial networks based black-box metric scores optimization for speech enhancement[C]// Proceedings of the 36th International Conference on Machine Learning. New York: JMLR.org, 2019: 2031-2041.
[11] ZHANG Z, DENG C, SHEN Y, et al. On loss functions and recurrency training for GAN-based speech enhancement systems[C]// Proceedings of the 2020 Interspeech. Baixas, France: International Speech Communication Association, 2020: 3266-3270.
[12] NIKZAD M, NICOLSON A, GAO Y, et al. Deep residual-dense lattice network for speech enhancement[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(5): 8552-8559.
[13] PASCUAL S, BONAFONTE A, SERRÀ J. SEGAN: speech enhancement generative adversarial network[C]// Proceedings of the 2017 Interspeech. Baixas, France: International Speech Communication Association, 2017: 3642-3646.
[14] KIM E, SEO H. SE-Conformer: time-domain speech enhancement using conformer[EB/OL]. [2023-06-20].
[15] WANG K, HE B, ZHU W-P. TSTNN: two-stage transformer based neural network for speech enhancement in the time domain[C]// Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2021: 7098-7102.
[16] TAN K, WANG D L. Learning complex spectral mapping with gated convolutional recurrent networks for monaural speech enhancement[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 28: 380-390.
[17] CHOI H-S, KIM J-H, HUH J, et al. Phase-aware speech enhancement with deep complex U-net[C/OL]// Proceedings of the 2019 International Conference on Learning Representations. (2019-03-07) [2023-08-01].
[18] HU Y, LIU Y, LV S, et al. DCCRN: deep complex convolution recurrent network for phase-aware speech enhancement[C]// Proceedings of the 2020 Interspeech. Baixas, France: International Speech Communication Association, 2020: 2472-2476.
[19] LI A, ZHENG C, FAN C, et al. A recursive network with dynamic attention for monaural speech enhancement[C]// Proceedings of the 2020 Interspeech. Baixas, France: International Speech Communication Association, 2020: 2422-2426.
[20] DÉFOSSEZ A, SYNNAEVE G, ADI Y. Real time speech enhancement in the waveform domain[C]// Proceedings of the 2020 Interspeech. Baixas, France: International Speech Communication Association, 2020: 3291-3295.
[21] HUANG Z, WATANABE S, YANG S-W, et al. Investigating self-supervised learning for speech enhancement and separation[C]// Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2022: 6837-6841.
[22] LI A, LIU W, ZHENG C, et al. Two heads are better than one: a two-stage complex spectral mapping approach for monaural speech enhancement[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 1829-1843.
[23] HAO X, SU X, WEN S, et al. Masking and inpainting: a two-stage speech enhancement approach for low SNR and non-stationary noise[C]// Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2020: 6959-6963.
[24] WANG H, WANG D L. Neural cascade architecture with triple-domain loss for speech enhancement[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 30: 734-743.
[25] FAN J Y, YANG J B, ZHANG X W, et al. Monaural speech enhancement using U-net fused with multi-head self-attention[J]. Acta Acustica, 2022, 47(6): 703-716.
[26] JU Y, RAO W, YAN X, et al. TEA-PSE: Tencent-ethereal-audio-lab personalized speech enhancement system for ICASSP 2022 DNS challenge[C]// Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2022: 9291-9295.
[27] CHEN S, WANG C, CHEN Z, et al. WavLM: large-scale self-supervised pre-training for full stack speech processing[J]. IEEE Journal of Selected Topics in Signal Processing, 2022, 16(6): 1505-1518.
[28] WOO S, PARK J, LEE J-Y, et al. CBAM: convolutional block attention module[C]// Proceedings of the 15th European Conference on Computer Vision. Cham: Springer, 2018: 3-19.
[29] VEAUX C, YAMAGISHI J, KING S. The voice bank corpus: design, collection and data analysis of a large regional accent speech database[C]// Proceedings of the 2013 International Conference Oriental COCOSDA Held Jointly with Conference on Asian Spoken Language Research and Evaluation. Piscataway: IEEE, 2013: 1-4.
[30] THIEMANN J, ITO N, VINCENT E. The diverse environments multi-channel acoustic noise database (DEMAND): a database of multichannel environmental noise recordings[J]. Proceedings of Meetings on Acoustics, 2013, 19(1): 035081.
[31] MACARTNEY C, WEYDE T. Improved speech enhancement with the Wave-U-Net[EB/OL]. (2018-11-27) [2022-12-15].
[32] LI A, ZHENG C, ZHANG L, et al. Glance and gaze: a collaborative learning framework for single-channel speech enhancement[J]. Applied Acoustics, 2022, 187: 108499.
[33] YU B N, ZHAN Y Z, MAO Q R, et al. Double complex convolutional and attention aggregating recurrent network for speech enhancement[J]. Journal of Computer Applications, 2023, 43(10): 3217-3224.