Single-channel speech enhancement based on multi-channel information aggregation and collaborative decoding

doi:10.11772/j.issn.1001-9081.2023081141

Journal of Computer Applications, 2024, Vol. 44, Issue 8: 2611-2617. DOI: 10.11772/j.issn.1001-9081.2023081141

• Multimedia computing and computer simulation •

Shangbin MO¹,², Wenjun WANG¹,², Ling DONG¹,²,³, Shengxiang GAO¹,²,³ (corresponding author), Zhengtao YU¹,²,³

1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming Yunnan 650500, China
2. Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming Yunnan 650500, China
3. Yunnan Provincial Key Laboratory of Media Integration, Kunming Yunnan 650228, China

Received: 2023-08-25 Revised: 2023-09-20 Accepted: 2023-10-08 Online: 2024-08-22 Published: 2024-08-10
Contact: Shengxiang GAO (gaoshengxiang.yn@foxmail.com)
About author: MO Shangbin, born in 1996, native of Xichang, Sichuan, M. S. candidate. His research interests include speech enhancement and speech recognition.
WANG Wenjun, born in 1988, native of Kunming, Yunnan, Ph. D. candidate. His research interests include speech recognition and natural language processing.
DONG Ling, born in 1984, native of Dali, Yunnan, Ph. D. candidate, lecturer. His research interests include speech recognition and natural language processing.
GAO Shengxiang, born in 1977, native of Eryuan, Yunnan, Ph. D., professor, CCF member. Her research interests include natural language processing, machine translation, speech recognition, and speech synthesis.
YU Zhengtao, born in 1970, native of Qujing, Yunnan, Ph. D., professor, CCF member. His research interests include natural language processing, machine translation, and information retrieval.
Supported by:
National Natural Science Foundation of China (61972186); Yunnan High-tech Industry Development Project (201606); Major Science and Technology Special Program of Yunnan Province (202103AA080015); Basic Research Program of Yunnan Province (202001AS070014); Yunnan Science and Technology Talents and Platform Program (202105AC160018); Open Project of Yunnan Provincial Key Laboratory of Media Integration (220225702)

Abstract:

To address insufficient acoustic feature extraction and severe loss of decoding features in single-channel speech enhancement networks built on a convolutional encoder-decoder architecture, a single-channel speech enhancement network named Multi-Channel Information Aggregation and Collaborative Decoding (MIACD) was proposed. A dual-channel encoder was used to extract speech magnitude spectrum and complex spectrum features enriched with Self-Supervised Learning (SSL) representations. Four Conformer layers were employed to model the extracted features along the time and frequency dimensions. Through residual connections, the magnitude and complex features extracted by the dual-channel encoder were fed into a three-channel information aggregation decoder. Additionally, a Channel-Time-Frequency Attention (CTF-Attention) mechanism was proposed to adjust the aggregated information in the decoder according to the distribution of speech energy, alleviating the severe loss of usable acoustic information during decoding. Experimental results on the public dataset Voice Bank DEMAND show that, compared with GaGNet (Glance and Gaze, a collaborative learning framework for single-channel speech enhancement), the proposed method improves the objective metric WB-PESQ (Wide Band Perceptual Evaluation of Speech Quality) by 5.1% and reaches 96.7% on STOI (Short-Time Objective Intelligibility), verifying that the proposed method makes full use of speech information to reconstruct the signal, suppress noise, and improve speech intelligibility.

Key words: acoustic feature, multi-channel information aggregation, dual-channel encoder, three-channel information aggregation decoder, channel-time-frequency attention mechanism
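The abstract states that the dual-channel encoder works on magnitude spectrum and complex spectrum features. As a rough illustration of what those two inputs look like, the sketch below computes them from a waveform with a plain NumPy short-time Fourier transform. The function names `stft` and `encoder_inputs`, the 16 kHz sampling rate, the 512-point FFT, the hop of 128 samples, and the Hann window are all assumptions made for this example; the page does not give the authors' settings, and the SSL representations and the encoder itself are not modelled here.

```python
import numpy as np


def stft(signal, n_fft=512, hop=128):
    """Hann-windowed STFT; returns a complex array of shape (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def encoder_inputs(signal, n_fft=512, hop=128):
    """Hypothetical helper: build the two feature views named in the abstract."""
    spec = stft(signal, n_fft, hop)
    magnitude = np.abs(spec)                          # magnitude spectrum, (frames, bins)
    complex_feat = np.stack([spec.real, spec.imag])   # real/imag parts, (2, frames, bins)
    return magnitude, complex_feat


# Toy input (assumption): one second of a 440 Hz tone plus weak noise at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
rng = np.random.default_rng(0)
noisy = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.standard_normal(sr)

magnitude, complex_feat = encoder_inputs(noisy)
print(magnitude.shape, complex_feat.shape)
```

The magnitude view discards phase, while the stacked real and imaginary parts keep it, which is one common motivation for feeding both views to separate encoder branches.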

CLC Number:

TN912.35

Shangbin MO, Wenjun WANG, Ling DONG, Shengxiang GAO, Zhengtao YU. Single-channel speech enhancement based on multi-channel information aggregation and collaborative decoding[J]. Journal of Computer Applications, 2024, 44(8): 2611-2617.


Figures/Tables 8

References 33

1	GAO C F, CHENG G F, ZHANG P Y. Consistency self-supervised learning method for robust automatic speech recognition [J]. Acta Acustica, 2023, 48(3): 578-587.
2	ZHONG X, DAI Y, DAI Y, et al. Study on processing of wavelet speech denoising in speech recognition system [J]. International Journal of Speech Technology, 2018, 21: 563-569.
3	PENG R, TAN Z-H, LI X, et al. A perceptually motivated LP residual estimator in noisy and reverberant environments [J]. Speech Communication, 2018, 96: 129-141.
4	HU Y, LOIZOU P C. A generalized subspace approach for enhancing speech corrupted by colored noise [J]. IEEE Transactions on Speech and Audio Processing, 2003, 11(4): 334-341.
5	LAN T, PENG C, LI S, et al. An overview of monaural speech denoising and dereverberation research [J]. Journal of Computer Research and Development, 2020, 57(5): 928-953.
6	LUO Y, MESGARANI N. TasNet: time-domain audio separation network for real-time, single-channel speech separation [C]// Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2018: 696-700.
7	GAO T, DU J, DAI L-R, et al. Densely connected progressive learning for LSTM-based speech enhancement [C]// Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2018: 5054-5058.
8	ROUTRAY S, MAO Q. Phase sensitive masking-based single channel speech enhancement using conditional generative adversarial network [J]. Computer Speech & Language, 2022, 71: 101270.
9	YU W, ZHOU J, WANG H B, et al. SETransformer: speech enhancement Transformer [J]. Cognitive Computation, 2022, 14(3): 1152-1158.
10	FU S-W, LIAO C-F, TSAO Y, et al. MetricGAN: generative adversarial networks based black-box metric scores optimization for speech enhancement [C]// Proceedings of the 36th International Conference on Machine Learning. New York: JMLR.org, 2019: 2031-2041.
11	ZHANG Z, DENG C, SHEN Y, et al. On loss functions and recurrency training for GAN-based speech enhancement systems [C]// Proceedings of the 2020 Interspeech. Baixas, France: International Speech Communication Association, 2020: 3266-3270.
12	NIKZAD M, NICOLSON A, GAO Y, et al. Deep residual-dense lattice network for speech enhancement [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(5): 8552-8559.
13	PASCUAL S, BONAFONTE A, SERRÀ J. SEGAN: speech enhancement generative adversarial network [C]// Proceedings of the 2017 Interspeech. Baixas, France: International Speech Communication Association, 2017: 3642-3646.
14	KIM E, SEO H. SE-Conformer: time-domain speech enhancement using conformer [EB/OL]. [2023-06-20].
15	WANG K, HE B, ZHU W-P. TSTNN: two-stage transformer based neural network for speech enhancement in the time domain [C]// Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2021: 7098-7102.
16	TAN K, WANG D L. Learning complex spectral mapping with gated convolutional recurrent networks for monaural speech enhancement [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 28: 380-390.
17	CHOI H-S, KIM J-H, HUH J, et al. Phase-aware speech enhancement with deep complex U-net [C/OL]// Proceedings of the 2019 International Conference on Learning Representations. (2019-03-07) [2023-08-01].
18	HU Y, LIU Y, LV S, et al. DCCRN: deep complex convolution recurrent network for phase-aware speech enhancement [C]// Proceedings of the 2020 Interspeech. Baixas, France: International Speech Communication Association, 2020: 2472-2476.
19	LI A, ZHENG C, FAN C, et al. A recursive network with dynamic attention for monaural speech enhancement [C]// Proceedings of the 2020 Interspeech. Baixas, France: International Speech Communication Association, 2020: 2422-2426.
20	DÉFOSSEZ A, SYNNAEVE G, ADI Y. Real time speech enhancement in the waveform domain [C]// Proceedings of the 2020 Interspeech. Baixas, France: International Speech Communication Association, 2020: 3291-3295.
21	HUANG Z, WATANABE S, YANG S-W, et al. Investigating self-supervised learning for speech enhancement and separation [C]// Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2022: 6837-6841.
22	LI A, LIU W, ZHENG C, et al. Two heads are better than one: a two-stage complex spectral mapping approach for monaural speech enhancement [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 29: 1829-1843.
23	HAO X, SU X, WEN S, et al. Masking and inpainting: a two-stage speech enhancement approach for low SNR and non-stationary noise [C]// Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2020: 6959-6963.
24	WANG H, WANG D L. Neural cascade architecture with triple-domain loss for speech enhancement [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, 30: 734-743.
25	FAN J Y, YANG J B, ZHANG X W, et al. Monaural speech enhancement using U-net fused with multi-head self-attention [J]. Acta Acustica, 2022, 47(6): 703-716.
26	JU Y, RAO W, YAN X, et al. TEA-PSE: Tencent-ethereal-audio-lab personalized speech enhancement system for ICASSP 2022 DNS challenge [C]// Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2022: 9291-9295.
27	CHEN S, WANG C, CHEN Z, et al. WavLM: large-scale self-supervised pre-training for full stack speech processing [J]. IEEE Journal of Selected Topics in Signal Processing, 2022, 16(6): 1505-1518.
28	WOO S, PARK J, LEE J-Y, et al. CBAM: convolutional block attention module [C]// Proceedings of the 15th European Conference on Computer Vision. Cham: Springer, 2018: 3-19.
29	VEAUX C, YAMAGISHI J, KING S. The voice bank corpus: design, collection and data analysis of a large regional accent speech database [C]// Proceedings of the 2013 International Conference Oriental COCOSDA Held Jointly with Conference on Asian Spoken Language Research and Evaluation. Piscataway: IEEE, 2013: 1-4.
30	THIEMANN J, ITO N, VINCENT E. The diverse environments multi-channel acoustic noise database (DEMAND): a database of multichannel environmental noise recordings [J]. Proceedings of Meetings on Acoustics, 2013, 19(1): 035081.
31	MACARTNEY C, WEYDE T. Improved speech enhancement with the Wave-U-Net [EB/OL]. (2018-11-27) [2022-12-15].
32	LI A, ZHENG C, ZHANG L, et al. Glance and gaze: a collaborative learning framework for single-channel speech enhancement [J]. Applied Acoustics, 2022, 187: 108499.
33	YU B N, ZHAN Y Z, MAO Q R, et al. Double complex convolutional and attention aggregating recurrent network for speech enhancement [J]. Journal of Computer Applications, 2023, 43(10): 3217-2124.

Model	Feature type	WB-PESQ	STOI/%	CSIG	CBAK	COVL
Noisy	-	1.97	92.1	3.35	2.44	2.63
Wave U-Net [31]	Waveform	2.40	-	3.52	3.24	2.96
TSTNN [15]	Waveform	2.96	95.0	4.10	3.77	3.52
DEMUCS [20]	Waveform	3.07	95.0	4.31	3.40	3.63
MetricGAN [10]	Magnitude spectrum	2.86	-	3.99	3.18	3.42
CRGAN [11]	Magnitude spectrum	2.92	94.0	4.16	3.24	3.54
DCCRN [18]	Complex spectrum	2.68	93.7	3.88	3.18	3.27
DCCARN [33]	Complex spectrum	2.83	-	3.91	3.60	3.43
GaGNet [32]	Magnitude spectrum + complex spectrum	2.94	94.0	4.26	3.45	3.59
MIACD	Magnitude spectrum + complex spectrum	3.09	96.7	4.48	3.65	3.79
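The 5.1% WB-PESQ gain quoted in the abstract can be checked against the GaGNet and MIACD rows of the comparison table. The short sketch below does that arithmetic, reading "improvement" as relative change over the baseline score. That reading is an assumption, although it is the one that reproduces the quoted figure. The helper name `relative_improvement` is introduced only for this example.

```python
def relative_improvement(baseline, proposed):
    """Relative gain of `proposed` over `baseline`, in percent."""
    return (proposed - baseline) / baseline * 100.0


# Scores copied from the GaGNet and MIACD rows of the comparison table.
gagnet = {"WB-PESQ": 2.94, "STOI/%": 94.0, "CSIG": 4.26, "CBAK": 3.45, "COVL": 3.59}
miacd = {"WB-PESQ": 3.09, "STOI/%": 96.7, "CSIG": 4.48, "CBAK": 3.65, "COVL": 3.79}

gains = {m: relative_improvement(gagnet[m], miacd[m]) for m in gagnet}
for metric, gain in gains.items():
    print(f"{metric}: {gain:.1f}%")
# The WB-PESQ line prints 5.1%, matching the abstract.
```

Note that the 96.7% STOI figure in the abstract is the absolute score from the MIACD row, not a relative gain.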

Model	WB-PESQ	STOI/%	CSIG	CBAK	COVL
No SSL	2.91	96.0	4.38	3.30	3.68
SSL+EF	2.99	96.0	4.36	2.66	3.70
SSL+PF	2.98	95.9	4.37	2.67	3.71
SSL	3.09	96.7	4.48	3.65	3.79

Model	WB-PESQ	STOI/%	CSIG	CBAK	COVL
MIACD (No central layer)	2.87	95.9	4.29	3.29	3.61
MIACD+LSTM	2.98	96.3	4.35	3.44	3.70
MIACD+Transformer	2.99	96.3	4.33	3.49	3.69
MIACD+Conformer (proposed model)	3.09	96.7	4.48	3.43	3.79

Experiment No.	WB-PESQ	STOI/%	CSIG	CBAK	COVL
1	2.43	92.3	3.96	2.56	3.23
2	2.63	93.9	4.26	2.87	3.54
3	2.86	95.9	4.27	2.69	3.58
4	2.96	96.3	4.33	3.38	3.68
5	2.77	95.8	4.24	3.07	3.52
6	2.86	95.9	4.24	3.71	3.58
7	2.93	96.1	4.34	3.32	3.66
8	2.97	96.4	4.36	3.63	3.70
9	3.03	96.6	4.41	3.64	3.73
10	3.09	96.7	4.48	3.65	3.79

Related Articles

[1]	Xinyuan YOU, Heng WANG. Monaural speech enhancement based on gated dilated convolutional recurrent network [J]. Journal of Computer Applications, 2024, 44(4): 1317-1324.
[2]	Jianqing GAO, Yanhui TU, Feng MA, Zhonghua FU. Progressive ratio mask-based adaptive noise estimation method [J]. Journal of Computer Applications, 2023, 43(4): 1303-1308.
[3]	LONG Chao, ZENG Qingning, LUO Ying. Small-array speech enhancement based on noise cancellation and beamforming [J]. Journal of Computer Applications, 2020, 40(8): 2386-2391.
[4]	WU Qinghe, WU Haifeng, SHEN Yong, ZENG Yu. Speech enhancement using multi-microphone state space model under industrial noise environment [J]. Journal of Computer Applications, 2020, 40(5): 1476-1482.
[5]	GE Wanying, ZHANG Tianqi. Monaural speech enhancement algorithm based on mask estimation and optimization [J]. Journal of Computer Applications, 2019, 39(10): 3065-3070.
[6]	LUO Ying, ZENG Qingning, LONG Chao. Dual mini micro-array speech enhancement algorithm under multi-noise environment [J]. Journal of Computer Applications, 2019, 39(8): 2426-2430.
[7]	JIANG Maosong, WANG Dongxia, NIU Fanglin, CAO Yudong. Speech enhancement method based on sparsity-regularized non-negative matrix factorization [J]. Journal of Computer Applications, 2018, 38(4): 1176-1180.
[8]	LIU Jingang, ZHOU Yi, MA Yongbao, LIU Hongqing. Estimation algorithm of switching speech power spectrum for automatic speech recognition system [J]. Journal of Computer Applications, 2016, 36(12): 3369-3373.
[9]	MA Jinlong, ZENG Qingning, HU Dan, LONG Chao, XIE Xianming. Speech enhancement algorithm based on microphone array under multiple noise environments [J]. Journal of Computer Applications, 2015, 35(8): 2341-2344.
[10]	BigData2023-P00186 Monaural Speech Enhancement Based on Multi-Channel Information Aggregation and Collaborative Decoding [J].