Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (10): 3284-3293.DOI: 10.11772/j.issn.1001-9081.2024101463
• Multimedia computing and computer simulation •
Gengzangcuomao1,2, Heming HUANG1,2()
Received: 2024-10-21
Revised: 2025-03-02
Accepted: 2025-03-10
Online: 2025-10-14
Published: 2025-10-10
Contact: Heming HUANG
About author: Gengzangcuomao, born in 1993 in Gonghe County, Qinghai (Tibetan), lecturer, Ph.D. candidate, CCF member. Her research interests include speech enhancement and speech recognition.
Gengzangcuomao, Heming HUANG. Monaural speech enhancement with heterogeneous dual-branch decoding based on multi-view attention[J]. Journal of Computer Applications, 2025, 45(10): 3284-3293.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2024101463
| Module | Layer | Input size | Hyperparameters | Output size |
| --- | --- | --- | --- | --- |
| IFE | gatedconvblock | 3×257×257 | (1,1),(1,1),64 | 64×257×257 |
| | dilatedconv_1 | 64×257×257 | (2,3),(1,1),128,1 | 128×257×257 |
| | dilatedconv_2 | 128×257×257 | (2,3),(1,1),192,2 | 192×257×257 |
| | dilatedconv_3 | 192×257×257 | (2,3),(1,1),256,4 | 256×257×257 |
| | dilatedconv_4 | 256×257×257 | (2,3),(1,1),320,8 | 320×257×257 |
| | convblock | 320×257×257 | (1,1),(1,1),64 | 64×257×129 |
| | Reshape-size | 64×257×129 | — | 258×257×64 |
| TFRC | TimeresidualConformer | 258×257×64 | Head=4, Channel=64, Dropout=0.2 | 258×257×64 |
| | Reshape-size | 258×257×64 | — | 514×129×64 |
| | FrequencyresidualConformer | 514×129×64 | Head=4, Channel=64, Dropout=0.2 | 514×129×64 |
| | Reshape-size | 514×129×64 | — | 64×257×129 |
| MVA | MViAtt | 514×129×64 | — | 64×257×129 |
| Gated masking decoder | dilatedconv_1 | 64×257×129 | (2,3),(1,1),128,1 | 128×257×129 |
| | dilatedconv_2 | 128×257×129 | (2,3),(1,1),192,2 | 192×257×129 |
| | dilatedconv_3 | 192×257×129 | (2,3),(1,1),256,4 | 256×257×129 |
| | dilatedconv_4 | 256×257×129 | (2,3),(1,1),320,8 | 320×257×129 |
| | Sub-pixelconv | 320×257×129 | (1,3),(1,1),64 | 64×257×258 |
| | gatedconvblock | 64×257×258 | (1,1),(1,1),1 | 1×257×257 |
| Complex mapping decoder | dilatedconv_1 | 64×257×129 | (2,3),(1,1),128,1 | 128×257×129 |
| | dilatedconv_2 | 128×257×129 | (2,3),(1,1),192,2 | 192×257×129 |
| | dilatedconv_3 | 192×257×129 | (2,3),(1,1),256,4 | 256×257×129 |
| | dilatedconv_4 | 256×257×129 | (2,3),(1,1),320,8 | 64×257×129 |
| | Sub-pixelconv | 64×257×129 | (1,3),(1,1),64 | 64×257×258 |
| | convblock | 64×257×258 | (1,2),(1,1),2 | 2×257×257 |
Tab. 1 Parameter settings of HDBMV
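The dilated convolutions in Tab. 1 use kernel (2,3), stride (1,1), and dilation rates 1, 2, 4, 8 while leaving the 257×257 time-frequency map unchanged. The standard convolution output-size formula shows how much padding each dilation rate requires; the sketch below is an illustrative consistency check, not code from the paper:

```python
import math

def conv_out(n, k, s=1, d=1, p=0):
    """Output length of a 1-D convolution: floor((n + 2p - d*(k-1) - 1)/s) + 1."""
    return math.floor((n + 2 * p - d * (k - 1) - 1) / s) + 1

def same_pad(k, d):
    """Total padding needed to preserve length at stride 1: d*(k-1)."""
    return d * (k - 1)

# Tab. 1: kernel (2,3), stride (1,1), dilations 1, 2, 4, 8 on a 257x257 map.
for d in (1, 2, 4, 8):
    pt = same_pad(2, d)  # time axis, kernel 2 -> total pad d (e.g. applied causally)
    pf = same_pad(3, d)  # frequency axis, kernel 3 -> pad d on each side
    t = conv_out(257 + pt, 2, d=d)
    f = conv_out(257 + pf, 3, d=d)
    print(d, t, f)  # every dilation rate keeps the 257x257 map
```

With this padding scheme each `dilatedconv_k` in the table preserves its spatial size, so only the channel count grows (64 → 128 → 192 → 256 → 320), matching the Input/Output size columns.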
| Domain | Model | Model size/10⁶ | PESQ | STOI | CSIG | CBAK | COVL |
| --- | --- | --- | --- | --- | --- | --- | --- |
| — | Noisy | — | 1.97 | 0.79 | 3.35 | 2.44 | 2.63 |
| Time domain | DPTNet | 2.69 | 2.78 | — | 3.92 | 3.40 | 3.36 |
| | SE-Flow | — | 2.43 | — | 3.77 | 3.12 | 3.09 |
| | CDiffuSE | — | 2.52 | — | 3.72 | 2.91 | 3.10 |
| | DiffWave | 6.91 | 2.52 | — | 3.72 | 3.27 | 3.11 |
| | SEGAN | — | 2.16 | 0.92 | 3.48 | 2.94 | 2.80 |
| | FSPEN | — | 2.97 | 0.94 | — | — | — |
| | MOSE | — | 2.54 | — | 3.72 | 2.93 | 3.06 |
| | ROSE | 36.98 | 3.01 | 0.95 | 3.56 | 3.72 | 4.47 |
| Frequency domain | DCTCN | 9.70 | 2.83 | — | 3.91 | 3.37 | 3.37 |
| | NAAGN | — | 2.79 | — | 3.99 | 3.38 | 3.32 |
| | TFT-Net | — | 2.71 | 0.95 | 3.86 | 3.39 | 3.25 |
| | MMSE-GAN | — | 2.53 | — | 3.80 | 3.12 | 3.14 |
| | S4D | 10.80 | 2.55 | 0.93 | 3.94 | 3.00 | 3.23 |
| | MetricGAN | — | 2.86 | — | 3.99 | 3.18 | 3.42 |
| | PMOS | — | 2.98 | 0.87 | — | — | — |
| | DeepFilterNet | 1.80 | 2.81 | — | — | — | — |
| | HDBMV | 1.80 | 3.00 | 0.96 | 3.96 | 3.64 | 3.45 |
Tab. 2 Performance comparison of different models on VoiceBank+DEMAND dataset
| Model | PESQ | STOI | Model size/10⁶ | MAC/10⁹ |
| --- | --- | --- | --- | --- |
| CA DU-Net | 1.99 | 0.83 | 26.00 | — |
| FasNet TAC | 2.26 | 0.87 | 2.90 | 55.3 |
| DCCRN | 2.57 | 0.90 | 18.00 | 14.4 |
| TPARN | 2.75 | 0.92 | 3.20 | 71.8 |
| ADCN | 2.84 | 0.92 | 9.30 | 72.8 |
| DRC-NET | 2.79 | 0.89 | 2.60 | 13.9 |
| DeFT-AN | 3.01 | 0.92 | 2.70 | 95.6 |
| aTENNuate | 2.98 | — | — | — |
| HDBMV | 3.12 | 0.97 | 1.80 | 65.1 |
Tab. 3 Performance comparison of different models on DNS Challenge 2020 dataset
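Tab. 3 reports model size in 10⁶ parameters and complexity in 10⁹ MACs. For convolutional layers both quantities follow from standard counting formulas; the sketch below applies them to a hypothetical layer shaped like Tab. 1's dilatedconv_1 as illustrative accounting only, not the paper's measurement procedure:

```python
def conv2d_params(cin, cout, kh, kw, bias=True):
    """Parameter count of a 2-D convolution layer."""
    return cin * cout * kh * kw + (cout if bias else 0)

def conv2d_macs(cin, cout, kh, kw, h_out, w_out):
    """Multiply-accumulate count for one forward pass:
    one kernel application per output element."""
    return cin * cout * kh * kw * h_out * w_out

# Hypothetical layer shaped like Tab. 1's dilatedconv_1:
# 64 -> 128 channels, kernel (2,3), 257x257 output map.
p = conv2d_params(64, 128, 2, 3)
m = conv2d_macs(64, 128, 2, 3, 257, 257)
print(p, round(m / 1e9, 2))  # -> 49280 3.25
```

Summing these counts over every layer in Tab. 1 is how the "Model size/10⁶" and "MAC/10⁹" columns of Tab. 3 are conventionally obtained.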
| Noise environment | SNR/dB | PESQ: Noisy | PESQ: CRN | PESQ: GCRN | PESQ: TSTNN | PESQ: DPTFSNET | PESQ: HDBMV | STOI: Noisy | STOI: CRN | STOI: GCRN | STOI: TSTNN | STOI: DPTFSNET | STOI: HDBMV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Known noise (pink) | -5 | 1.11 | 1.12 | 1.30 | 1.42 | 1.22 | 1.50 | 0.57 | 0.60 | 0.67 | 0.71 | 0.59 | 0.72 |
| | 0 | 1.15 | 1.18 | 1.53 | 1.75 | 1.37 | 1.82 | 0.67 | 0.67 | 0.76 | 0.79 | 0.68 | 0.79 |
| | 5 | 1.23 | 1.30 | 1.84 | 2.14 | 1.63 | 2.16 | 0.76 | 0.76 | 0.81 | 0.84 | 0.75 | 0.84 |
| | 10 | 1.42 | 1.50 | 2.10 | 2.53 | 1.91 | 2.46 | 0.83 | 0.83 | 0.85 | 0.88 | 0.78 | 0.87 |
| | 15 | 1.73 | 1.74 | 2.33 | 2.88 | 2.14 | 2.76 | 0.87 | 0.87 | 0.87 | 0.90 | 0.80 | 0.90 |
| Unknown noise (white) | -6 | 1.13 | 1.12 | 1.20 | 1.19 | 1.15 | 1.39 | 0.61 | 0.61 | 0.63 | 0.68 | 0.63 | 0.72 |
| | 0 | 1.15 | 1.16 | 1.40 | 1.32 | 1.31 | 1.76 | 0.70 | 0.70 | 0.75 | 0.78 | 0.71 | 0.80 |
| | 5 | 1.21 | 1.24 | 1.64 | 1.49 | 1.59 | 2.15 | 0.77 | 0.76 | 0.80 | 0.83 | 0.77 | 0.84 |
| | 12 | 1.41 | 1.39 | 2.05 | 1.89 | 2.06 | 2.57 | 0.84 | 0.82 | 0.85 | 0.88 | 0.81 | 0.88 |
| | 15 | 1.57 | 1.51 | 2.20 | 2.13 | 2.19 | 2.73 | 0.87 | 0.84 | 0.87 | 0.90 | 0.81 | 0.89 |
Tab. 4 PESQ and STOI values comparison of different models on BodSpeDB dataset in stationary noise environment
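Tabs. 4 and 5 evaluate at fixed SNRs from -6 dB to 15 dB. Test mixtures at a target SNR are commonly built by scaling the noise against the clean signal's power, SNR = 10·log10(P_clean/P_noise). The sketch below is a minimal illustration of that convention, not the paper's data pipeline:

```python
import math

def scale_noise_to_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + scaled noise has the target SNR in dB."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    target_p_noise = p_clean / (10 ** (snr_db / 10))
    g = math.sqrt(target_p_noise / p_noise)
    return [g * x for x in noise]

# One-second toy example at 16 kHz with deterministic pseudo-noise.
clean = [math.sin(0.01 * i) for i in range(16000)]
noise = [((i * 2654435761) % 1000) / 500 - 1 for i in range(16000)]
scaled = scale_noise_to_snr(clean, noise, 5.0)

p_c = sum(x * x for x in clean) / len(clean)
p_n = sum(x * x for x in scaled) / len(scaled)
print(round(10 * math.log10(p_c / p_n), 3))  # -> 5.0
```

The "known" rows then draw noise types seen in training and the "unknown" rows draw unseen ones, with only the scaling gain changing per SNR condition.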
| Noise environment | SNR/dB | PESQ: Noisy | PESQ: CRN | PESQ: GCRN | PESQ: TSTNN | PESQ: DPTFSNET | PESQ: HDBMV | STOI: Noisy | STOI: CRN | STOI: GCRN | STOI: TSTNN | STOI: DPTFSNET | STOI: HDBMV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Known noise (Babble) | -5 | 1.20 | 1.20 | 1.30 | 1.32 | 1.23 | 1.39 | 0.54 | 0.56 | 0.62 | 0.63 | 0.51 | 0.67 |
| | 0 | 1.24 | 1.31 | 1.52 | 1.57 | 1.35 | 1.71 | 0.66 | 0.58 | 0.75 | 0.75 | 0.62 | 0.77 |
| | 5 | 1.36 | 1.50 | 1.77 | 1.94 | 1.53 | 2.05 | 0.75 | 0.75 | 0.81 | 0.83 | 0.70 | 0.83 |
| | 10 | 1.62 | 1.78 | 2.06 | 2.40 | 1.80 | 2.39 | 0.82 | 0.82 | 0.85 | 0.87 | 0.76 | 0.87 |
| | 15 | 2.01 | 2.06 | 2.29 | 2.80 | 2.02 | 2.72 | 0.87 | 0.87 | 0.87 | 0.90 | 0.79 | 0.89 |
| Unknown noise (Buccaneer1) | -6 | 1.12 | 1.12 | 1.19 | 1.20 | 1.16 | 1.34 | 0.52 | 0.54 | 0.58 | 0.61 | 0.52 | 0.67 |
| | 0 | 1.15 | 1.18 | 1.37 | 1.39 | 1.28 | 1.68 | 0.65 | 0.66 | 0.72 | 0.73 | 0.64 | 0.77 |
| | 5 | 1.22 | 1.29 | 1.65 | 1.69 | 1.48 | 2.03 | 0.73 | 0.74 | 0.79 | 0.80 | 0.71 | 0.82 |
| | 12 | 1.48 | 1.56 | 2.03 | 2.23 | 1.88 | 2.49 | 0.82 | 0.81 | 0.84 | 0.87 | 0.78 | 0.87 |
| | 15 | 1.66 | 1.70 | 2.19 | 2.48 | 2.04 | 2.69 | 0.85 | 0.83 | 0.86 | 0.89 | 0.80 | 0.88 |
Tab. 5 PESQ and STOI values comparison of different models on BodSpeDB dataset in non-stationary noise environment
| Model | PESQ | STOI | CSIG | CBAK | COVL |
| --- | --- | --- | --- | --- | --- |
| Baseline 0 | 2.85 | 0.95 | 3.62 | 3.56 | 3.22 |
| Baseline 1 | 3.00 | 0.96 | 3.86 | 3.56 | 3.41 |
| IFE | 2.97 | 0.96 | 3.93 | 3.64 | 3.43 |
| IFE+HDBD | 2.97 | 0.96 | 3.93 | 3.64 | 3.44 |
| IFE+HDBD+TFRC | 3.00 | 0.96 | 3.95 | 3.63 | 3.44 |
| HDBMV | 3.00 | 0.96 | 3.96 | 3.64 | 3.45 |
Tab. 6 Ablation study results on VoiceBank+DEMAND dataset