基于双谱非线性特征耦合的语音增强方法

doi:10.11772/j.issn.1001-9081.2025050674

《计算机应用》唯一官方网站 ›› 2026, Vol. 46 ›› Issue (5): 1596-1603.DOI: 10.11772/j.issn.1001-9081.2025050674

• 多媒体计算与计算机仿真 • 上一篇

基于双谱非线性特征耦合的语音增强方法

余正涛¹^,²(), 栾逸雪¹^,², 王文君¹^,², 董凌¹^,², 相艳¹^,², 高盛祥¹^,²

^1.昆明理工大学信息工程与自动化学院，昆明 650504
^2.云南省人工智能重点实验室（昆明理工大学），昆明 650504

收稿日期:2025-06-19 修回日期:2025-07-18 接受日期:2025-07-23 发布日期:2025-08-01 出版日期:2026-05-10
通讯作者: 余正涛
作者简介:栾逸雪（2000—），女，云南个旧人，硕士研究生，主要研究方向：语音增强、语音识别；
王文君（1988—），男，云南昆明人，博士研究生，主要研究方向：语音识别、自然语言处理；
董凌（1984—），男，云南大理人，讲师，博士研究生，主要研究方向：语音识别、自然语言处理；
相艳（1979—），女，云南大理人，副教授，博士，主要研究方向：自然语言处理；
高盛祥（1977—），女，云南洱源人，教授，博士，CCF会员，主要研究方向：自然语言处理、机器翻译、语音识别、语音合成。
基金资助:
国家自然科学基金资助项目(U24A20334);国家自然科学基金资助项目(62466030);国家自然科学基金资助项目(62376111);云南省重点研发计划项目(202303AP140008);云南省人工智能重点实验室开放基金资助项目(CB24069D018A)

Bispectrum-based nonlinear feature coupling method for speech enhancement

Zhengtao YU¹^,²(), Yixue LUAN¹^,², Wenjun WANG¹^,², Ling DONG¹^,², Yan XIANG¹^,², Shengxiang GAO¹^,²

^1.Faculty of Information Engineering and Automation，Kunming University of Science and Technology，Kunming Yunnan 650504，China
^2.Key Laboratory of Artificial Intelligence in Yunnan Province （Kunming University of Science and Technology），Kunming Yunnan 650504，China

Received:2025-06-19 Revised:2025-07-18 Accepted:2025-07-23 Online:2025-08-01 Published:2026-05-10
Contact: Zhengtao YU
About author:LUAN Yixue， born in 2000， M. S. candidate. Her research interests include speech enhancement， speech recognition.
WANG Wenjun， born in 1988， Ph. D. candidate. His research interests include speech recognition， natural language processing.
DONG Ling， born in 1984， Ph. D. candidate， lecturer. His research interests include speech recognition， natural language processing.
XIANG Yan， born in 1979， Ph. D.， associate professor. Her research interests include natural language processing.
GAO Shengxiang， born in 1977， Ph. D.， professor. Her research interests include natural language processing， machine translation， speech recognition， speech synthesis.
Supported by:
National Natural Science Foundation of China(U24A20334);Key Research and Development Program of Yunnan Province(202303AP140008);Open Fund of Key Laboratory of Artificial Intelligence in Yunnan Province(CB24069D018A)

摘要/Abstract

摘要：

针对当前基于时频域的语音增强方法普遍通过短时傅里叶变换（STFT）后利用频谱二阶统计量建模信号的线性特征，忽略了语音中潜在的高阶非线性交互信息的问题，提出一种基于双谱非线性特征耦合的语音增强方法（BNFC）。该方法采用编解码结构作为整体框架，在编码器后引入双谱特征提取模块，以获取三阶统计量所揭示的相位耦合与非线性结构信息；并通过跳跃连接与编码器特征融合，实现更深层次的幅度与相位建模。在VoiceBank+DEMAND数据集上的实验结果显示，BNFC在语音质量的感知评估（PESQ）指标上达到3.57，比基线模型BREM（Bispectral Refinement Enhancement Module）提升15.53%，在语音信号失真感知评分（CSIG）、背景噪声干扰评分（CBAK）和整体语音质量评分（COVL）指标上分别提升5.51%、3.08%和10.31%，验证了高阶非线性特征建模对语音增强任务的重要性。

关键词: 语音增强, 双谱分析, 特征耦合, 高阶非线性, 跳跃连接

Abstract:

To address the issue that current time-frequency domain-based speech enhancement methods commonly model the linear characteristics of signals using second-order spectral statistics after Short-Time Fourier Transform （STFT）， while neglecting the potential higher-order nonlinear interaction information in speech， a Bispectrum-based Nonlinear Feature Coupling method for speech enhancement （BNFC） was proposed. An encoder-decoder structure was employed as the overall framework， and a bispectral feature extraction module was introduced after the encoder to capture phase coupling and nonlinear structural information revealed by third-order statistics. By fusing the extracted bispectral features with encoder features through skip connections， deeper amplitude and phase modeling was achieved. Experimental results on the VoiceBank+DEMAND dataset showed that BNFC achieved a Perceptual Evaluation of Speech Quality （PESQ） score of 3.57， representing a 15.53% improvement over the baseline model BREM （Bispectral Refinement Enhancement Module）. In addition， Mean Opinion Score of Signal Distortion （CSIG）， Background Noise Intrusiveness （CBAK）， and Overall Speech Quality （COVL） were improved by 5.51%， 3.08%， and 10.31%， respectively， validating the importance of higher-order nonlinear feature modeling for speech enhancement tasks.

Key words: Speech Enhancement (SE), bispectral analysis, feature coupling, higher-order nonlinearity, skip connection

中图分类号:

TN912.35

余正涛, 栾逸雪, 王文君, 董凌, 相艳, 高盛祥. 基于双谱非线性特征耦合的语音增强方法[J]. 计算机应用, 2026, 46(5): 1596-1603.

Zhengtao YU, Yixue LUAN, Wenjun WANG, Ling DONG, Yan XIANG, Shengxiang GAO. Bispectrum-based nonlinear feature coupling method for speech enhancement[J]. Journal of Computer Applications, 2026, 46(5): 1596-1603.

图/表 7

参考文献 35

[1]	EPHRAIM Y， MALAH D. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator［J］. IEEE Transactions on Acoustics， Speech， and Signal Processing， 1984， 32（6）： 1109-1121.
[2]	BOLL S. Suppression of acoustic noise in speech using spectral subtraction［J］. IEEE Transactions on Acoustics， Speech， and Signal Processing， 1979， 27（2）： 113-120.
[3]	WILSON K W， RAJ B， SMARAGDIS P， et al. Speech denoising using nonnegative matrix factorization with priors［C］// Proceedings of the 2008 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2008： 4029-4032.
[4]	WANG D， CHEN J. Supervised speech separation based on deep learning： an overview［J］. IEEE/ACM Transactions on Audio， Speech， and Language Processing， 2018， 26（10）： 1702-1726.
[5]	VAN DEN OORD A， DIELEMAN S， ZEN H， et al. WaveNet： a generative model for raw audio［C］// Proceedings of the 9th ISCA Speech Synthesis Workshop. ［S.l.］： International Speech Communication Association， 2016： 125.
[6]	DÉFOSSEZ A， USUNIER N， BOTTOU L， et al. Demucs： deep extractor for music sources with extra unlabeled data remixed［EB/OL］. ［2025-07-18］. .
[7]	HU Y， LIU Y， LV S， et al. DCCRN： deep complex convolution recurrent network for phase-aware speech enhancement［C］// Proceedings of the INTERSPEECH 2020. ［S.l.］： International Speech Communication Association， 2020： 2472-2476.
[8]	YIN D， LUO C， XIONG Z， et al. PHASEN： a phase-and-harmonics-aware speech enhancement network［C］// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto： AAAI Press， 2020： 9458-9465.
[9]	ABDALLA R. Complex-valued neural networks — theory and analysis［EB/OL］. ［2025-07-18］..
[10]	LU Y X， YANG A， LING Z H. MP-SENet： a speech enhancement model with parallel denoising of magnitude and phase spectra［C］// Proceedings of the INTERSPEECH 2023. ［S.l.］： International Speech Communication Association， 2023： 3834-3838.
[11]	ALHUSSEIN G， ALKHODARI M， KHANDOKER A H， et al. Deep bispectral analysis of conversational speech towards emotional climate recognition［C］// Proceedings of the 2023 IEEE International Conference on Artificial Intelligence in Engineering and Technology. Piscataway： IEEE， 2023： 170-175.
[12]	WANG W， DONG L， YU Z， et al. Robust speech recognition method based on dense time-frequency convolution and bispectral refinement enhancement［J］. International Journal of Machine Learning and Cybernetics， 2025， 16（9）： 5707-5725.
[13]	TAN K， WANG D. Complex spectral mapping with a convolutional recurrent network for monaural speech enhancement［C］// Proceedings of the 2019 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2019： 6865-6869.
[14]	莫尚斌，王文君，董凌，等.基于多路信息聚合协同解码的单通道语音增强［J］.计算机应用，2024，44（8）：2611-2617.
	MO S B， WANG W J， DONG L， et al. Single-channel speech enhancement based on multi-channel information aggregation and collaborative decoding［J］. Journal of Computer Applications， 2024， 44（8）： 2611-2617.
[15]	CAO R， ABDULATIF S， YANG B. CMGAN： conformer-based metric GAN for speech enhancement［C］// Proceedings of the INTERSPEECH 2022. ［S.l.］： International Speech Communication Association， 2022： 936-940.
[16]	ZHANG Z， XU S， ZHUANG X， et al. Dual branch deep interactive UNet for monaural noisy-reverberant speech enhancement［J］. Applied Acoustics， 2023， 212： No.109574.
[17]	SU Y， LIU Y， YANG C， et al. MN-Net： multi-scale feature fusion and neighborhood attention self-supervised network for industrial spool surface anomaly detection［C］// Proceedings of the IEEE 36th International Conference on Tools with Artificial Intelligence. Piscataway： IEEE， 2024： 282-289.
[18]	NIKIAS C L， MENDEL J M. Signal processing with higher-order spectra［J］. IEEE Signal Processing Magazine， 1993， 10（3）： 10-37.
[19]	RANGOUSSI M， CARAYANNIS G. Adaptive detection of noisy speech using third-order statistics［J］. International Journal of Adaptive Control and Signal Processing， 1996， 10（2/3）： 113-136.
[20]	HIRLEKAR S G， HOLAMBE R S， BASU T K. Phase recovery from bispectrum［J］. IETE Journal of Research， 2000， 46（3）： 139-145.
[21]	LAVANYA T， VIJAYALAKSHMI P， MRINALINI K， et al. Higher order statistics-driven magnitude and phase spectrum estimation for speech enhancement［J］. Computer Speech and Language， 2024， 87： No.101639.
[22]	PANDEY A， WANG D. Densely connected neural network with dilated convolutions for real-time speech enhancement in the time domain［C］// Proceedings of the 2020 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2020： 6629-6633.
[23]	ULYANOV D， VEDALDI A， LEMPITSKY V. Instance normalization： the missing ingredient for fast stylization［EB/OL］. ［2025-02-18］..
[24]	HE K， ZHANG X， REN S， et al. Delving deep into rectifiers： surpassing human-level performance on ImageNet classification［C］// Proceedings of the 2015 IEEE International Conference on Computer Vision. Piscataway： IEEE， 2015： 1026-1034.
[25]	FU S W， YU C， HSIEH T A， et al. MetricGAN+： an improved version of MetricGAN for speech enhancement［C］// Proceedings of the INTERSPEECH 2021. ［S.l.］： International Speech Communication Association， 2021： 201-205.
[26]	YANG A， LING Z H. Neural speech phase prediction based on parallel estimation architecture and anti-wrapping losses［C］// Proceedings of the 2023 IEEE International Conference on Acoustics， Speech and Signal Processing. Piscataway： IEEE， 2023： 1-5.
[27]	VALENTINI-BOTINHAO C， WANG X， TAKAKI S， et al. Investigating RNN-based speech enhancement methods for noise-robust text-to-speech［C］// Proceedings of the 9th ISCA Speech Synthesis Workshop. ［S.l.］： International Speech Communication Association， 2016： 146-152.
[28]	VEAUX C， YAMAGISHI J， KING S. The voice bank corpus： design， collection and data analysis of a large regional accent speech database［C］// Proceedings of the 2013 International Conference on Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation. Piscataway： IEEE， 2013： 1-4.
[29]	THIEMANN J， ITO N， VINCENT E. The Diverse Environments Multi-channel Acoustic Noise Database （DEMAND）： a database of multichannel environmental noise recordings［J］. Proceedings of Meetings on Acoustics， 2013， 19（1）： No.035081.
[30]	LOSHCHILOV I， HUTTER F. Decoupled weight decay regularization［EB/OL］. ［2025-01-09］..
[31]	PASCUAL S， BONAFONTE A， SERRÀ J. SEGAN： speech enhancement generative adversarial network［C］// Proceedings of the INTERSPEECH 2017. ［S.l.］： International Speech Communication Association， 2017： 3642-3646.
[32]	KIM E， SEO H. SE-Conformer： time-domain speech enhancement using conformer［C］// Proceedings of the INTERSPEECH 2021. ［S.l.］： International Speech Communication Association， 2021： 2736-2740.
[33]	FU S W， LIAO C F， TSAO Y， et al. MetricGAN： generative adversarial networks based black-box metric scores optimization for speech enhancement［C］// Proceedings of the 36th International Conference on Machine Learning. New York： JMLR.org， 2019： 2031-2041.
[34]	YIN D， ZHAO Z， TANG C， et al. TridentSE： guiding speech enhancement with 32 global tokens［C］// Proceedings of the INTERSPEECH 2023. ［S.l.］： International Speech Communication Association， 2023： 3839-3843.
[35]	CHAO R， CHENG W H， LA QUATRA M， et al. An investigation of incorporating mamba for speech enhancement［C］// Proceedings of the 2024 IEEE Spoken Language Technology Workshop. Piscataway： IEEE， 2024： 302-308.

模型	年份	模型处理信号的方法	模型参数量/10⁶	PESQ	CSIG	CBAK	COVL	SSNR/dB	STOI
Noisy	—	—	—	1.97	3.35	2.44	2.63	1.68	0.91
SEGAN	2017	T	43.18	2.16	3.48	2.94	2.80	7.73	0.92
Demucs	2021	T	33.53	3.07	4.31	3.40	3.63	—	0.95
SE-Conformer	2021	T	—	3.13	4.45	3.55	3.82	—	0.95
MetricGAN	2019	T	—	2.86	3.99	3.18	3.42	—	—
MetricGAN+	2021	T-F	—	3.15	4.14	3.16	3.64	—	—
TridentSE	2023	T-F	3.03	3.47	4.70	3.81	4.10	—	0.96
CMGAN	2022	T-F	1.83	3.41	4.63	3.94	4.12	11.10	0.96
PHASEN	2020	T-F	—	2.99	4.21	3.55	3.62	10.18	—
BREM	2025	T-F	5.16	3.09	4.54	3.90	3.88	—	0.97
MP-SENet	2023	T-F	2.05	3.50	4.73	3.95	4.22	10.64	0.96
SEMamba	2024	T-F	2.25	3.55	4.77	3.95	4.26	—	0.96
BNFC	2025	T-F	2.26	3.57	4.79	4.02	4.28	10.71	0.96

模型	年份	模型处理信号的方法	模型参数量/10⁶	PESQ	CSIG	CBAK	COVL	SSNR/dB	STOI
Noisy	—	—	—	1.97	3.35	2.44	2.63	1.68	0.91
SEGAN	2017	T	43.18	2.16	3.48	2.94	2.80	7.73	0.92
Demucs	2021	T	33.53	3.07	4.31	3.40	3.63	—	0.95
SE-Conformer	2021	T	—	3.13	4.45	3.55	3.82	—	0.95
MetricGAN	2019	T	—	2.86	3.99	3.18	3.42	—	—
MetricGAN+	2021	T-F	—	3.15	4.14	3.16	3.64	—	—
TridentSE	2023	T-F	3.03	3.47	4.70	3.81	4.10	—	0.96
CMGAN	2022	T-F	1.83	3.41	4.63	3.94	4.12	11.10	0.96
PHASEN	2020	T-F	—	2.99	4.21	3.55	3.62	10.18	—
BREM	2025	T-F	5.16	3.09	4.54	3.90	3.88	—	0.97
MP-SENet	2023	T-F	2.05	3.50	4.73	3.95	4.22	10.64	0.96
SEMamba	2024	T-F	2.25	3.55	4.77	3.95	4.26	—	0.96
BNFC	2025	T-F	2.26	3.57	4.79	4.02	4.28	10.71	0.96

模型	PESQ	CSIG	CBAK	COVL	SSNR/dB
MP-SENet	3.50	4.73	3.95	4.22	10.64
BNFC	3.57	4.79	4.02	4.28	10.71
+Encoder	3.55	4.78	4.00	4.27	10.68
+Decoder	3.49	4.75	3.97	4.22	10.64
+Branch	3.52	4.76	3.99	4.25	10.66

模型	PESQ	CSIG	CBAK	COVL	SSNR/dB
MP-SENet	3.50	4.73	3.95	4.22	10.64
BNFC	3.57	4.79	4.02	4.28	10.71
+Encoder	3.55	4.78	4.00	4.27	10.68
+Decoder	3.49	4.75	3.97	4.22	10.64
+Branch	3.52	4.76	3.99	4.25	10.66

基于双谱非线性特征耦合的语音增强方法

Bispectrum-based nonlinear feature coupling method for speech enhancement

RichHTML

PDF

可视化

摘要/Abstract

引用本文

使用本文

图/表 7

参考文献 35

相关文章 15

编辑推荐

Metrics

[1]	吕景刚, 彭绍睿, 高硕, 周金. 复频域注意力和多尺度频域增强驱动的语音增强网络[J]. 《计算机应用》唯一官方网站, 2025, 45(9): 2957-2965.
[2]	邓酩, 徐锦凡, 肖洪祥, 谢晓兰. 改进TransUNet的高效通道注意力医学图像分割网络[J]. 《计算机应用》唯一官方网站, 2025, 45(12): 4037-4044.
[3]	徐国愚, 闫晓龙, 张一丹. 基于动态上采样的轻量级生成对抗网络DU-FastGAN[J]. 《计算机应用》唯一官方网站, 2025, 45(10): 3067-3073.
[4]	更藏措毛null, 黄鹤鸣. 基于多视角注意力的异构双分支解码单通道语音增强[J]. 《计算机应用》唯一官方网站, 2025, 45(10): 3284-3293.
[5]	尤昕源, 王恒. 基于门控膨胀卷积循环网络的单声道语音增强[J]. 《计算机应用》唯一官方网站, 2024, 44(4): 1317-1324.
[6]	陈俊韬, 朱子奇. 基于多尺度特征提取与融合的图像复制-粘贴伪造检测[J]. 《计算机应用》唯一官方网站, 2023, 43(9): 2919-2924.
[7]	高建清, 屠彦辉, 马峰, 付中华. 基于渐进比率掩蔽目标的自适应噪声估计方法[J]. 《计算机应用》唯一官方网站, 2023, 43(4): 1303-1308.
[8]	张江峰, 闫涛, 陈斌, 钱宇华, 宋艳涛. 全局时空特征耦合的多景深三维形貌重建[J]. 《计算机应用》唯一官方网站, 2023, 43(3): 894-902.
[9]	金玉堂, 王以松, 王丽会, 赵鹏利. 基于多尺度阶梯时频Conformer GAN的语音增强算法[J]. 《计算机应用》唯一官方网站, 2023, 43(11): 3607-3615.
[10]	余本年, 詹永照, 毛启容, 董文龙, 刘洪麟. 面向语音增强的双复数卷积注意聚合递归网络[J]. 《计算机应用》唯一官方网站, 2023, 43(10): 3217-3224.
[11]	吴明晖, 张广洁, 金苍宏. 基于多模态信息融合的时间序列预测模型[J]. 《计算机应用》唯一官方网站, 2022, 42(8): 2326-2332.
[12]	肖勇, 郑楷洪, 郑镇境, 钱斌, 李森, 马千里. 基于多尺度跳跃深度长短期记忆网络的短期多变量负荷预测[J]. 计算机应用, 2021, 41(1): 231-236.
[13]	龙超, 曾庆宁, 罗瀛. 基于噪声抵消与波束形成的小阵语音增强[J]. 计算机应用, 2020, 40(8): 2386-2391.
[14]	代强, 程曦, 王永梅, 牛子未, 刘飞. 基于轻量自动残差缩放网络的图像超分辨率重建[J]. 计算机应用, 2020, 40(5): 1446-1452.
[15]	吴庆贺, 吴海锋, 沈勇, 曾玉. 工业噪声环境下多麦状态空间模型语音增强算法[J]. 计算机应用, 2020, 40(5): 1476-1482.