Progressive ratio mask-based adaptive noise estimation method

doi:10.11772/j.issn.1001-9081.2022030384

Journal of Computer Applications, 2023, Vol. 43, Issue (4): 1303-1308. DOI: 10.11772/j.issn.1001-9081.2022030384

Topic: Multimedia computing and computer simulation

Jianqing GAO¹, Yanhui TU¹, Feng MA¹, Zhonghua FU²

1. iFLYTEK Company Limited, Hefei Anhui 230088, China
2. Xi'an iFLYTEK Hyper-brain Information Technology Company Limited, Xi'an Shaanxi 710000, China

Received: 2022-03-30  Revised: 2022-09-05  Accepted: 2022-09-05  Online: 2023-04-11  Published: 2023-04-10
Contact: Jianqing GAO
About the authors: TU Yanhui, born in 1990 in Lu'an, Anhui, Ph. D., engineer, CCF member. His research interests include speech enhancement and speech recognition.
MA Feng, born in 1986 in Hefei, Anhui, M. S., engineer. His research interests include speech enhancement.
FU Zhonghua, born in 1977 in Shiyan, Hubei, Ph. D., associate professor, CCF member. His research interests include speech and audio signal processing and voiceprint recognition.
Supported by: Technological Innovation 2030 "New Generation Artificial Intelligence" Major Project (2018AAA0102200)

Abstract

Deep learning based speech enhancement algorithms typically outperform traditional noise suppression based ones. However, they often fail when the training data and the test data are mismatched. To address this problem, a novel Progressive Ratio Mask (PRM)-based Adaptive Noise Estimation (PRM-ANE) method is proposed and used as the preprocessing front end of a speech recognition system. The method combines two algorithms: the Improved Minima Controlled Recursive Averaging (IMCRA) algorithm, which tracks noise at the frame level, and a progressive learning algorithm, which learns the complex nonlinear mapping between noise and speech. First, a two-Dimensional Convolutional Neural Network (2D-CNN) is used to learn PRMs with progressively increasing Signal-to-Noise Ratio (SNR). Second, the PRMs estimated at the utterance level are combined with a conventional frame-level speech enhancement algorithm to perform speech enhancement. Finally, the enhanced speech, obtained by fusing information at multiple levels, is fed directly into the speech recognition system to improve its performance. Experimental results on the CHiME-4 real test set show that the proposed method achieves a Word Error Rate (WER) of 7.42%, a relative reduction of 51.41% compared with the IMCRA speech enhancement method, indicating that the proposed method effectively improves the performance of downstream recognition tasks.

Key words: speech enhancement, deep learning, Progressive Ratio Mask (PRM), speech recognition, CHiME-4 challenge
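
The abstract describes the PRM only in words. As a rough illustration, and not the authors' definition, the sketch below assumes that a PRM is the ratio of an intermediate target's power (clean speech plus noise attenuated by a chosen SNR gain) to the noisy power. The function name `progressive_ratio_masks`, the gain values, and the assumption that speech and noise powers simply add are all hypothetical.

```python
import numpy as np


def progressive_ratio_masks(speech_pow, noise_pow, snr_gains_db=(10.0, 20.0)):
    """Return one ratio mask per SNR gain (assumed definition, see lead-in).

    speech_pow, noise_pow: arrays of power spectra with the same shape,
    e.g. (frames, frequency_bins).
    snr_gains_db: hypothetical SNR improvements of the intermediate targets.
    """
    # Assumption: speech and noise are uncorrelated, so their powers add.
    noisy_pow = speech_pow + noise_pow
    masks = []
    for gain_db in snr_gains_db:
        # Raising the SNR by gain_db is modelled as attenuating the noise power.
        attenuation = 10.0 ** (-gain_db / 10.0)
        target_pow = speech_pow + attenuation * noise_pow
        masks.append(target_pow / np.maximum(noisy_pow, 1e-12))
    return masks


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.random((4, 257))
    noise = rng.random((4, 257))
    for gain, mask in zip((10.0, 20.0), progressive_ratio_masks(speech, noise)):
        print(gain, float(mask.min()), float(mask.max()))
```

Under this assumed definition every mask lies in (0, 1], and a larger SNR gain gives a smaller mask value in each time-frequency bin, approaching the ideal ratio mask as the gain grows.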

CLC number: TN912.35

Jianqing GAO, Yanhui TU, Feng MA, Zhonghua FU. Progressive ratio mask-based adaptive noise estimation method[J]. Journal of Computer Applications, 2023, 43(4): 1303-1308.

Figures/Tables: 6

Fig. 1 Framework of the proposed method

Fig. 2 Multi-target output network based on 2D-CNN-MT

Fig. 3 Loss convergence curves of different models on development set

Fig. 4 Comparison of PESQ and STOI of different models under different frame number configurations

Tab. 1 Input dimensions of different acoustic context sizes (number of input frames, 2τ+1)

Frames	Dimension
1	257
3	771
5	1285
7	1799
11	2827
15	3855
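
Every dimension in Tab. 1 equals 257 times the number of input frames, so the table can be reproduced by splicing each frame with τ neighbours on either side. The sketch below is a minimal illustration: the names `input_dim` and `splice_frames` are hypothetical, and padding the utterance edges by repeating the first and last frame is an assumption, since the page does not say how edges are handled.

```python
import numpy as np

FREQ_BINS = 257  # per-frame feature size, taken from the 1-frame row of Tab. 1


def input_dim(tau: int, freq_bins: int = FREQ_BINS) -> int:
    """Dimension of an input made of 2*tau + 1 spliced frames."""
    if tau < 0:
        raise ValueError("tau must be non-negative")
    return (2 * tau + 1) * freq_bins


def splice_frames(feats: np.ndarray, tau: int) -> np.ndarray:
    """Stack each frame with tau neighbours on both sides.

    feats has shape (frames, freq_bins); the result has shape
    (frames, (2*tau + 1) * freq_bins). Edge frames are repeated (assumption).
    """
    num_frames = feats.shape[0]
    padded = np.pad(feats, ((tau, tau), (0, 0)), mode="edge")
    shifted = [padded[i:i + num_frames] for i in range(2 * tau + 1)]
    return np.concatenate(shifted, axis=1)


if __name__ == "__main__":
    for tau in (0, 1, 2, 3, 5, 7):
        print(2 * tau + 1, input_dim(tau))
```

For example, `splice_frames` applied to a (100, 257) feature matrix with `tau=3` gives a (100, 1799) matrix, matching the 7-frame row.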

Tab. 2 Comparison of WER (%) of different enhancement methods on real test set

Method	BUS	CAF	PED	STR	Average
Noisy	19.05	12.35	9.34	7.81	12.14
IMCRA	24.40	16.62	11.79	8.26	15.27
DNN	26.19	18.95	12.67	9.12	16.73
LSTM	22.51	14.76	11.03	7.98	14.07
2D-CNN-IRM	16.95	11.73	7.98	7.12	10.95
Method of Ref. [17]	15.76	9.95	7.15	6.92	9.95
2D-CNN-MT-PRM1	13.18	8.39	6.46	5.04	8.27
2D-CNN-MT-PRM2	14.75	9.21	7.14	5.97	9.27
2D-CNN-MT-PRM3	15.45	11.01	7.32	6.81	10.15
PRM-ANE	11.37	7.86	5.89	4.57	7.42
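
The 51.41% figure quoted in the abstract is a relative reduction computed from the average WER column of Tab. 2. The short sketch below reproduces that arithmetic from the reported averages; the dictionary layout and the function name `relative_reduction` are illustrative choices, not part of the paper.

```python
# Average WER (%) on the CHiME-4 real test set, copied from Tab. 2.
AVERAGE_WER = {
    "Noisy": 12.14,
    "IMCRA": 15.27,
    "2D-CNN-IRM": 10.95,
    "PRM-ANE": 7.42,
}


def relative_reduction(baseline_wer: float, proposed_wer: float) -> float:
    """Relative WER reduction of the proposed system over a baseline, in percent."""
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer


if __name__ == "__main__":
    proposed = AVERAGE_WER["PRM-ANE"]
    for name, wer in AVERAGE_WER.items():
        if name != "PRM-ANE":
            print(f"{name}: {relative_reduction(wer, proposed):.2f}% relative reduction")
```

Against IMCRA this gives (15.27 - 7.42) / 15.27, which rounds to the 51.41% stated in the abstract. Note that the rounded averages from the table are used here, as that is what reproduces the published figure.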

References: 24

1	BROWN G J, COOKE M. Computational auditory scene analysis[J]. Computer Speech and Language, 1994, 8(4): 297-336. DOI: 10.1006/csla.1994.1016
2	WU Z Y, ZHANG Z Y, LI X, et al. The research advance of auditory scene analysis[J]. Journal of Circuits and Systems, 2001, 6(2): 68-73. DOI: 10.3969/j.issn.1007-0249.2001.02.015
3	CHANG J H, JO Q H, KIM D K, et al. Global soft decision employing support vector machine for speech enhancement[J]. IEEE Signal Processing Letters, 2009, 16(1): 57-60. DOI: 10.1109/lsp.2008.2008574
4	WILSON K W, RAJ B, SMARAGDIS P, et al. Speech denoising using nonnegative matrix factorization with priors[C]// Proceedings of the 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2008: 4029-4032. DOI: 10.1109/icassp.2008.4518538
5	WILSON K W, RAJ B, SMARAGDIS P. Regularized non-negative matrix factorization with temporal dependencies for speech denoising[C]// Proceedings of the INTERSPEECH 2008. [S.l.]: International Speech Communication Association, 2008: 411-414. DOI: 10.21437/interspeech.2008-49
6	SCHMIDT M N, LARSEN J, HSIAO F T. Wind noise reduction using non-negative sparse coding[C]// Proceedings of the 2007 IEEE Workshop on Machine Learning for Signal Processing. Piscataway: IEEE, 2007: 431-436. DOI: 10.1109/mlsp.2007.4414345
7	WANG Y X, WANG D L. Towards scaling up classification-based speech separation[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2013, 21(7): 1381-1390. DOI: 10.1109/tasl.2013.2250961
8	YUAN W H, SUN W Z, XIA B, et al. Improving speech enhancement in unseen noise using deep convolutional neural network[J]. Acta Automatica Sinica, 2018, 44(4): 751-759. DOI: 10.16383/j.aas.2018.c170001
9	XU Y, DU J, DAI L R, et al. A regression approach to speech enhancement based on deep neural networks[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(1): 7-19. DOI: 10.1109/taslp.2014.2364452
10	VINCENT E, GRIBONVAL R, FÉVOTTE C. Performance measurement in blind audio source separation[J]. IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(4): 1462-1469. DOI: 10.1109/tsa.2005.858005
11	WANG Y X, NARAYANAN A, WANG D L. On training targets for supervised speech separation[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014, 22(12): 1849-1858. DOI: 10.1109/taslp.2014.2352935
12	PEARLMUTTER B A. Gradient calculations for dynamic recurrent neural networks: a survey[J]. IEEE Transactions on Neural Networks, 1995, 6(5): 1212-1228. DOI: 10.1109/72.410363
13	WENINGER F, EYBEN F, SCHULLER B. Single-channel speech separation with memory-enhanced recurrent neural networks[C]// Proceedings of the 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2014: 3709-3713. DOI: 10.1109/icassp.2014.6854294
14	WENINGER F, HERSHEY J R, LE ROUX J, et al. Discriminatively trained recurrent neural networks for single-channel speech separation[C]// Proceedings of the 2014 IEEE Global Conference on Signal and Information Processing. Piscataway: IEEE, 2014: 577-581. DOI: 10.1109/globalsip.2014.7032183
15	TU Y H, DU J, LEE C H. 2D-to-2D mask estimation for speech enhancement based on fully convolutional neural network[C]// Proceedings of the 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2020: 6664-6668. DOI: 10.1109/icassp40776.2020.9054615
16	COHEN I. Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging[J]. IEEE Transactions on Speech and Audio Processing, 2003, 11(5): 466-475. DOI: 10.1109/tsa.2003.811544
17	TU Y H. Research on robust speech recognition based on deep learning in adverse environment[D]. Hefei: University of Science and Technology of China, 2019: 111.
18	TANG H, HSU W N, GRONDIN F, et al. A study of enhancement, augmentation, and autoencoder methods for domain adaptation in distant speech recognition[C]// Proceedings of the INTERSPEECH 2018. [S.l.]: International Speech Communication Association, 2018: 2928-2932. DOI: 10.21437/interspeech.2018-2030
19	GAO T, DU J, DAI L R, et al. Densely connected progressive learning for LSTM-based speech enhancement[C]// Proceedings of the 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2018: 5054-5058. DOI: 10.1109/icassp.2018.8461861
20	TU Y H, DU J, GAO T, et al. A multi-target SNR-progressive learning approach to regression based speech enhancement[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 1608-1619. DOI: 10.1109/taslp.2020.2996503
21	KINGMA D P, BA J L. Adam: a method for stochastic optimization[EB/OL]. (2017-01-30)[2022-01-03].
22	SUN L, DU J, DAI L R, et al. Multiple-target deep learning for LSTM-RNN based speech enhancement[C]// Proceedings of the 2017 Hands-free Speech Communications and Microphone Arrays. Piscataway: IEEE, 2017: 136-140. DOI: 10.1109/hscma.2017.7895577
23	ZHOU N, DU J, TU Y H, et al. A speech enhancement neural network architecture with SNR-progressive multi-target learning for robust speech recognition[C]// Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Piscataway: IEEE, 2019: 873-877. DOI: 10.1109/apsipaasc47483.2019.9023157
24	VINCENT E, WATANABE S, NUGRAHA A A, et al. An analysis of environment, microphone and data simulation mismatches in robust speech recognition[J]. Computer Speech and Language, 2017, 46: 535-557. DOI: 10.1016/j.csl.2016.11.005

