Progressive ratio mask-based adaptive noise estimation method

doi:10.11772/j.issn.1001-9081.2022030384

Journal of Computer Applications, 2023, Vol. 43, Issue (4): 1303-1308. DOI: 10.11772/j.issn.1001-9081.2022030384

• Multimedia computing and computer simulation •

Jianqing GAO¹, Yanhui TU¹, Feng MA¹, Zhonghua FU²

1. iFLYTEK Company Limited, Hefei, Anhui 230088, China
2. Xi'an iFLYTEK Hyper-brain Information Technology Company Limited, Xi'an, Shaanxi 710000, China

Received: 2022-03-30 Revised: 2022-09-05 Accepted: 2022-09-05 Online: 2023-04-11 Published: 2023-04-10
Contact: Jianqing GAO
About the authors: TU Yanhui, born in 1990 in Lu'an, Anhui, Ph. D., engineer, CCF member. His research interests include speech enhancement and speech recognition.
MA Feng, born in 1986 in Hefei, Anhui, M. S., engineer. His research interests include speech enhancement.
FU Zhonghua, born in 1977 in Shiyan, Hubei, Ph. D., associate professor, CCF member. His research interests include speech and audio signal processing and voiceprint recognition.
Supported by:
Technological Innovation 2030-"New Generation Artificial Intelligence" Major Project (2018AAA0102200)

Abstract

Deep learning based speech enhancement algorithms typically outperform traditional noise suppression based speech enhancement algorithms. However, they usually do not work well when the training data and the test data are mismatched. To address this problem, a novel Progressive Ratio Mask (PRM)-based Adaptive Noise Estimation (PRM-ANE) method was proposed and used as the preprocessing front end of a speech recognition system. The method combines the Improved Minima Controlled Recursive Averaging (IMCRA) algorithm, which tracks noise at the frame level, with a progressive learning algorithm, which learns the complex nonlinear mapping between noise and speech. Firstly, a Two-Dimensional Convolutional Neural Network (2D-CNN) was adopted to learn PRMs whose target Signal-to-Noise Ratio (SNR) increases progressively. Then, the utterance-level PRM estimates were combined with the conventional frame-level speech enhancement algorithm to perform speech enhancement. Finally, the enhanced speech based on this multi-level information fusion was fed directly into the speech recognition system to improve its performance. Experimental results on the CHiME-4 real test set show that the proposed method achieves an average Word Error Rate (WER) of 7.42%, a relative reduction of 51.41% compared with the IMCRA speech enhancement method, which indicates that the proposed method can effectively improve the performance of downstream recognition tasks.

Key words: speech enhancement, deep learning, Progressive Ratio Mask (PRM), speech recognition, CHiME-4 challenge
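
The abstract does not give the PRM formula. As a rough illustration only, the sketch below assumes a common ratio-mask form in which the stage target keeps the clean speech power S and attenuates the noise power N by a fixed SNR gain, so the mask in one time-frequency bin is (S + g·N)/(S + N) with g = 10^(-gain/10). Under this assumption an infinite gain reduces to the ideal ratio mask S/(S + N). The function name, the 10 dB step and the power values are hypothetical and are not taken from the paper.

```python
import math


def progressive_ratio_mask(speech_pow: float, noise_pow: float, snr_gain_db: float) -> float:
    """Assumed PRM for one time-frequency bin: target power over noisy power.

    The target keeps the speech power and scales the noise power down by
    snr_gain_db; math.inf gives the ideal ratio mask S / (S + N).
    """
    noisy_pow = speech_pow + noise_pow
    if noisy_pow <= 0.0:
        return 0.0  # silent bin: nothing to keep
    if math.isinf(snr_gain_db):
        residual = 0.0  # noise fully removed in the final target
    else:
        residual = noise_pow * 10.0 ** (-snr_gain_db / 10.0)
    return (speech_pow + residual) / noisy_pow


if __name__ == "__main__":
    # Hypothetical bin at 0 dB input SNR (equal speech and noise power).
    for gain in (10.0, 20.0, math.inf):
        print(gain, progressive_ratio_mask(1.0, 1.0, gain))
```

With this form, each later stage removes more noise, so the mask value in a noisy bin shrinks from stage to stage and approaches the ideal ratio mask.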

CLC number: TN912.35

Jianqing GAO, Yanhui TU, Feng MA, Zhonghua FU. Progressive ratio mask-based adaptive noise estimation method[J]. Journal of Computer Applications, 2023, 43(4): 1303-1308.

Figures and tables (6)

Fig. 1 Framework of the proposed method

Fig. 2 Multi-target output network based on 2D-CNN-MT

Fig. 3 Loss convergence curves of different models on the development set

Fig. 4 Comparison of PESQ and STOI of different models under different frame number configurations

Tab. 1 Input dimensions for different acoustic context sizes (number of input frames, 2τ+1)

Frames	Dimension	Frames	Dimension
1	257	7	1 799
3	771	11	2 827
5	1 285	15	3 855
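
The dimensions in Tab. 1 are consistent with splicing 2τ+1 frames of a fixed-size spectral feature. A minimal sketch, assuming each frame contributes 257 features (inferred from the single-frame row; the function name is hypothetical):

```python
FREQ_BINS = 257  # assumed per-frame feature size, taken from the 1-frame row of Tab. 1


def spliced_input_dim(tau: int, freq_bins: int = FREQ_BINS) -> int:
    """Input dimension when tau context frames are spliced on each side."""
    if tau < 0:
        raise ValueError("tau must be non-negative")
    return (2 * tau + 1) * freq_bins


if __name__ == "__main__":
    # tau = 0, 1, 2, 3, 5, 7 gives 1, 3, 5, 7, 11 and 15 input frames.
    for tau in (0, 1, 2, 3, 5, 7):
        print(f"{2 * tau + 1:>2} frames -> {spliced_input_dim(tau)} dims")
```

Because 2τ+1 is always odd, a dimension of 771 corresponds to 3 input frames (τ = 1).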

Tab. 2 WER (%) of different enhancement methods on the real test set

Method	BUS	CAF	PED	STR	Average
Noisy	19.05	12.35	9.34	7.81	12.14
IMCRA	24.40	16.62	11.79	8.26	15.27
DNN	26.19	18.95	12.67	9.12	16.73
LSTM	22.51	14.76	11.03	7.98	14.07
2D-CNN-IRM	16.95	11.73	7.98	7.12	10.95
Method of Ref. [17]	15.76	9.95	7.15	6.92	9.95
2D-CNN-MT-PRM1	13.18	8.39	6.46	5.04	8.27
2D-CNN-MT-PRM2	14.75	9.21	7.14	5.97	9.27
2D-CNN-MT-PRM3	15.45	11.01	7.32	6.81	10.15
PRM-ANE	11.37	7.86	5.89	4.57	7.42
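
The 51.41% figure quoted in the abstract can be reproduced from the average column of Tab. 2 as a relative WER reduction, (baseline - proposed) / baseline. A small sketch of that arithmetic (the function and dictionary names are illustrative only):

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent, with respect to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer * 100.0


# Average WERs (%) copied from Tab. 2.
AVG_WER = {"Noisy": 12.14, "IMCRA": 15.27, "PRM-ANE": 7.42}

if __name__ == "__main__":
    for name in ("Noisy", "IMCRA"):
        gain = relative_reduction(AVG_WER[name], AVG_WER["PRM-ANE"])
        print(f"PRM-ANE vs {name}: {gain:.2f}% relative WER reduction")
```

Against IMCRA this gives (15.27 - 7.42) / 15.27, about 51.41%, so the 7.42% figure is the average WER itself and 51.41% is the relative reduction.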

References (24)

1	BROWN G J, COOKE M. Computational auditory scene analysis [J]. Computer Speech and Language, 1994, 8(4): 297-336. DOI: 10.1006/csla.1994.1016
2	WU Z Y, ZHANG Z Y, LI X, et al. The research advance of auditory scene analysis [J]. Journal of Circuits and Systems, 2001, 6(2): 68-73 (in Chinese). DOI: 10.3969/j.issn.1007-0249.2001.02.015
3	CHANG J H, JO Q H, KIM D K, et al. Global soft decision employing support vector machine for speech enhancement [J]. IEEE Signal Processing Letters, 2009, 16(1): 57-60. DOI: 10.1109/lsp.2008.2008574
4	WILSON K W, RAJ B, SMARAGDIS P, et al. Speech denoising using nonnegative matrix factorization with priors [C]// Proceedings of the 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2008: 4029-4032. DOI: 10.1109/icassp.2008.4518538
5	WILSON K W, RAJ B, SMARAGDIS P. Regularized non-negative matrix factorization with temporal dependencies for speech denoising [C]// Proceedings of the INTERSPEECH 2008. [S.l.]: International Speech Communication Association, 2008: 411-414. DOI: 10.21437/interspeech.2008-49
6	SCHMIDT M N, LARSEN J, HSIAO F T. Wind noise reduction using non-negative sparse coding [C]// Proceedings of the 2007 IEEE Workshop on Machine Learning for Signal Processing. Piscataway: IEEE, 2007: 431-436. DOI: 10.1109/mlsp.2007.4414345
7	WANG Y X, WANG D L. Towards scaling up classification-based speech separation [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2013, 21(7): 1381-1390. DOI: 10.1109/tasl.2013.2250961
8	YUAN W H, SUN W Z, XIA B, et al. Improving speech enhancement in unseen noise using deep convolutional neural network [J]. Acta Automatica Sinica, 2018, 44(4): 751-759 (in Chinese). DOI: 10.16383/j.aas.2018.c170001
9	XU Y, DU J, DAI L R, et al. A regression approach to speech enhancement based on deep neural networks [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2015, 23(1): 7-19. DOI: 10.1109/taslp.2014.2364452
10	VINCENT E, GRIBONVAL R, FÉVOTTE C. Performance measurement in blind audio source separation [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2006, 14(4): 1462-1469. DOI: 10.1109/tsa.2005.858005
11	WANG Y X, NARAYANAN A, WANG D L. On training targets for supervised speech separation [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014, 22(12): 1849-1858. DOI: 10.1109/taslp.2014.2352935
12	PEARLMUTTER B A. Gradient calculations for dynamic recurrent neural networks: a survey [J]. IEEE Transactions on Neural Networks, 1995, 6(5): 1212-1228. DOI: 10.1109/72.410363
13	WENINGER F, EYBEN F, SCHULLER B. Single-channel speech separation with memory-enhanced recurrent neural networks [C]// Proceedings of the 2014 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2014: 3709-3713. DOI: 10.1109/icassp.2014.6854294
14	WENINGER F, HERSHEY J R, LE ROUX J, et al. Discriminatively trained recurrent neural networks for single-channel speech separation [C]// Proceedings of the 2014 IEEE Global Conference on Signal and Information Processing. Piscataway: IEEE, 2014: 577-581. DOI: 10.1109/globalsip.2014.7032183
15	TU Y H, DU J, LEE C H. 2D-to-2D mask estimation for speech enhancement based on fully convolutional neural network [C]// Proceedings of the 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2020: 6664-6668. DOI: 10.1109/icassp40776.2020.9054615
16	COHEN I. Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging [J]. IEEE Transactions on Speech and Audio Processing, 2003, 11(5): 466-475. DOI: 10.1109/tsa.2003.811544
17	TU Y H. Research on robust speech recognition based on deep learning in adverse environment [D]. Hefei: University of Science and Technology of China, 2019: 111 (in Chinese).
18	TANG H, HSU W N, GRONDIN F, et al. A study of enhancement, augmentation, and autoencoder methods for domain adaptation in distant speech recognition [C]// Proceedings of the INTERSPEECH 2018. [S.l.]: International Speech Communication Association, 2018: 2928-2932. DOI: 10.21437/interspeech.2018-2030
19	GAO T, DU J, DAI L R, et al. Densely connected progressive learning for LSTM-based speech enhancement [C]// Proceedings of the 2018 IEEE International Conference on Acoustics, Speech, and Signal Processing. Piscataway: IEEE, 2018: 5054-5058. DOI: 10.1109/icassp.2018.8461861
20	TU Y H, DU J, GAO T, et al. A multi-target SNR-progressive learning approach to regression based speech enhancement [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 1608-1619. DOI: 10.1109/taslp.2020.2996503
21	KINGMA D P, BA J L. Adam: a method for stochastic optimization [EB/OL]. (2017-01-30) [2022-01-03].
22	SUN L, DU J, DAI L R, et al. Multiple-target deep learning for LSTM-RNN based speech enhancement [C]// Proceedings of the 2017 Hands-free Speech Communications and Microphone Arrays. Piscataway: IEEE, 2017: 136-140. DOI: 10.1109/hscma.2017.7895577
23	ZHOU N, DU J, TU Y H, et al. A speech enhancement neural network architecture with SNR-progressive multi-target learning for robust speech recognition [C]// Proceedings of the 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. Piscataway: IEEE, 2019: 873-877. DOI: 10.1109/apsipaasc47483.2019.9023157
24	VINCENT E, WATANABE S, NUGRAHA A A, et al. An analysis of environment, microphone and data simulation mismatches in robust speech recognition [J]. Computer Speech and Language, 2017, 46: 535-557. DOI: 10.1016/j.csl.2016.11.005
