Hierarchical speech recognition model in multi-noise environment

doi:10.11772/j.issn.1001-9081.2017112678

Journal of Computer Applications, 2018, Vol. 38, Issue (6): 1790-1794.

• Virtual reality and multimedia computing •

CAO Jingjing, XU Jieping, SHAO Shengqi

School of Information, Renmin University of China, Beijing 100872, China

Received: 2017-11-14 Revised: 2018-01-09 Online: 2018-06-13 Published: 2018-06-10
Corresponding author: XU Jieping
About the authors: CAO Jingjing (born 1993), female, from Ma'anshan, Anhui, M.S. candidate, main research interest: speech recognition; XU Jieping (born 1966), female, from Mudanjiang, Heilongjiang, associate professor, Ph.D., CCF member, main research interest: audio information processing; SHAO Shengqi (born 1993), male, from Shenyang, Liaoning, M.S. candidate, main research interest: speech recognition.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61672523).

Abstract

To address speech recognition in multi-noise environments, a hierarchical speech recognition model that treats environmental noise as the context of speech recognition is proposed. The model consists of two layers: a noisy speech classification model and acoustic models for specific noise environments. The noisy speech classification model reduces the mismatch between training data and test data, which removes the constraint on noise stability imposed by feature-space methods and overcomes the low recognition accuracy of traditional multi-type training in certain noise environments. In addition, the acoustic models are built with a Deep Neural Network (DNN), which further strengthens their ability to distinguish noise from speech and thereby improves the noise robustness of speech recognition in the model space. In the experiments, the proposed model was compared with a baseline model obtained by multi-type training. The results show that the proposed hierarchical model reduces the Word Error Rate (WER) by 20.3% relative to the baseline, indicating that the hierarchical model helps improve the noise robustness of speech recognition.

Key words: speech recognition, noise robustness, environmental noise, acoustic model, Deep Neural Network (DNN)
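
The two-layer structure described in the abstract can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions, not the authors' implementation: the class `HierarchicalRecognizer`, the callable interfaces, the noise labels `"car"` and `"street"`, the toy classifier, and the fallback decoder are all hypothetical names introduced here. The helper `relative_wer_reduction` shows how a relative WER reduction is conventionally computed, as (baseline WER - model WER) / baseline WER.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence

Features = Sequence[float]            # stand-in for real acoustic features
Decoder = Callable[[Features], str]   # maps features to a transcript


@dataclass
class HierarchicalRecognizer:
    # Layer 1 (hypothetical interface): predicts a noise-environment label.
    classify_noise: Callable[[Features], str]
    # Layer 2 (hypothetical interface): one acoustic-model decoder per label.
    acoustic_models: Dict[str, Decoder]
    # Assumption, not from the paper: decoder used when a label has no model.
    fallback: Decoder

    def recognize(self, features: Features) -> str:
        # Route the utterance to the decoder matching its predicted noise type.
        label = self.classify_noise(features)
        decoder = self.acoustic_models.get(label, self.fallback)
        return decoder(features)


def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    """Return (baseline_wer - model_wer) / baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - model_wer) / baseline_wer


def toy_classifier(features: Features) -> str:
    # Toy rule standing in for a trained noisy speech classifier.
    return "car" if sum(features) / len(features) > 0.5 else "street"


# Toy decoders standing in for noise-specific DNN acoustic models.
recognizer = HierarchicalRecognizer(
    classify_noise=toy_classifier,
    acoustic_models={
        "car": lambda feats: "<car-model transcript>",
        "street": lambda feats: "<street-model transcript>",
    },
    fallback=lambda feats: "<multi-style transcript>",
)
```

Calling `recognizer.recognize(features)` first runs the classifier and then only the selected decoder, so each acoustic model only needs to handle the noise condition it was trained for. The WER values passed to `relative_wer_reduction` in any experiment would come from actual decoding results; none are supplied here.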

CLC number: TP391.4

CAO Jingjing, XU Jieping, SHAO Shengqi. Hierarchical speech recognition model in multi-noise environment[J]. Journal of Computer Applications, 2018, 38(6): 1790-1794.
