Hierarchical speech recognition model in multi-noise environment

doi:10.11772/j.issn.1001-9081.2017112678

Abstract

To address speech recognition in multi-noise environments, a hierarchical speech recognition model that treats environmental noise as the context of speech recognition is proposed. The model consists of two layers: a noisy speech classification model and acoustic models for specific noise environments. The noisy speech classification model reduces the mismatch between training data and test data, which removes the noise-stability requirement of feature-space methods and overcomes the low recognition rate of traditional multi-style training in certain noise environments. The acoustic models are built with a Deep Neural Network (DNN), which further strengthens their ability to distinguish noise from speech and thereby improves the noise robustness of speech recognition in the model space. In the experiments, the proposed model was compared with a baseline model obtained by multi-style training. The results show that the proposed hierarchical model reduces the Word Error Rate (WER) by a relative 20.3% compared with the baseline, indicating that it helps improve the noise robustness of speech recognition.
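The two-layer structure described in the abstract can be pictured as a dispatch step followed by decoding. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the class name `HierarchicalRecognizer`, the noise labels `"white"` and `"babble"`, the threshold-based `toy_classifier`, and the string-returning stand-in models are all hypothetical. The helper `relative_wer_reduction` shows how a relative WER figure such as the reported 20.3% is computed; the WER values used to exercise it are made up for illustration and are not results from the paper.

```python
from typing import Callable, Dict, Sequence

Features = Sequence[float]


class HierarchicalRecognizer:
    """Layer 1 picks a noise environment; layer 2 decodes with the matching model."""

    def __init__(
        self,
        classify_noise: Callable[[Features], str],
        acoustic_models: Dict[str, Callable[[Features], str]],
    ) -> None:
        self.classify_noise = classify_noise
        self.acoustic_models = acoustic_models

    def recognize(self, features: Features) -> str:
        # Layer 1: noisy speech classification yields a noise label.
        noise_type = self.classify_noise(features)
        # Layer 2: decode with the acoustic model trained for that environment.
        # An unknown label raises KeyError in this sketch.
        return self.acoustic_models[noise_type](features)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative reduction: (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


def toy_classifier(features: Features) -> str:
    # Hypothetical stand-in: a real classifier would be a trained model.
    mean_level = sum(features) / len(features)
    return "babble" if mean_level > 0.5 else "white"


# Hypothetical stand-ins for DNN acoustic models, one per noise environment.
toy_models: Dict[str, Callable[[Features], str]] = {
    "white": lambda f: "transcript from white-noise model",
    "babble": lambda f: "transcript from babble-noise model",
}

recognizer = HierarchicalRecognizer(toy_classifier, toy_models)
```

Calling `recognizer.recognize(features)` routes each utterance to a single environment-specific model, which is the sense in which the classification layer narrows the gap between training and test conditions. Note that a relative reduction divides by the baseline WER, so it is not the same as an absolute drop of 20.3 percentage points.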

Keywords: speech recognition, noise robustness, environmental noise, acoustic model, Deep Neural Network (DNN)

CLC Number: TP391.4

CAO Jingjing, XU Jieping, SHAO Shengqi. Hierarchical speech recognition model in multi-noise environment[J]. Journal of Computer Applications, 2018, 38(6): 1790-1794.
