End-to-end Chinese speech recognition system using bidirectional long short-term memory networks and weighted finite-state transducers

doi:10.11772/j.issn.1001-9081.2018020402

Abstract

Abstract: To address the unreasonable conditional assumptions that the Hidden Markov Model (HMM) imposes on speech recognition, the sequence modeling ability of recurrent neural networks was further studied, and an acoustic model based on Bidirectional Long Short-Term Memory (BLSTM) neural networks was proposed. The Connectionist Temporal Classification (CTC) training criterion was applied to train this acoustic model, yielding an end-to-end Chinese speech recognition system that does not rely on HMM. In addition, a speech decoding method based on the Weighted Finite-State Transducer (WFST) was designed to solve the problem that the lexicon and the language model are difficult to integrate into the decoding process. Experimental comparisons with a traditional GMM-HMM system and a hybrid DNN-HMM system show that the end-to-end system not only significantly reduces the recognition error rate but also greatly increases the speech decoding speed, indicating that the proposed acoustic model can effectively enhance model discrimination and simplify the system structure.
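As a minimal illustration of the CTC criterion named in the abstract, the sketch below shows the standard CTC mapping from a frame-level label path to an output sequence: consecutive repeats are merged, then blank labels are dropped. The function names, the blank index 0, and the toy scores are assumptions made for this example; the system described here decodes with a WFST rather than with the greedy per-frame rule used below.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label (hypothetical choice)


def ctc_collapse(path, blank=BLANK):
    """Map a frame-level label path to a label sequence.

    Standard CTC rule: merge consecutive repeated labels first,
    then remove every blank label.
    """
    merged = [label for label, _ in groupby(path)]
    return [label for label in merged if label != blank]


def greedy_decode(frame_scores, blank=BLANK):
    """Pick the highest-scoring label in each frame, then collapse.

    frame_scores is a list of per-frame score lists, one score per label.
    This is a simple stand-in for decoding, not the WFST-based method.
    """
    best_path = [
        max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores
    ]
    return ctc_collapse(best_path, blank)


if __name__ == "__main__":
    # A blank between two identical labels keeps them as separate outputs.
    print(ctc_collapse([0, 1, 1, 0, 1, 2, 2, 0]))
    print(greedy_decode([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.3, 0.7]]))
```

Because repeats are merged before blanks are removed, a blank is what allows the same label to be emitted twice in a row, which is why the blank symbol is part of the acoustic model's output layer.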

Key words: speech recognition, Long Short-Term Memory (LSTM) neural network, Connectionist Temporal Classification (CTC), Weighted Finite-State Transducer (WFST), end-to-end system

CLC Number: TN912.34

YAO Yu, RYAD Chellali. End-to-end Chinese speech recognition system using bidirectional long short-term memory networks and weighted finite-state transducers[J]. Journal of Computer Applications, 2018, 38(9): 2495-2499.

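YAO Yu, RYAD Chellali. End-to-end Chinese speech recognition system based on bidirectional long short-term memory with connectionist temporal classification and weighted finite-state transducers[J]. Journal of Computer Applications, 2018, 38(9): 2495-2499. (in Chinese)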
