End-to-end Chinese speech recognition system using bidirectional long short-term memory networks and weighted finite-state transducers

Journal of Computer Applications, 2018, Vol. 38, Issue (9): 2495-2499. DOI: 10.11772/j.issn.1001-9081.2018020402

YAO Yu, RYAD Chellali

College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing Jiangsu 211816, China

Received: 2018-03-01 Revised: 2018-04-23 Online: 2018-09-06 Published: 2018-09-10
Contact: RYAD Chellali
About the authors: YAO Yu (born 1991), male, from Zhenjiang, Jiangsu, master's student; main research interests: automatic speech recognition, deep learning. RYAD Chellali (born 1964), male, French, professor, Ph.D.; main research interests: machine learning, computer audition, robot kinematics.

Abstract: To address the unreasonable conditional assumptions that the Hidden Markov Model (HMM) imposes in speech recognition, the sequence modeling ability of recurrent neural networks was studied further, and an acoustic model based on Bidirectional Long Short-Term Memory (BLSTM) neural networks was proposed. The Connectionist Temporal Classification (CTC) training criterion was applied to the training of this acoustic model, yielding an end-to-end Chinese speech recognition system that does not rely on HMM. In addition, a speech decoding method based on the Weighted Finite-State Transducer (WFST) was designed, which resolves the difficulty of integrating the lexicon and the language model into the decoding process. Compared with a traditional GMM-HMM system and a hybrid DNN-HMM system, the experimental results show that the end-to-end system not only clearly reduces the recognition error rate but also greatly increases the speech decoding speed, indicating that the proposed acoustic model can effectively enhance model discrimination and optimize the system structure.

Key words: speech recognition, Long Short-Term Memory (LSTM) neural network, Connectionist Temporal Classification (CTC), Weighted Finite-State Transducer (WFST), end-to-end system

CLC Number: TN912.34
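The abstract relies on the CTC output convention, in which the acoustic model emits one label (or a special blank) per frame and the final label sequence is obtained by merging repeated labels and then deleting blanks. The sketch below illustrates only that collapsing rule, using simple best-path (greedy) decoding as a stand-in; it is not the WFST-based decoder the paper describes, and the blank index, function name, and toy scores are assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label (hypothetical choice)


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks.

    frame_scores: list of per-frame score lists, one score per label.
    """
    # Pick the highest-scoring label index in every frame.
    best_path = [max(range(len(frame)), key=frame.__getitem__)
                 for frame in frame_scores]
    # Merge runs of identical consecutive labels into a single label.
    merged = [label for label, _ in groupby(best_path)]
    # Remove blanks; a blank between two equal labels keeps them distinct.
    return [label for label in merged if label != blank]


# Toy posteriors over {blank, label 1, label 2} for seven frames.
toy_scores = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.60, 0.30],
    [0.10, 0.20, 0.70],
    [0.00, 0.30, 0.70],
    [0.80, 0.10, 0.10],
]
decoded = ctc_greedy_decode(toy_scores)
```

In this toy input the per-frame best path is 1, 1, blank, 1, 2, 2, blank, so the two occurrences of label 1 survive as separate outputs because a blank separates them. A full system of the kind summarized above would instead search over lexicon and language model constraints rather than taking the per-frame maximum.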

YAO Yu, RYAD Chellali. End-to-end Chinese speech recognition system using bidirectional long short-term memory networks and weighted finite-state transducers[J]. Journal of Computer Applications, 2018, 38(9): 2495-2499.

