Journal of Computer Applications, 2018, Vol. 38, Issue (9): 2495-2499. DOI: 10.11772/j.issn.1001-9081.2018020402

• Artificial Intelligence •

End-to-end Chinese speech recognition system using bidirectional long short-term memory networks and weighted finite-state transducers

YAO Yu, RYAD Chellali

  1. College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing Jiangsu 211816, China
  • Received: 2018-03-01 Revised: 2018-04-23 Online: 2018-09-10 Published: 2018-09-06
  • Contact: RYAD Chellali
  • About the authors: YAO Yu (born 1991), male, from Zhenjiang, Jiangsu, master's student; main research interests: automatic speech recognition and deep learning. RYAD Chellali (born 1964), male, French national, professor, Ph.D.; main research interests: machine learning, computer audition, and robot kinematics.

Abstract: Hidden Markov Models (HMM) rely on unreasonable conditional assumptions in speech recognition. To address this, the sequence modeling ability of recurrent neural networks was studied further, and an acoustic model based on Bidirectional Long Short-Term Memory (BLSTM) neural networks was proposed. The Connectionist Temporal Classification (CTC) training criterion was applied to train this acoustic model, yielding an end-to-end Chinese speech recognition system that does not depend on HMM. A speech decoding method based on Weighted Finite-State Transducers (WFST) was also designed, which solves the problem that the pronunciation lexicon and the language model are difficult to integrate into the decoding process. Compared with a traditional GMM-HMM system and a hybrid DNN-HMM system, the experimental results show that the end-to-end system not only clearly reduces the recognition error rate but also greatly increases speech decoding speed, indicating that the proposed acoustic model effectively enhances model discrimination and optimizes the system structure.

Key words: speech recognition, Long Short-Term Memory (LSTM) neural network, Connectionist Temporal Classification (CTC), Weighted Finite-State Transducer (WFST), end-to-end system
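
As a rough illustration of the CTC criterion named in the abstract, the sketch below shows the standard CTC collapse rule: a frame-level label path is mapped to an output sequence by merging consecutive repeats and then removing the blank symbol. This is not the decoder described in the paper, which uses WFST-based decoding with a lexicon and language model; greedy best-path decoding is swapped in here only to keep the example short. The blank symbol `_`, the pinyin-style labels, the posterior values, and the function names `ctc_collapse` and `greedy_decode` are all assumptions made for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real systems usually reserve an index


def ctc_collapse(path, blank=BLANK):
    """Map a frame-level CTC path to a label sequence.

    Step 1: merge runs of identical consecutive labels.
    Step 2: drop every blank symbol.
    """
    merged = [label for label, _ in groupby(path)]
    return [label for label in merged if label != blank]


def greedy_decode(frame_posteriors, labels, blank=BLANK):
    """Best-path decoding: take the most probable label in each frame,
    then apply the CTC collapse rule. No lexicon or language model is used."""
    best_path = []
    for posterior in frame_posteriors:
        best_index = max(range(len(posterior)), key=posterior.__getitem__)
        best_path.append(labels[best_index])
    return ctc_collapse(best_path, blank)


# Illustrative label inventory and per-frame posteriors (made-up values).
labels = [BLANK, "ni3", "hao3"]
posteriors = [
    [0.8, 0.1, 0.1],    # blank
    [0.1, 0.8, 0.1],    # ni3
    [0.2, 0.7, 0.1],    # ni3 (repeat, merged with the previous frame)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],    # hao3
]
decoded = greedy_decode(posteriors, labels)
```

A blank between two identical labels keeps them distinct: `ctc_collapse(["a", "_", "a"])` yields two `a` labels, whereas `ctc_collapse(["a", "a"])` yields one. This is what lets CTC represent repeated output symbols without a frame-level alignment.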

CLC Number: