计算机应用 (Journal of Computer Applications) ›› 2016, Vol. 36 ›› Issue (9): 2609-2615. DOI: 10.11772/j.issn.1001-9081.2016.09.2609

• Virtual Reality and Digital Media •

低资源语音识别中融合多流特征的卷积神经网络声学建模方法

秦楚雄, 张连海   

  1. 信息工程大学 信息系统工程学院, 郑州 450001
  • 收稿日期:2016-02-02 修回日期:2016-03-29 出版日期:2016-09-10 发布日期:2016-09-08
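  • Corresponding author: QIN Chuxiong
  • About the authors: QIN Chuxiong (born 1991), male, from Zhangqiu, Shandong, is a master's student whose research interests include intelligent information processing and speech signal processing; ZHANG Lianhai (born 1971), male, from Shanxian, Shandong, is an associate professor with a master's degree whose research interests include speech signal processing and speech recognition.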
  • 基金资助:
    国家自然科学基金资助项目(61175017,61403415)。

Acoustic modeling approach of multi-stream feature incorporated convolutional neural network for low-resource speech recognition

QIN Chuxiong, ZHANG Lianhai   

  1. School of Information System Engineering, Information Engineering University, Zhengzhou Henan 450001, China
  • Received: 2016-02-02 Revised: 2016-03-29 Online: 2016-09-10 Published: 2016-09-08
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61175017, 61403415).

摘要: 针对卷积神经网络(CNN)声学建模参数在低资源训练数据条件下的语音识别任务中存在训练不充分的问题,提出一种利用多流特征提升低资源卷积神经网络声学模型性能的方法。首先,为了在低资源声学建模过程中充分利用有限训练数据中更多数量的声学特征,先对训练数据提取几类不同的特征;其次,对每一类特征分别构建卷积子网络,形成一个并行结构,使得多特征数据在概率分布上得以规整;然后通过在并行卷积子网络之上加入全连接层进行融合,从而得到一种新的卷积神经网络声学模型;最后,基于该声学模型搭建低资源语音识别系统。实验结果表明,并行卷积层子网络可以将不同特征空间规整得更为相似,且该方法相对传统多特征拼接方法和单特征CNN建模方法分别提升了3.27%和2.08%的识别率;当引入多语言训练时,该方法依然适用,且识别率分别相对提升了5.73%和4.57%。

关键词: 低资源语音识别, 卷积神经网络, 特征规整, 多流特征

Abstract: To address the insufficient training of Convolutional Neural Network (CNN) acoustic model parameters when training data are scarce, a method is proposed that uses multi-stream features to improve CNN acoustic modeling for low-resource speech recognition. Firstly, to exploit more of the acoustic information available in the limited training data, several different types of features are extracted from the training data. Secondly, a convolutional subnetwork is built for each feature type, forming a parallel structure that normalizes the distributions of the different features. Then, fully connected layers are added on top of the parallel convolutional subnetworks to fuse the multi-stream features, yielding a new CNN acoustic model. Finally, a low-resource speech recognition system is built on this acoustic model. Experimental results show that the parallel convolutional subnetworks normalize the different feature spaces so that they become more similar, and that the proposed method improves recognition accuracy by 3.27% and 2.08% compared with the traditional multi-feature splicing approach and the single-feature baseline CNN system, respectively. Furthermore, the method remains applicable when multilingual training is introduced, with relative improvements in recognition accuracy of 5.73% and 4.57%, respectively.
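The structure outlined in the abstract (one convolutional subnetwork per feature stream, with fully connected layers on top to fuse them) can be illustrated with a minimal forward-pass sketch in plain Python. This is not the authors' implementation. The stream names ("fbank", "mfcc", "plp"), the dimensions, the kernel sizes, the single-frame input, the single fusion layer, and the random weights are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
import random


def conv_subnetwork(frame, kernels, pool=2):
    """One stream's subnetwork: 1-D convolution over the feature axis, ReLU, max pooling."""
    features = []
    for kernel in kernels:
        width = len(kernel)
        # Valid convolution (no padding) followed by ReLU.
        fmap = [
            max(0.0, sum(w * x for w, x in zip(kernel, frame[i:i + width])))
            for i in range(len(frame) - width + 1)
        ]
        # Non-overlapping max pooling; a trailing partial window is dropped.
        features.extend(
            max(fmap[i:i + pool]) for i in range(0, len(fmap) - pool + 1, pool)
        )
    return features


def fully_connected_softmax(vector, weights, biases):
    """Fusion layer: affine transform followed by a numerically stable softmax."""
    logits = [sum(w * v for w, v in zip(row, vector)) + b
              for row, b in zip(weights, biases)]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multi_stream_forward(streams, stream_kernels, fc_weights, fc_biases):
    """Run each stream through its own subnetwork, concatenate, then fuse."""
    fused = []
    for name, frame in streams.items():
        fused.extend(conv_subnetwork(frame, stream_kernels[name]))
    return fully_connected_softmax(fused, fc_weights, fc_biases)


# Hypothetical setup: three feature streams for one frame, random weights.
rng = random.Random(0)
stream_dims = {"fbank": 40, "mfcc": 13, "plp": 13}  # assumed, not from the paper
num_kernels, kernel_width, num_states = 4, 5, 10     # assumed, not from the paper

streams = {name: [rng.gauss(0, 1) for _ in range(dim)]
           for name, dim in stream_dims.items()}
stream_kernels = {
    name: [[rng.uniform(-0.1, 0.1) for _ in range(kernel_width)]
           for _ in range(num_kernels)]
    for name in stream_dims
}
# Size of the concatenated subnetwork outputs, i.e. the fusion layer's input.
fused_len = sum(len(conv_subnetwork(streams[n], stream_kernels[n])) for n in streams)
fc_weights = [[rng.uniform(-0.1, 0.1) for _ in range(fused_len)]
              for _ in range(num_states)]
fc_biases = [0.0] * num_states

posteriors = multi_stream_forward(streams, stream_kernels, fc_weights, fc_biases)
print(len(posteriors), round(sum(posteriors), 6))
```

Each stream keeps its own kernels, so streams of different dimensionality are mapped independently before fusion. A trained system would learn all weights jointly by backpropagation and would typically operate on multi-frame context windows rather than single frames.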

Key words: low-resource speech recognition, Convolutional Neural Network (CNN), feature normalization, multi-stream feature

CLC number: