Acoustic modeling approach of multi-stream feature incorporated convolutional neural network for low-resource speech recognition

Journal of Computer Applications, 2016, Vol. 36, Issue 9: 2609-2615. DOI: 10.11772/j.issn.1001-9081.2016.09.2609

QIN Chuxiong, ZHANG Lianhai

School of Information System Engineering, Information Engineering University, Zhengzhou Henan 450001, China

Received: 2016-02-02; Revised: 2016-03-29; Online: 2016-09-10; Published: 2016-09-08
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61175017, 61403415).

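Corresponding author: QIN Chuxiong
About the authors: QIN Chuxiong (born 1991), male, from Zhangqiu, Shandong, master's student; research interests: intelligent information processing and speech signal processing. ZHANG Lianhai (born 1971), male, from Shan County, Shandong, associate professor, M.S.; research interests: speech signal processing and speech recognition.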
Abstract

Abstract: When training data are scarce, the parameters of a Convolutional Neural Network (CNN) acoustic model cannot be trained sufficiently. To address this problem, a method that uses multi-stream features to improve CNN acoustic modeling for low-resource speech recognition was proposed. Firstly, several types of features were extracted from the training data so that more acoustic information could be drawn from the limited data. Secondly, a convolutional subnetwork was built for each feature type, forming a parallel structure that regularizes the distributions of the different features. Then, fully connected layers were added on top of the parallel convolutional subnetworks to fuse the multi-stream features, yielding a new CNN acoustic model. Finally, a low-resource speech recognition system was built on this acoustic model. Experimental results show that the parallel convolutional subnetworks map the different feature spaces into more similar ones, and that the proposed method improves recognition accuracy by 3.27% and 2.08% over the traditional multi-feature splicing approach and the baseline single-feature CNN system, respectively. The method remains applicable when multilingual training is introduced, where recognition accuracy improves by a relative 5.73% and 4.57%, respectively.

Key words: low-resource speech recognition, Convolutional Neural Network (CNN), feature normalization, multi-stream feature
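
The structure described in the abstract (one convolutional subnetwork per feature stream, fused by fully connected layers) can be illustrated with a minimal NumPy forward pass. This is only a sketch under assumed settings, not the authors' implementation: the two stream dimensions (40 and 30), the filter count and width, the pooling size, the hidden width, and the number of output states are all hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def conv1d_valid(x, kernels):
    """Slide each kernel along the frequency axis (cross-correlation, as in most CNN toolkits).
    x: (freq,), kernels: (n_filters, width) -> (n_filters, freq - width + 1)."""
    width = kernels.shape[1]
    windows = np.lib.stride_tricks.sliding_window_view(x, width)
    return kernels @ windows.T

def max_pool(h, size):
    """Non-overlapping max pooling along the last axis; any remainder is dropped."""
    n = (h.shape[1] // size) * size
    return h[:, :n].reshape(h.shape[0], -1, size).max(axis=2)

def subnetwork(x, kernels, pool=2):
    """One convolutional subnetwork: convolution, ReLU, max pooling, flatten."""
    return max_pool(np.maximum(conv1d_valid(x, kernels), 0.0), pool).ravel()

def init_params(stream_dims, n_filters=8, width=5, pool=2,
                hidden=64, n_states=10, seed=0):
    """Random weights for illustration; all sizes are assumptions."""
    rng = np.random.default_rng(seed)
    kernels = [rng.normal(0.0, 0.1, (n_filters, width)) for _ in stream_dims]
    fused = sum(n_filters * ((d - width + 1) // pool) for d in stream_dims)
    return {
        "kernels": kernels,
        "W1": rng.normal(0.0, 0.1, (hidden, fused)), "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (n_states, hidden)), "b2": np.zeros(n_states),
    }

def forward(streams, params):
    """Run each stream through its own subnetwork, concatenate, then apply FC layers."""
    branches = [subnetwork(x, k) for x, k in zip(streams, params["kernels"])]
    h = np.concatenate(branches)                      # fusion point
    h = np.maximum(params["W1"] @ h + params["b1"], 0.0)
    logits = params["W2"] @ h + params["b2"]
    p = np.exp(logits - logits.max())                 # numerically stable softmax
    return p / p.sum()

# Two hypothetical feature streams for a single frame.
params = init_params([40, 30])
rng = np.random.default_rng(1)
posteriors = forward([rng.normal(size=40), rng.normal(size=30)], params)
print(posteriors.shape)
```

Each stream keeps its own kernels, so streams with different dimensions and value ranges are mapped separately before the shared fully connected layers see them; the traditional splicing approach would instead concatenate the raw features before a single network.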

CLC Number: TN912.34

QIN Chuxiong, ZHANG Lianhai. Acoustic modeling approach of multi-stream feature incorporated convolutional neural network for low-resource speech recognition[J]. Journal of Computer Applications, 2016, 36(9): 2609-2615.
