Robust speech recognition technology based on self-supervised knowledge transfer

Journal of Computer Applications, 2022, Vol. 42, Issue (10): 3217-3223. DOI: 10.11772/j.issn.1001-9081.2021050808

• Multimedia computing and computer simulation •

Caitong BAI¹,², Xiaolong CUI²,³, Huiji ZHENG¹,², Ai LI¹,²

^1. Postgraduate Brigade, Engineering University of PAP, Xi'an, Shaanxi 710086, China
^2. Counter-Terrorism Command Information Engineering Research Team, Engineering University of PAP, Xi'an, Shaanxi 710086, China
^3. Urumqi Campus of Engineering University of PAP, Urumqi, Xinjiang 830049, China

Received: 2021-05-20 Revised: 2021-09-13 Accepted: 2021-09-22 Online: 2022-10-14 Published: 2022-10-10
Corresponding author: Xiaolong CUI (787942392@qq.com)
About the authors:
BAI Caitong, born in 1995, from Jinan, Shandong, M. S. candidate. His research interests include deep edge intelligence and robust speech recognition.
CUI Xiaolong, born in 1973, from Changfeng, Anhui, Ph. D., professor. His research interests include command information systems.
ZHENG Huiji, born in 1997, from Chongqing, M. S. candidate. His research interests include edge computing.
LI Ai, born in 1997, from Shaoyang, Hunan, M. S. Her research interests include artificial intelligence.
Supported by:
National Natural Science Foundation of China (U1603261); Netcom Integration Project (LXJH-10(A)-09)

Abstract:

To address the rising cost of labeling neural network training data and the noise interference that limits the performance of speech recognition systems, a training algorithm for robust speech recognition models based on self-supervised knowledge transfer was proposed. Firstly, three hand-crafted features were extracted from the original speech samples in the pre-processing stage. Then, in the training stage, the high-level features generated by the feature extraction network were passed through three shallow networks, each fitting one of the hand-crafted features extracted during pre-processing. At the same time, the feature extraction front-end and the speech recognition back-end were cross-trained, and their loss functions were combined. Finally, through gradient back-propagation, the feature extraction network learned to extract high-level features that are more conducive to noise-robust speech recognition, thereby realizing hand-crafted knowledge transfer and denoising while using the training data efficiently. In the application scenario of military equipment control, the proposed method was tested on noise-augmented versions of three open-source Chinese speech recognition datasets, THCHS-30 (TsingHua Continuous Chinese Speech), AISHELL-1 and ST-CMDS (Surfing Technology Commands), as well as on a military equipment control command dataset. Experimental results show that the word error rate of the proposed method can be reduced to 0.12, and that the method not only trains robust speech recognition models but also improves the utilization rate of training samples through self-supervised knowledge transfer, and can complete equipment control tasks.

Key words: knowledge transfer, robust speech recognition, self-supervised learning, Chinese speech recognition, speech denoising
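
The loss combination described in the abstract can be illustrated with a minimal sketch. The abstract does not state which regression loss the three shallow networks use or how the terms are weighted, so the mean squared error, the single `worker_weight` factor, and the function names below are assumptions made for illustration only. The worker names follow the MFCC, FBANK and WAVE rows of the architecture table in the source.

```python
from typing import Dict, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(
    asr_loss: float,
    worker_preds: Dict[str, Sequence[float]],
    worker_targets: Dict[str, Sequence[float]],
    worker_weight: float = 1.0,  # assumed weighting, not given in the source
) -> float:
    """Add the back-end recognition loss to the summed worker losses.

    Each worker tries to reproduce one hand-crafted feature from the
    front-end's high-level features; MSE is an assumed choice of loss.
    """
    worker_losses = {
        name: mse(worker_preds[name], worker_targets[name])
        for name in worker_targets
    }
    return asr_loss + worker_weight * sum(worker_losses.values())


# Toy values standing in for real network outputs and hand-crafted features.
preds = {"MFCC": [0.0, 1.0], "FBANK": [1.0, 1.0], "WAVE": [0.5, 0.5]}
targets = {"MFCC": [0.0, 0.0], "FBANK": [1.0, 1.0], "WAVE": [0.0, 1.0]}
total = combined_loss(2.0, preds, targets)
print(total)
```

In a real training loop the combined value would be back-propagated through both the back-end and the feature extraction network, which is how the worker targets can shape the learned features.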

CLC number: TP181

Caitong BAI, Xiaolong CUI, Huiji ZHENG, Ai LI. Robust speech recognition technology based on self-supervised knowledge transfer[J]. Journal of Computer Applications, 2022, 42(10): 3217-3223.

Module	Size parameters	Number of parameters/10⁶
Input tensor	(30, 1, 32 000)	-
Gated Block 1	(1, 64, 1, 1)	64
Gated Block 2	(64, 64, 20, 10)	4 096
Gated Block 3	(64, 128, 11, 2)	8 192
Gated Block 4	(128, 128, 11, 1)	16 384
Gated Block 5	(128, 256, 11, 2)	32 768
Gated Block 6	(256, 256, 11, 1)	65 536
Gated Block 7	(256, 512, 11, 2)	131 072
Gated Block 8	(512, 512, 11, 2)	262 144
LSTM	(512)	-
MFCC	(1, 256)	-
FBANK	(1, 256)	-
WAVE	(1, 256)	-
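
The table names its convolutional stages "Gated Block" but this page does not give the gating formula. Assuming the blocks follow the gated linear unit used in gated convolutional networks (an assumption, not confirmed by the source), the gating step multiplies one set of activations by a sigmoid of a second set. The sketch below shows only that element-wise step on plain Python lists; the function names are hypothetical and the convolutions that would produce the two inputs are omitted.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_linear_unit(values: Sequence[float], gates: Sequence[float]) -> List[float]:
    """Element-wise gating: each value is scaled by sigmoid(gate).

    In a gated convolution, `values` and `gates` would come from two
    parallel convolutions over the same input (assumed here).
    """
    if len(values) != len(gates):
        raise ValueError("values and gates must have the same length")
    return [v * sigmoid(g) for v, g in zip(values, gates)]


# A gate of 0 halves the value; a large positive gate passes it through.
out = gated_linear_unit([2.0, -4.0, 1.0], [0.0, 0.0, 100.0])
print(out)
```

The gate lets the network suppress or pass individual activations, which is one plausible reason gating could help with noisy input.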

Module	Size parameters	Number of parameters/10⁶
Input tensor	(64, 161, 601)	6.0
Gated Block 1	(161, 500, 48, 2, 97)	3.8
7 × Gated Block 2	(250, 500, 7, 1)	6.1
Gated Block 3	(250, 2 000, 32, 1)	16.0
Gated Block 4	(1 000, 2 000, 1, 1)	2.0
Conv1d	(1 000, Output Units, 1, 1)	-
Intermediate tensor	(64, 1 000, Output Units)	-
LSTM	(1 000, Dictionary Dim, 2)	-
Softmax	(Output Units, Dictionary Dim)	-
Beam searcher	3	-

Structural change	THCHS-30 (Clean)	THCHS-30 (Noise)	AISHELL-1 (Clean)	AISHELL-1 (Noise)	ST-CMDS (Clean)	ST-CMDS (Noise)
Base structure	0.320	0.370	0.320	0.400	0.450	0.580
+gated CNN	0.200	0.230	0.240	0.260	0.443	0.460
+50 hours	0.130	0.160	0.130	0.170	0.153	0.260
+skip connection	0.180	0.220	0.220	0.240	0.430	0.400
+new workers	0.160	0.140	0.120	0.140	0.150	0.200

Feature extractor	THCHS-30 (Clean)	THCHS-30 (Noise)	AISHELL-1 (Clean)	AISHELL-1 (Noise)	ST-CMDS (Clean)	ST-CMDS (Noise)
MFCC	0.280	0.310	0.190	0.230	0.201	0.450
FBANK	0.300	0.400	0.200	0.300	0.300	0.500
WAVE	0.320	0.430	0.210	0.360	0.370	0.580
GSDNet+ (Supervised)	0.120	0.150	0.130	0.156	0.152	0.260
GSDNet+ (Finetuned)	0.110	0.130	0.120	0.146	0.142	0.200
GSDNet+ (Frozen)	0.123	0.160	0.126	0.160	0.150	0.270

Algorithm	THCHS-30 (Clean)	THCHS-30 (Noise)	AISHELL-1 (Clean)	AISHELL-1 (Noise)	ST-CMDS (Clean)	ST-CMDS (Noise)
Baseline	0.170	0.180	0.200	0.270	0.250	0.450
LAS	0.150	0.160	0.160	0.190	0.201	0.443
CTC	0.130	0.156	0.140	0.160	0.160	0.420
GSDNet	0.120	0.150	0.130	0.156	0.152	0.260

Training method	Word error rate
Linear	0.4
Cross	0.2
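
For readers unfamiliar with the metric in this table, word error rate is conventionally the edit distance between the reference and the recognized token sequence, divided by the reference length. The sketch below is a generic implementation of that standard definition, not the authors' evaluation script. The function names and the example sentence are illustrative assumptions, and for Chinese the tokens would typically be characters or segmented words.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance counting substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by the number of reference tokens."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Toy example: one deleted word and one substituted word in five.
ref = "open the left door now".split()
hyp = "open left right now".split()
wer = word_error_rate(ref, hyp)
print(wer)  # 0.4
```

Under this definition, a value of 0.2 means roughly one edit is needed per five reference tokens.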