Journal of Computer Applications, 2022, Vol. 42, Issue (10): 3217-3223. DOI: 10.11772/j.issn.1001-9081.2021050808
Special Issue: Multimedia Computing and Computer Simulation

Robust speech recognition technology based on self-supervised knowledge transfer
Caitong BAI1,2, Xiaolong CUI2,3, Huiji ZHENG1,2, Ai LI1,2
Received: 2021-05-20
Revised: 2021-09-13
Accepted: 2021-09-22
Online: 2022-10-14
Published: 2022-10-10

Contact: Xiaolong CUI

About author: BAI Caitong, born in 1995 in Jinan, Shandong, M.S. candidate. His research interests include deep edge intelligence and robust speech recognition.
Caitong BAI, Xiaolong CUI, Huiji ZHENG, Ai LI. Robust speech recognition technology based on self-supervised knowledge transfer[J]. Journal of Computer Applications, 2022, 42(10): 3217-3223.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2021050808
| Module | Size parameters | Parameters/10⁶ |
|---|---|---|
| Input tensor | (30, 1, 32 000) | — |
| Gated Block 1 | (1, 64, 1, 1) | 64 |
| Gated Block 2 | (64, 64, 20, 10) | 4 096 |
| Gated Block 3 | (64, 128, 11, 2) | 8 192 |
| Gated Block 4 | (128, 128, 11, 1) | 16 384 |
| Gated Block 5 | (128, 256, 11, 2) | 32 768 |
| Gated Block 6 | (256, 256, 11, 1) | 65 536 |
| Gated Block 7 | (256, 512, 11, 2) | 131 072 |
| Gated Block 8 | (512, 512, 11, 2) | 262 144 |
| LSTM | (512) | — |
| MFCC | (1, 256) | — |
| FBANK | (1, 256) | — |
| WAVE | (1, 256) | — |
Tab. 1 Feature extraction front-end network parameters
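The "Gated Block" modules listed above follow the gated-CNN design, in which one feature path is modulated element-wise by a sigmoid gate computed from a parallel path. As an illustration only, the gating operation itself can be sketched as follows; the convolutions that produce the two paths are omitted, and the name `gated_unit` is ours, not the paper's:

```python
import math

def sigmoid(z):
    """Logistic sigmoid used as the gate activation."""
    return 1.0 / (1.0 + math.exp(-z))

def gated_unit(features, gates):
    """Element-wise gated activation: each feature value is scaled by a
    sigmoid gate.  In the full network both inputs would be the outputs
    of two parallel convolutions over the same input."""
    return [f * sigmoid(g) for f, g in zip(features, gates)]

# a gate logit of 0 lets exactly half of each feature value through
print(gated_unit([2.0, -4.0], [0.0, 0.0]))  # → [1.0, -2.0]
```

The gate lets the network learn, per position, how much of each feature to pass on, which is the mechanism behind the "+gated cnn" rows in Tab. 3.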
| Module | Size parameters | Parameters/10⁶ |
|---|---|---|
| Input tensor | (64, 161, 601) | 6.0 |
| Gated Block 1 | (161, 500, 48, 2, 97) | 3.8 |
| 7×Gated Block 2 | (250, 500, 7, 1) | 6.1 |
| Gated Block 3 | (250, 2 000, 32, 1) | 16.0 |
| Gated Block 4 | (1 000, 2 000, 1, 1) | 2.0 |
| Conv1d | (1 000, Output Units, 1, 1) | — |
| Intermediate tensor | (64, 1 000, Output Units) | — |
| LSTM | (1 000, Dictionary Dim, 2) | — |
| Softmax | (Output Units, Dictionary Dim) | — |
| Beam searcher | 3 | — |
Tab. 2 Speech recognition back-end parameters
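The final row of the back-end is a beam searcher with beam width 3: at each decoding step it keeps only the 3 highest-scoring partial hypotheses. As an illustration only (the actual decoder runs over the LSTM/softmax outputs, and `beam_search` is our hypothetical name), a minimal beam search over per-step token log-probabilities can be sketched as:

```python
import math

def beam_search(log_probs, beam_width=3):
    """Minimal beam search.  log_probs is a list, one dict per time
    step, mapping token -> log-probability.  Returns the highest-scoring
    token sequence found with the given beam width."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for step in log_probs:
        candidates = []
        for seq, score in beams:
            for tok, lp in step.items():
                candidates.append((seq + [tok], score + lp))
        # keep only the beam_width best partial hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]

steps = [{"a": math.log(0.6), "b": math.log(0.4)},
         {"a": math.log(0.3), "b": math.log(0.7)}]
print(beam_search(steps))  # → ['a', 'b']
```

With beam width 1 this degenerates to greedy decoding; width 3 trades a small amount of extra computation for recovering sequences whose first token is not locally optimal.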
| Structural change | THCHS-30 Clean | THCHS-30 Noise | AISHELL-1 Clean | AISHELL-1 Noise | ST-CMDS Clean | ST-CMDS Noise |
|---|---|---|---|---|---|---|
| Base structure | 0.320 | 0.370 | 0.320 | 0.400 | 0.450 | 0.580 |
| +gated cnn | 0.200 | 0.230 | 0.240 | 0.260 | 0.443 | 0.460 |
| +50 hours | 0.130 | 0.160 | 0.130 | 0.170 | 0.153 | 0.260 |
| +skip connection | 0.180 | 0.220 | 0.220 | 0.240 | 0.430 | 0.400 |
| +new workers | 0.160 | 0.140 | 0.120 | 0.140 | 0.150 | 0.200 |
Tab. 3 Effect of the artificial knowledge transfer module on model performance (word error rate)
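The word error rates reported in Tabs. 3-6 are the standard metric: the word-level edit distance (substitutions + insertions + deletions) between the recognized and reference transcripts, divided by the reference length. A minimal sketch (the function name `wer` is ours):

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance over word sequences,
    normalized by the reference length."""
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)

# one substitution in a three-word reference → WER = 1/3
print(wer("the cat sat".split(), "the bat sat".split()))
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why the noisy-condition columns are directly comparable only against the same reference set.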
| Feature extractor | THCHS-30 Clean | THCHS-30 Noise | AISHELL-1 Clean | AISHELL-1 Noise | ST-CMDS Clean | ST-CMDS Noise |
|---|---|---|---|---|---|---|
| MFCC | 0.280 | 0.310 | 0.190 | 0.230 | 0.201 | 0.450 |
| FBANK | 0.300 | 0.400 | 0.200 | 0.300 | 0.300 | 0.500 |
| WAVE | 0.320 | 0.430 | 0.210 | 0.360 | 0.370 | 0.580 |
| GSDNet+(Supervised) | 0.120 | 0.150 | 0.130 | 0.156 | 0.152 | 0.260 |
| GSDNet+(Finetuned) | 0.110 | 0.130 | 0.120 | 0.146 | 0.142 | 0.200 |
| GSDNet+(Frozen) | 0.123 | 0.160 | 0.126 | 0.160 | 0.150 | 0.270 |
Tab. 4 Performance (word error rate) comparison of self-supervised feature extraction and manual feature extraction
| Training method | Word error rate |
|---|---|
| Linear | 0.4 |
| Cross | 0.2 |
Tab. 5 Word error rate comparison of linear and cross-training methods
| Algorithm | THCHS-30 Clean | THCHS-30 Noise | AISHELL-1 Clean | AISHELL-1 Noise | ST-CMDS Clean | ST-CMDS Noise |
|---|---|---|---|---|---|---|
| Baseline | 0.170 | 0.180 | 0.200 | 0.270 | 0.250 | 0.450 |
| LAS | 0.150 | 0.160 | 0.160 | 0.190 | 0.201 | 0.443 |
| CTC | 0.130 | 0.156 | 0.140 | 0.160 | 0.160 | 0.420 |
| GSDNet | 0.120 | 0.150 | 0.130 | 0.156 | 0.152 | 0.260 |
Tab. 6 Performance comparison of different algorithms
References

[1] HE Y Z, SAINATH T N, PRABHAVALKAR R, et al. Streaming end-to-end speech recognition for mobile devices [C]// Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2019: 6381-6385. DOI: 10.1109/icassp.2019.8682336
[2] JUANG B H, RABINER L R. Hidden Markov models for speech recognition [J]. Technometrics, 1991, 33(3): 251-272. DOI: 10.1080/00401706.1991.10484833
[3] GRAVES A, SCHMIDHUBER J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures [J]. Neural Networks, 2005, 18(5/6): 602-610. DOI: 10.1016/j.neunet.2005.06.042
[4] HINTON G, DENG L, YU D, et al. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups [J]. IEEE Signal Processing Magazine, 2012, 29(6): 82-97. DOI: 10.1109/msp.2012.2205597
[5] CHAN W, JAITLY N, LE Q, et al. Listen, attend and spell: a neural network for large vocabulary conversational speech recognition [C]// Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2016: 4960-4964. DOI: 10.1109/icassp.2016.7472621
[6] GRAVES A, FERNÁNDEZ S, GOMEZ F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks [C]// Proceedings of the 23rd International Conference on Machine Learning. New York: ACM, 2006: 369-376. DOI: 10.1145/1143844.1143891
[7] GRAVES A. Sequence transduction with recurrent neural networks [EB/OL]. (2012-11-14) [2021-05-01]. DOI: 10.1007/978-3-642-24797-2_3
[8] JAITLY N, SUSSILLO D, LE Q V, et al. A neural transducer [EB/OL]. (2016-08-04) [2021-05-01].
[9] CHIU C C, RAFFEL C. Monotonic chunkwise attention [EB/OL]. (2018-02-23) [2021-05-01].
[10] ZHANG Z X, GEIGER J, POHJALAINEN J, et al. Deep learning for environmentally robust speech recognition: an overview of recent developments [J]. ACM Transactions on Intelligent Systems and Technology, 2018, 9(5): No.49. DOI: 10.1145/3178115
[11] BAI C T, GAO Z Q, LI A, et al. Research on voice recognition of military equipment control commands based on gated network [J]. Computer Engineering, 2021, 47(7): 301-306. DOI: 10.19678/j.issn.1000-3428.0058590
[12] ZHAO X J, SHAO Y, WANG D L. CASA-based robust speaker identification [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(5): 1608-1616. DOI: 10.1109/tasl.2012.2186803
[13] DAUPHIN Y N, FAN A, AULI M, et al. Language modeling with gated convolutional networks [C]// Proceedings of the 34th International Conference on Machine Learning. New York: JMLR.org, 2017: 933-941.
[14] RAVANELLI M, OMOLOGO M. Contaminated speech training methods for robust DNN-HMM distant speech recognition [C]// Proceedings of Interspeech 2015. [S.l.]: International Speech Communication Association, 2015: 756-760. DOI: 10.21437/interspeech.2015-251
[15] RAVANELLI M, ZHONG J Y, PASCUAL S, et al. Multi-task self-supervised learning for robust speech recognition [C]// Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2020: 6989-6993. DOI: 10.1109/icassp40776.2020.9053569
[16] ALLEN J B, BERKLEY D A. Image method for efficiently simulating small-room acoustics [J]. The Journal of the Acoustical Society of America, 1979, 65(4): 943-950. DOI: 10.1121/1.382599
[17] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition [C]// Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778. DOI: 10.1109/cvpr.2016.90
[18] HOCHREITER S, SCHMIDHUBER J. Long short-term memory [J]. Neural Computation, 1997, 9(8): 1735-1780. DOI: 10.1162/neco.1997.9.8.1735
[19] POLS L C W. Spectral analysis and identification of Dutch vowels in monosyllabic words [D]. Amsterdam: University of Amsterdam, 1977: 152.
[20] KINGMA D P, BA J L. Adam: a method for stochastic optimization [EB/OL]. (2017-01-30) [2021-05-01].
[21] PASZKE A, GROSS S, MASSA F, et al. PyTorch: an imperative style, high-performance deep learning library [C/OL]// Proceedings of the 33rd Conference on Neural Information Processing Systems. [2021-05-01]. DOI: 10.7551/mitpress/11474.003.0014
[22] WANG D, ZHANG X W. THCHS-30: a free Chinese speech corpus [EB/OL]. (2015-12-10) [2021-05-01].
[23] BU H, DU J Y, NA X Y, et al. AISHELL-1: an open-source Mandarin speech corpus and a speech recognition baseline [C]// Proceedings of the 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment. Piscataway: IEEE, 2017: 1-5. DOI: 10.1109/icsda.2017.8384449
[24] ST-CMDS-20170001_1, Free ST Chinese Mandarin corpus [DS/OL]. [2021-05-01].
[25] KIM S, HORI T, WATANABE S. Joint CTC-attention based end-to-end speech recognition using multi-task learning [C]// Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2017: 4835-4839. DOI: 10.1109/icassp.2017.7953075
[26] KLAKOW D, PETERS J. Testing the correlation of word error rate and perplexity [J]. Speech Communication, 2002, 38(1/2): 19-28. DOI: 10.1016/s0167-6393(01)00041-3
[27] BA J L, KIROS J R, HINTON G E. Layer normalization [EB/OL]. (2016-07-21) [2021-05-01].
[28] HINTON G E, SRIVASTAVA N, KRIZHEVSKY A, et al. Improving neural networks by preventing co-adaptation of feature detectors [EB/OL]. (2012-07-03) [2021-05-01].