Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (10): 3217-3223. DOI: 10.11772/j.issn.1001-9081.2021050808
Special Issue: Multimedia Computing and Computer Simulation
Caitong BAI1,2, Xiaolong CUI2,3, Huiji ZHENG1,2, Ai LI1,2
Received: 2021-05-20
Revised: 2021-09-13
Accepted: 2021-09-22
Online: 2022-10-14
Published: 2022-10-10
Contact: Xiaolong CUI
About author: BAI Caitong, born in 1995, M. S. candidate. His research interests include deep edge intelligence and robust speech recognition.
Supported by:
CLC Number:
Caitong BAI, Xiaolong CUI, Huiji ZHENG, Ai LI. Robust speech recognition technology based on self-supervised knowledge transfer[J]. Journal of Computer Applications, 2022, 42(10): 3217-3223.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2021050808
| Module | Size parameters | Parameters/10⁶ |
|---|---|---|
| Input tensor | (30,1,32 000) | — |
| Gated Block 1 | (1,64,1,1) | 64 |
| Gated Block 2 | (64,64,20,10) | 4 096 |
| Gated Block 3 | (64,128,11,2) | 8 192 |
| Gated Block 4 | (128,128,11,1) | 16 384 |
| Gated Block 5 | (128,256,11,2) | 32 768 |
| Gated Block 6 | (256,256,11,1) | 65 536 |
| Gated Block 7 | (256,512,11,2) | 131 072 |
| Gated Block 8 | (512,512,11,2) | 262 144 |
| LSTM | (512) | — |
| MFCC | (1,256) | — |
| FBANK | (1,256) | — |
| WAVE | (1,256) | — |
Tab. 1 Feature extraction front-end network parameters
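Tab. 1 lists each gated block of the front-end by a tuple that can be read as (in_channels, out_channels, kernel_size, stride). The following is a minimal PyTorch sketch of such a gated 1-D convolution block under that reading; it is illustrative only, not the authors' released code, and the class name `GatedBlock1d` is chosen here for the example.

```python
import torch
import torch.nn as nn

class GatedBlock1d(nn.Module):
    """Illustrative gated 1-D convolution: output = conv(x) * sigmoid(gate(x))."""
    def __init__(self, in_ch, out_ch, kernel, stride):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, stride)
        self.gate = nn.Conv1d(in_ch, out_ch, kernel, stride)

    def forward(self, x):
        return self.conv(x) * torch.sigmoid(self.gate(x))

# Rough shape check with the Tab. 1 input tensor (batch 30, 1 channel, 32 000 samples)
x = torch.randn(30, 1, 32000)
y = GatedBlock1d(1, 64, 1, 1)(x)      # Gated Block 1: (1,64,1,1)
y = GatedBlock1d(64, 64, 20, 10)(y)   # Gated Block 2: (64,64,20,10)
print(y.shape)                        # torch.Size([30, 64, 3199])
```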
| Module | Size parameters | Parameters/10⁶ |
|---|---|---|
| Input tensor | (64,161,601) | 6.0 |
| Gated Block 1 | (161,500,48,2,97) | 3.8 |
| 7×Gated Block 2 | (250,500,7,1) | 6.1 |
| Gated Block 3 | (250,2 000,32,1) | 16.0 |
| Gated Block 4 | (1 000,2 000,1,1) | 2.0 |
| Conv1d | (1 000,Output Units,1,1) | — |
| Intermediate tensor | (64,1 000,Output Units) | — |
| LSTM | (1 000,Dictionary Dim,2) | — |
| Softmax | (Output Units,Dictionary Dim) | — |
| Beam searcher | 3 | — |
Tab. 2 Speech recognition back-end parameters
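In Tab. 2 each gated block's output channel count (500, 2 000) is halved before the next block consumes it (250, 1 000), which is consistent with a gated linear unit (GLU) that splits the convolution output into a value half and a gate half. The sketch below is a minimal illustration under that assumption; the fifth entry of Gated Block 1 is read here as padding, and the layer names are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

class GLUConv1d(nn.Module):
    """Illustrative back-end block: 1-D convolution followed by a GLU that
    halves the channel dimension (one way to read 500 -> 250 in Tab. 2)."""
    def __init__(self, in_ch, out_ch, kernel, stride, padding=0):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, stride, padding)
        self.glu = nn.GLU(dim=1)  # splits channels into value and gate halves

    def forward(self, x):
        return self.glu(self.conv(x))

x = torch.randn(64, 161, 601)            # Tab. 2 input tensor
x = GLUConv1d(161, 500, 48, 2, 97)(x)    # Gated Block 1 -> 250 channels
for _ in range(7):                       # 7 × Gated Block 2
    x = GLUConv1d(250, 500, 7, 1)(x)
print(x.shape)
```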
| Structural change | THCHS-30 Clean | THCHS-30 Noise | AISHELL-1 Clean | AISHELL-1 Noise | ST-CMDS Clean | ST-CMDS Noise |
|---|---|---|---|---|---|---|
| Base structure | 0.320 | 0.370 | 0.320 | 0.400 | 0.450 | 0.580 |
| +gated CNN | 0.200 | 0.230 | 0.240 | 0.260 | 0.443 | 0.460 |
| +50 hours | 0.130 | 0.160 | 0.130 | 0.170 | 0.153 | 0.260 |
| +skip connection | 0.180 | 0.220 | 0.220 | 0.240 | 0.430 | 0.400 |
| +new workers | 0.160 | 0.140 | 0.120 | 0.140 | 0.150 | 0.200 |
Tab. 3 Experimental results of model performance (word error rate) as affected by the artificial knowledge transfer module
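All results in Tab. 3 to Tab. 6 are word error rates (WER): the minimum number of word substitutions, deletions, and insertions needed to turn the recognized text into the reference transcript, divided by the number of reference words. The sketch below shows the standard edit-distance computation of this metric; it is illustrative only and not tied to the paper's evaluation code.

```python
def word_error_rate(ref_words, hyp_words):
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed with a standard Levenshtein dynamic program."""
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref_words)][len(hyp_words)] / len(ref_words)

# One word dropped out of a four-word reference -> WER = 0.25
print(word_error_rate("今天 天气 很 好".split(), "今天 天气 好".split()))
```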
| Feature extractor | THCHS-30 Clean | THCHS-30 Noise | AISHELL-1 Clean | AISHELL-1 Noise | ST-CMDS Clean | ST-CMDS Noise |
|---|---|---|---|---|---|---|
| MFCC | 0.280 | 0.310 | 0.190 | 0.230 | 0.201 | 0.450 |
| FBANK | 0.300 | 0.400 | 0.200 | 0.300 | 0.300 | 0.500 |
| WAVE | 0.320 | 0.430 | 0.210 | 0.360 | 0.370 | 0.580 |
| GSDNet+(Supervised) | 0.120 | 0.150 | 0.130 | 0.156 | 0.152 | 0.260 |
| GSDNet+(Finetuned) | 0.110 | 0.130 | 0.120 | 0.146 | 0.142 | 0.200 |
| GSDNet+(Frozen) | 0.123 | 0.160 | 0.126 | 0.160 | 0.150 | 0.270 |
Tab. 4 Performance (word error rate) comparison of self-supervised feature extraction and manual feature extraction
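The last two GSDNet+ rows of Tab. 4 contrast transferring the self-supervised front-end with its weights fine-tuned on the downstream recognition task against keeping them frozen. The snippet below is a minimal PyTorch sketch of how such a switch is typically expressed; the helper name `attach_frontend` is chosen here for illustration and is not from the paper.

```python
import torch.nn as nn

def attach_frontend(frontend: nn.Module, finetune: bool) -> nn.Module:
    """Illustrative handling of a pretrained front-end before joint training:
    finetune=True keeps it trainable with the back-end, False freezes it."""
    for p in frontend.parameters():
        p.requires_grad = finetune
    if not finetune:
        frontend.eval()  # also fixes BatchNorm/Dropout behaviour when frozen
    return frontend
```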
| Training method | Word error rate |
|---|---|
| Linear | 0.4 |
| Cross | 0.2 |
Tab. 5 Word error rate comparison of linear and cross-training methods
| Algorithm | THCHS-30 Clean | THCHS-30 Noise | AISHELL-1 Clean | AISHELL-1 Noise | ST-CMDS Clean | ST-CMDS Noise |
|---|---|---|---|---|---|---|
| Baseline | 0.170 | 0.180 | 0.200 | 0.270 | 0.250 | 0.450 |
| LAS | 0.150 | 0.160 | 0.160 | 0.190 | 0.201 | 0.443 |
| CTC | 0.130 | 0.156 | 0.140 | 0.160 | 0.160 | 0.420 |
| GSDNet | 0.120 | 0.150 | 0.130 | 0.156 | 0.152 | 0.260 |
Tab. 6 Performance comparison of different algorithms