Journal of Computer Applications (sole official website) ›› 2022, Vol. 42 ›› Issue (10): 3217-3223. DOI: 10.11772/j.issn.1001-9081.2021050808

• Multimedia computing and computer simulation •

Robust speech recognition technology based on self-supervised knowledge transfer

Caitong BAI1,2, Xiaolong CUI2,3, Huiji ZHENG1,2, Ai LI1,2

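  1. Postgraduate Brigade, Engineering University of PAP, Xi’an 710086
    2. Counter-Terrorism Command Information Engineering Research Team, Engineering University of PAP, Xi’an 710086
    3. Urumqi Campus, Engineering University of PAP, Urumqi 830049
  • Received: 2021-05-20 Revised: 2021-09-13 Accepted: 2021-09-22 Online: 2022-10-14 Published: 2022-10-10
  • Corresponding author: CUI Xiaolong
  • About the authors: BAI Caitong (born 1995), male, from Jinan, Shandong, M. S. candidate; research interests: deep edge intelligence, robust speech recognition;
    ZHENG Huiji (born 1997), male, from Chongqing, M. S. candidate; research interest: edge computing;
    LI Ai (born 1997), female, from Shaoyang, Hunan, M. S.; research interest: artificial intelligence.
    First contact: CUI Xiaolong (born 1973), male, from Changfeng, Anhui, professor, Ph. D.; research interest: command information systems; 787942392@qq.com
  • Funding:
    National Natural Science Foundation of China (U1603261); Netcom Integration Project (LXJH-10(A)-09)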

Robust speech recognition technology based on self-supervised knowledge transfer

Caitong BAI1,2, Xiaolong CUI2,3, Huiji ZHENG1,2, Ai LI1,2   

  1. Postgraduate Brigade, Engineering University of PAP, Xi’an Shaanxi 710086, China
    2. Counter-Terrorism Command Information Engineering Research Team, Engineering University of PAP, Xi’an Shaanxi 710086, China
    3. Urumqi Campus of Engineering University of PAP, Urumqi Xinjiang 830049, China
  • Received: 2021-05-20 Revised: 2021-09-13 Accepted: 2021-09-22 Online: 2022-10-14 Published: 2022-10-10
  • Corresponding author: Xiaolong CUI
  • About the authors: BAI Caitong, born in 1995, M. S. candidate. His research interests include deep edge intelligence and robust speech recognition.
    CUI Xiaolong, born in 1973, Ph. D., professor. His research interests include command information systems.
    ZHENG Huiji, born in 1997, M. S. candidate. His research interests include edge computing.
    LI Ai, born in 1997, M. S. Her research interests include artificial intelligence.
    First contact: CUI Xiaolong, born in 1973, Ph. D., professor. His research interests include command information systems.
  • Supported by:
    National Natural Science Foundation of China (U1603261); Netcom Integration Project (LXJH-10(A)-09)

Abstract (translated from Chinese):

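To address the rising cost of labeling neural network training data and the way noise interference hinders performance gains in speech recognition systems, a model training algorithm for robust speech recognition based on self-supervised knowledge transfer is proposed. First, three hand-crafted features are extracted from the original speech samples in the preprocessing stage. Then, in the training stage, the high-level features generated by the feature extraction network are passed through three shallow networks, each of which fits one of the hand-crafted features extracted during preprocessing. At the same time, the feature extraction front end and the speech recognition back end are cross-trained, and their loss functions are merged. Finally, through gradient backpropagation, the feature extraction network learns to extract high-level features that better support denoised speech recognition, which achieves hand-crafted knowledge transfer and denoising while using the training data efficiently. In a military equipment control scenario, tests were run on noise-added versions of three open-source Chinese speech recognition datasets (THCHS-30, the AISHELL-1 dataset from 希尔贝壳, and ST-CMDS) and on a dataset of military equipment control commands. The experimental results show that the proposed training algorithm can reduce the word error rate to 0.12. It not only trains a robust speech recognition model but also improves the utilization of training samples through self-supervised knowledge transfer, and it can complete equipment control tasks.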
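The way the losses are merged can be pictured with a small sketch. The abstract does not name the three hand-crafted features, specify the regression loss, or give any weighting, so the feature names (`feature_a` to `feature_c`), the mean-squared-error loss, and the `aux_weight` parameter below are all illustrative assumptions rather than the authors' implementation.

```python
from typing import Dict, List


def mse(pred: List[float], target: List[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_loss(
    asr_loss: float,
    head_outputs: Dict[str, List[float]],
    handcrafted_targets: Dict[str, List[float]],
    aux_weight: float = 1.0,
) -> float:
    """Merge the recognition loss with the feature-fitting losses.

    Each shallow head is scored on how well it reproduces one hand-crafted
    feature; the weighted sum is an assumed way of merging the losses.
    """
    aux = sum(
        mse(head_outputs[name], handcrafted_targets[name])
        for name in handcrafted_targets
    )
    return asr_loss + aux_weight * aux


# Hypothetical outputs of the three shallow heads and their targets.
outputs = {"feature_a": [0.5, 1.0], "feature_b": [0.0, 0.0], "feature_c": [2.0]}
targets = {"feature_a": [0.5, 0.0], "feature_b": [0.0, 0.0], "feature_c": [1.0]}
print(combined_loss(2.0, outputs, targets, aux_weight=0.5))  # 2.75
```

In a real training loop the merged value would be backpropagated through both the shallow heads and the shared feature extraction network, which is how the hand-crafted knowledge would reach the front end.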

Key words: knowledge transfer, robust speech recognition, self-supervised learning, Chinese speech recognition, speech denoising

Abstract:

A training algorithm for robust speech recognition models based on self-supervised knowledge transfer was proposed to address two problems: the increasingly high cost of labeling neural network training data, and noise interference that hinders performance improvement of speech recognition systems. Firstly, three hand-crafted features were extracted from the original speech samples in the pre-processing stage. Then, in the training stage, the high-level features generated by the feature extraction network were passed through three shallow networks, each fitting one of the hand-crafted features extracted in the pre-processing stage. At the same time, the feature extraction front-end and the speech recognition back-end were cross-trained, and their loss functions were merged. Finally, through gradient back propagation, the feature extraction network learned to extract high-level features that are more conducive to denoised speech recognition, thereby realizing hand-crafted knowledge transfer and denoising while using the training data efficiently. In the application scenario of military equipment control, tests were carried out on noise-added versions of three open-source Chinese speech recognition datasets, THCHS-30 (TsingHua Continuous Chinese Speech), AISHELL-1 and ST-CMDS (Surfing Technology Commands), as well as on a military equipment control command dataset, and the word error rate of the proposed method was reduced to 0.12. Experimental results show that the proposed method can not only train robust speech recognition models, but also improve the utilization rate of training samples through self-supervised knowledge transfer, and can complete equipment control tasks.
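For reference, word error rate is conventionally the word-level edit distance between the recognized text and the reference transcript, divided by the number of reference words, so a value of 0.12 corresponds to about 12 errors per 100 reference words. The sketch below shows that standard definition. The function names and the sample command are hypothetical, and the paper's own scoring script and tokenization are not described in the abstract.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance over tokens (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis: i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by the number of reference tokens."""
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical control command and a recognizer output with two errors.
reference = "raise the antenna now".split()
hypothesis = "raise that antenna".split()
print(word_error_rate(reference, hypothesis))  # 0.5
```

For Chinese transcripts the same function can be applied to whatever token sequence the evaluation uses, for example segmented words or individual characters.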

Key words: knowledge transfer, robust speech recognition, self-supervised learning, Chinese speech recognition, speech denoising

CLC Number: