Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (1): 318-324. DOI: 10.11772/j.issn.1001-9081.2023121878

• Multimedia Computing and Computer Simulation •


End-to-end Chinese speech recognition method with byte-level byte pair encoding

Qiang FU, Zhenping XU, Wenxing SHENG, Qing YE

  1. School of Computer Science, Yangtze University, Jingzhou, Hubei 434023, China
  • Received:2024-01-10 Revised:2024-04-25 Accepted:2024-05-07 Online:2024-05-21 Published:2025-01-10
  • Contact: Zhenping XU
  • About author: FU Qiang, born in 1999 in Guangshui, Hubei, M. S. candidate. His research interests include deep learning and speech recognition.
    SHENG Wenxing, born in 1999 in Yichang, Hubei, M. S. candidate. His research interests include computer vision and visual question answering.
    YE Qing, born in 1983 in Jingzhou, Hubei, Ph. D., associate professor. Her research interests include artificial intelligence, deep learning, and intelligent condition monitoring.
  • Supported by:
    Natural Science Foundation of Hubei Province(2023AFB909)


Abstract:

To address the problems of overly large vocabularies and low training efficiency in speech recognition for languages with complex and large character sets such as Chinese, an end-to-end Chinese speech recognition method based on Byte-Level Byte Pair Encoding (BBPE) was proposed. Firstly, 256 distinct bytes were used to initialize the vocabulary. Then, the frequency of each vocabulary unit in the corpus was counted, and the pair of units with the highest frequency was merged. Finally, this step was repeated until no further merging was possible, yielding the final vocabulary. On the Chinese speech dataset AISHELL-1, the vocabulary generated by this method is 88.5% smaller than the character-level vocabulary, thereby lowering the complexity of model training. Moreover, given the outstanding performance of the Conformer-Transducer (Conformer-T) model in end-to-end speech recognition, the recent Zipformer model was combined with the Transducer model to form the Zipformer-Transducer (Zipformer-T) model for better recognition performance, and the BBPE method was validated on this model. Experimental results show that, compared with character-level tokenization, the BBPE method reduces the Character Error Rate (CER) of the Zipformer-T model by 0.12 and 0.08 percentage points on the AISHELL-1 test and validation sets respectively, reaching the lowest CERs of 4.26% and 3.98%, which demonstrates that the proposed method effectively improves Chinese speech recognition performance.
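The vocabulary-construction procedure summarized above (initialize with the 256 possible byte values, count adjacent-unit frequencies over the UTF-8 byte stream, merge the most frequent pair, repeat) can be sketched as follows. This is a minimal illustrative implementation of byte-level BPE training, not the authors' code; the function name `train_bbpe` and the fixed merge budget `num_merges` (standing in for "repeat until no further merging is possible") are assumptions for the sketch.

```python
from collections import Counter

def train_bbpe(corpus, num_merges):
    """Minimal byte-level BPE training sketch.

    corpus: list of strings; num_merges: number of merge steps to attempt.
    Returns (vocabulary, tokenized corpus).
    """
    # Step 1: initialize the vocabulary with all 256 possible byte values.
    vocab = {bytes([b]) for b in range(256)}
    # Represent each sentence as a sequence of single-byte units (UTF-8),
    # so Chinese characters start out as 3 byte-level units each.
    sequences = [[bytes([b]) for b in s.encode("utf-8")] for s in corpus]

    for _ in range(num_merges):
        # Step 2: count the frequency of each adjacent pair of units.
        pairs = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:  # Step 3: stop when nothing is left to merge.
            break
        best = max(pairs, key=pairs.get)
        merged = best[0] + best[1]
        vocab.add(merged)
        # Replace every occurrence of the best pair with the merged unit.
        new_sequences = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_sequences.append(out)
        sequences = new_sequences
    return vocab, sequences
```

Because merged units are byte strings rather than whole characters, the final vocabulary can stay far smaller than a character-level inventory while still covering any Chinese text.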

Key words: speech recognition, Conformer, Zipformer, Byte-level Byte Pair Encoding (BBPE), end-to-end
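For context, the Character Error Rate (CER) reported in the abstract is the character-level Levenshtein (edit) distance between the recognized hypothesis and the reference transcript, normalized by the reference length. The `cer` helper below is a generic textbook sketch of that metric, not code from the paper.

```python
def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: Levenshtein distance over reference length."""
    m, n = len(ref), len(hyp)
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / m if m else 0.0
```

For Chinese, each character counts as one unit, so one substituted character in a four-character reference gives a CER of 25%.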

CLC number: