End-to-end Chinese speech recognition method with byte-level byte pair encoding
Qiang FU, Zhenping XU, Wenxing SHENG, Qing YE
Journal of Computer Applications, 2025, 45(1): 318-324. DOI: 10.11772/j.issn.1001-9081.2023121878

To address the large vocabulary size and low training efficiency of speech recognition for languages with large, complex character sets such as Chinese, an end-to-end Chinese speech recognition method based on Byte-Level Byte Pair Encoding (BBPE) was proposed. First, the vocabulary was initialized with the 256 possible byte values. Then, the frequency of adjacent vocabulary-unit pairs in the corpus was counted, and the most frequent pair was merged into a new unit. This process was repeated until no further merging was possible, yielding the final vocabulary. On the Chinese speech dataset AISHELL-1, the vocabulary generated by this method contains 88.5% fewer units than the character-level vocabulary, which lowers the complexity of model training. Moreover, given the strong performance of the Conformer-Transducer (Conformer-T) model in end-to-end speech recognition, the recently proposed Zipformer model was combined with the Transducer model to form the Zipformer-Transducer (Zipformer-T) model for better recognition performance, and the BBPE method was validated on this model. Experimental results show that, compared with character-level tokenization, the Zipformer-T model with BBPE reduces the Character Error Rate (CER) by 0.12 and 0.08 percentage points on the AISHELL-1 test set and validation set, respectively, reaching the lowest CERs of 4.26% and 3.98%, which demonstrates the effectiveness of the method in improving Chinese speech recognition performance.
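The vocabulary construction described in the abstract can be illustrated with a short sketch. This is a minimal example of generic byte-level BPE training, not the authors' implementation: the function names `train_bbpe` and `merge_pair`, the `num_merges` budget, and the toy corpus are assumptions introduced here for illustration only.

```python
from collections import Counter


def merge_pair(seq, left, right):
    """Replace every adjacent (left, right) occurrence in seq with left + right."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bbpe(corpus, num_merges):
    """Learn byte-level BPE merges from a list of strings (illustrative sketch)."""
    # Each sentence becomes a list of single-byte tokens of its UTF-8 encoding.
    sequences = [[bytes([b]) for b in text.encode("utf-8")] for text in corpus]
    # The base vocabulary holds all 256 possible byte values.
    vocab = [bytes([i]) for i in range(256)]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs over the whole corpus.
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break  # no adjacent pairs remain, so no further merging is possible
        (left, right), _count = pairs.most_common(1)[0]
        # Merge the most frequent pair everywhere and record the new unit.
        sequences = [merge_pair(seq, left, right) for seq in sequences]
        merges.append((left, right))
        vocab.append(left + right)
    return vocab, merges


# Toy corpus (hypothetical); each Chinese character occupies 3 bytes in UTF-8.
vocab, merges = train_bbpe(["你好你好", "你好"], num_merges=5)
print(len(vocab), [unit.decode("utf-8", errors="replace") for unit in vocab[256:]])
```

Because merges operate on bytes rather than characters, intermediate units may be partial characters that do not decode on their own, which is why the print statement decodes with `errors="replace"`. In practice the number of merges is usually capped by a target vocabulary size rather than run to exhaustion; the abstract does not state the stopping budget used, so `num_merges` here is only a placeholder.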
