End-to-end Chinese speech recognition method with byte-level byte pair encoding

doi:10.11772/j.issn.1001-9081.2023121878

Journal of Computer Applications, 2025, Vol. 45, Issue (1): 318-324. DOI: 10.11772/j.issn.1001-9081.2023121878

• Multimedia computing and computer simulation •

Qiang FU, Zhenping XU, Wenxing SHENG, Qing YE

School of Computer Science, Yangtze University, Jingzhou Hubei 434023, China

Received: 2024-01-10 Revised: 2024-04-25 Accepted: 2024-05-07 Online: 2024-05-21 Published: 2025-01-10
Contact: Zhenping XU
About author: FU Qiang, born in 1999, native of Guangshui, Hubei, M.S. candidate. His research interests include deep learning and speech recognition.
SHENG Wenxing, born in 1999, native of Yichang, Hubei, M.S. candidate. His research interests include computer vision and visual question answering.
YE Qing, born in 1983, native of Jingzhou, Hubei, Ph.D., associate professor. Her research interests include artificial intelligence, deep learning, and intelligent condition monitoring.
Supported by:
Natural Science Foundation of Hubei Province (2023AFB909)

Abstract

To address the large vocabulary size and low training efficiency of speech recognition for languages with large, complex character sets such as Chinese, an end-to-end Chinese speech recognition method based on Byte-level Byte Pair Encoding (BBPE) was proposed. Firstly, the vocabulary was initialized with the 256 possible byte values. Then, the frequency of the vocabulary units in the corpus was counted, and the most frequent units were merged. Finally, this step was repeated until no further merging was possible, yielding the final vocabulary. On the Chinese speech dataset AISHELL-1, the vocabulary generated by this method was 88.5% smaller than the character-level vocabulary, which lowered the complexity of model training. Moreover, given the strong performance of the Conformer-Transducer (Conformer-T) model in end-to-end speech recognition, the recent Zipformer model was combined with the Transducer model to form the Zipformer-Transducer (Zipformer-T) model for better recognition performance, and the BBPE method was validated on this model. Experimental results show that, compared with character-level tokenization, the Zipformer-T model using BBPE reduces the Character Error Rate (CER) by 0.12 and 0.08 percentage points on the AISHELL-1 test set and validation set respectively, and reaches the lowest CERs of 4.26% and 3.98% respectively, which demonstrates that the method effectively improves Chinese speech recognition performance.
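The three vocabulary-building steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `train_bbpe` and `merge_pair` are hypothetical, merging is assumed to act on the most frequent pair of adjacent units (as in standard byte pair encoding), and the `num_merges` cap is an assumption added so that the loop can also stop at a target vocabulary size.

```python
from collections import Counter


def merge_pair(seq, left, right):
    """Replace each adjacent (left, right) occurrence in seq with left + right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bbpe(corpus, num_merges):
    """Learn up to num_merges merges from a list of strings (illustrative only)."""
    # Step 1: the initial vocabulary holds all 256 possible byte values.
    vocab = [bytes([i]) for i in range(256)]
    known = set(vocab)
    # Every sentence becomes a sequence of single UTF-8 bytes.
    sequences = [[bytes([b]) for b in text.encode("utf-8")] for text in corpus]
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent unit pairs over the whole corpus.
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break  # no adjacent pairs remain, so no further merging is possible
        (left, right), _ = pair_counts.most_common(1)[0]
        merges.append((left, right))
        if left + right not in known:
            known.add(left + right)
            vocab.append(left + right)
        # Step 3: apply the merge everywhere, then repeat.
        sequences = [merge_pair(seq, left, right) for seq in sequences]
    return vocab, merges


if __name__ == "__main__":
    vocab, merges = train_bbpe(["语音识别", "语音"], num_merges=3)
    print(len(vocab), merges[0])
```

Each Chinese character in the toy corpus occupies three UTF-8 bytes, so the first merges join bytes that belong to the same character before any multi-character unit can form.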

Key words: speech recognition, Conformer, Zipformer, Byte-level Byte Pair Encoding (BBPE), end-to-end

CLC Number:

TN912.34

Qiang FU, Zhenping XU, Wenxing SHENG, Qing YE. End-to-end Chinese speech recognition method with byte-level byte pair encoding[J]. Journal of Computer Applications, 2025, 45(1): 318-324.

Figures/Tables 9

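Parameter	Value	Note
num_encoder_layers	2, 2, 3, 4, 3, 2	Number of Zipformer blocks
downsampling_factor	1, 2, 4, 8, 4, 2	Downsampling factor
feedforward_dim	512, 768, 1024, 1536, 1024, 768	Feedforward layer dimension
num_heads	4, 4, 4, 8, 4, 4	Number of attention heads
encoder_dim	192, 256, 384, 512, 384, 256	Embedding layer dimension
cnn_module_kernel	31, 31, 15, 15, 15, 31	Convolution kernel size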
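Each parameter in the table above lists six values, one per encoder stack. The following Python sketch, offered only as an illustration, transcribes those values and derives two quantities from them: the total number of Zipformer blocks and the sequence length each stack would see for a given input. The names `ZIPFORMER_CONFIG`, `check_config`, `total_blocks`, and `frames_per_stack` are hypothetical, and the floor division is an assumption; a real implementation may round differently.

```python
# Per-stack encoder settings transcribed from the parameter table.
ZIPFORMER_CONFIG = {
    "num_encoder_layers": (2, 2, 3, 4, 3, 2),
    "downsampling_factor": (1, 2, 4, 8, 4, 2),
    "feedforward_dim": (512, 768, 1024, 1536, 1024, 768),
    "num_heads": (4, 4, 4, 8, 4, 4),
    "encoder_dim": (192, 256, 384, 512, 384, 256),
    "cnn_module_kernel": (31, 31, 15, 15, 15, 31),
}


def check_config(config):
    """Return the number of stacks, or raise if the lists disagree in length."""
    lengths = {len(values) for values in config.values()}
    if len(lengths) != 1:
        raise ValueError("every parameter must list one value per stack")
    return lengths.pop()


def total_blocks(config):
    """Sum the Zipformer blocks over all stacks."""
    return sum(config["num_encoder_layers"])


def frames_per_stack(config, num_frames):
    """Sequence length per stack, assuming plain floor division (illustrative)."""
    return [num_frames // factor for factor in config["downsampling_factor"]]


if __name__ == "__main__":
    print(check_config(ZIPFORMER_CONFIG))           # number of stacks
    print(total_blocks(ZIPFORMER_CONFIG))           # total Zipformer blocks
    print(frames_per_stack(ZIPFORMER_CONFIG, 800))  # hypothetical 800-frame input
```

The downsampling factors rise to 8 in the fourth stack and fall again, so the middle stacks, which have the largest dimensions, operate on the shortest sequences.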

Model	CER on test set/%	CER on validation set/%
Conformer-T + char	5.14	4.71
Conformer-T + BBPE	5.06	4.68
Zipformer-T + char	4.49	4.24
Zipformer-T + BBPE	4.37	4.16
Zipformer-T + BBPE + CTC loss	4.27	3.99

Model	CER on test set/%	CER on validation set/%
Zipformer-T + BBPE + CTC loss	4.27	3.99
Zipformer-T + BBPE + CTC loss (with RNNLM)	4.30	4.02
Zipformer-T + BBPE + CTC loss (with TransformerLM)	4.26	3.98
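The tables report CER without defining it on this page. As a reminder of the usual definition, the sketch below computes CER as the character-level edit distance (substitutions, deletions, insertions) divided by the number of reference characters, expressed in percent. It is an assumption that the paper's scoring follows this standard form; the function names `edit_distance` and `cer` and the example strings are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level character error rate in percent."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return 100.0 * errors / total


if __name__ == "__main__":
    # One substituted character out of four reference characters.
    print(cer(["语音识别"], ["语音是别"]))
```

Because the score is computed over characters rather than model tokens, systems with character-level and BBPE vocabularies can be compared directly once the BBPE output has been decoded back to text.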
