Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (1): 318-324. DOI: 10.11772/j.issn.1001-9081.2023121878

• Multimedia Computing and Computer Simulation •


End-to-end Chinese speech recognition method with byte-level byte pair encoding

Qiang FU, Zhenping XU, Wenxing SHENG, Qing YE

  1. School of Computer Science, Yangtze University, Jingzhou, Hubei 434023, China
  • Received:2024-01-10 Revised:2024-04-25 Accepted:2024-05-07 Online:2024-05-21 Published:2025-01-10
  • Contact: Zhenping XU
  • About author: FU Qiang, born in 1999 in Guangshui, Hubei, M. S. candidate. His research interests include deep learning and speech recognition.
    SHENG Wenxing, born in 1999 in Yichang, Hubei, M. S. candidate. His research interests include computer vision and visual question answering.
    YE Qing, born in 1983 in Jingzhou, Hubei, Ph. D., associate professor. Her research interests include artificial intelligence, deep learning, and intelligent condition monitoring.
  • Supported by:
    Natural Science Foundation of Hubei Province(2023AFB909)


Abstract:

To address the problems of overly large vocabularies and low training efficiency in speech recognition for languages with complex and large character sets such as Chinese, an end-to-end Chinese speech recognition method based on Byte-Level Byte Pair Encoding (BBPE) was proposed. Firstly, 256 distinct bytes were used to initialize the vocabulary. Then, the frequency of each vocabulary unit in the corpus was counted, and the pair of units with the highest frequency was merged. Finally, this step was repeated until no further merging was possible, yielding the final vocabulary. On the Chinese speech dataset AISHELL-1, the vocabulary generated by this method is 88.5% smaller than the character-level vocabulary, thereby lowering the complexity of model training. Moreover, given the outstanding performance of the Conformer-Transducer (Conformer-T) model in end-to-end speech recognition, the recent Zipformer model was combined with the Transducer model to form the Zipformer-Transducer (Zipformer-T) model for better recognition performance, and the BBPE method was validated on this model. Experimental results show that, compared with character-level tokenization, the BBPE method reduces the Character Error Rate (CER) of the Zipformer-T model by 0.12 and 0.08 percentage points on the AISHELL-1 test and validation sets respectively, reaching the lowest CERs of 4.26% and 3.98%, which demonstrates that the proposed method effectively improves Chinese speech recognition performance.
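The vocabulary-construction procedure summarized above (initialize with the 256 possible byte values, count adjacent-unit frequencies over the UTF-8 byte stream, merge the most frequent pair, repeat) can be sketched as follows. This is a minimal illustrative implementation of byte-level BPE training, not the authors' code; the function name `train_bbpe` and the fixed merge budget `num_merges` (standing in for "repeat until no further merging is possible") are assumptions for the sketch.

```python
from collections import Counter

def train_bbpe(corpus, num_merges):
    """Minimal byte-level BPE training sketch.

    corpus: list of strings; num_merges: number of merge steps to attempt.
    Returns (vocabulary, tokenized corpus).
    """
    # Step 1: initialize the vocabulary with all 256 possible byte values.
    vocab = {bytes([b]) for b in range(256)}
    # Represent each sentence as a sequence of single-byte units (UTF-8),
    # so Chinese characters start out as 3 byte-level units each.
    sequences = [[bytes([b]) for b in s.encode("utf-8")] for s in corpus]

    for _ in range(num_merges):
        # Step 2: count the frequency of each adjacent pair of units.
        pairs = Counter()
        for seq in sequences:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:  # Step 3: stop when nothing is left to merge.
            break
        best = max(pairs, key=pairs.get)
        merged = best[0] + best[1]
        vocab.add(merged)
        # Replace every occurrence of the best pair with the merged unit.
        new_sequences = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_sequences.append(out)
        sequences = new_sequences
    return vocab, sequences
```

Because merged units are byte strings rather than whole characters, the final vocabulary can stay far smaller than a character-level inventory while still covering any Chinese text.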

Key words: speech recognition, Conformer, Zipformer, Byte-level Byte Pair Encoding (BBPE), end-to-end
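For context, the Character Error Rate (CER) reported in the abstract is the character-level Levenshtein (edit) distance between the recognized hypothesis and the reference transcript, normalized by the reference length. The `cer` helper below is a generic textbook sketch of that metric, not code from the paper.

```python
def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: Levenshtein distance over reference length."""
    m, n = len(ref), len(hyp)
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / m if m else 0.0
```

For Chinese, each character counts as one unit, so one substituted character in a four-character reference gives a CER of 25%.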

CLC number: