| 1 | JELINEK F. Continuous speech recognition by statistical methods [J]. Proceedings of the IEEE, 1976, 64(4): 532-556. |
| 2 | HANNUN A, CASE C, CASPER J, et al. Deep Speech: scaling up end-to-end speech recognition [EB/OL]. [2023-03-04]. |
| 3 | GRAVES A, FERNÁNDEZ S, GOMEZ F, et al. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks [C]// Proceedings of the 23rd International Conference on Machine Learning. New York: ACM, 2006: 369-376. |
| 4 | SHAN C, ZHANG J, WANG Y, et al. Attention-based end-to-end speech recognition on voice search [C]// Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2018: 4764-4768. |
| 5 | GRAVES A. Sequence transduction with recurrent neural networks [EB/OL]. [2021-11-08]. |
| 6 | GONG C, TAN X, HE D, et al. Sentence-wise smooth regularization for sequence to sequence learning [C]// Proceedings of the 33rd AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2019: 6449-6456. |
| 7 | SENNRICH R, HADDOW B, BIRCH A. Neural machine translation of rare words with subword units [C]// Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL, 2016: 1715-1725. |
| 8 | BAZZI I. Modeling out-of-vocabulary words for robust speech recognition [D]. Cambridge: Massachusetts Institute of Technology, 2002: 47-79. |
| 9 | WANG C, CHO K, GU J. Neural machine translation with byte-level subwords [C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2020: 9154-9160. |
| 10 | DENG L, HSIAO R, GHOSHAL A. Bilingual end-to-end ASR with byte-level subwords [C]// Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2022: 6417-6421. |
| 11 | YAO Z, GUO L, YANG X, et al. Zipformer: a faster and better encoder for automatic speech recognition [EB/OL]. [2024-06-20]. |
| 12 | SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks [C]// Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. Cambridge: MIT Press, 2014: 3104-3112. |
| 13 | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010. |
| 14 | BU H, DU J, NA X, et al. AISHELL-1: an open-source Mandarin speech corpus and a speech recognition baseline [C]// Proceedings of the 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment. Piscataway: IEEE, 2017: 1-5. |
| 15 | GULATI A, QIN J, CHIU C C, et al. Conformer: convolution-augmented transformer for speech recognition [C]// Proceedings of the INTERSPEECH 2020. [S.l.]: International Speech Communication Association, 2020: 5036-5040. |
| 16 | XU H K, LU J K, ZHANG Z F, et al. Chinese speech recognition combining Conformer and N-gram [J]. Computer Systems and Applications, 2022, 31(7): 194-202. (in Chinese) |
| 17 | BA J L, KIROS J R, HINTON G E. Layer normalization [EB/OL]. [2023-05-03]. |
| 18 | KINGMA D P, BA J L. Adam: a method for stochastic optimization [EB/OL]. [2023-04-18]. |
| 19 | GHODSI M, LIU X, APFEL J, et al. RNN-Transducer with stateless prediction network [C]// Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2020: 7049-7053. |
| 20 | CHEN G, XIE X K, SUN J, et al. Hybrid CTC/Attention end-to-end Chinese speech recognition enhanced by Conformer [J]. Computer Engineering and Applications, 2023, 59(4): 97-103. (in Chinese) |
| 21 | POVEY D, GHOSHAL A, BOULIANNE G, et al. The Kaldi speech recognition toolkit [EB/OL]. [2023-04-18]. |
| 22 | Hangzhou Dianzi University. Method for voiceprint recognition based on fusion of Fbank features and MFCC features: 202110586134.6 [P]. 2021-09-14. (in Chinese) |
| 23 | KO T, PEDDINTI V, POVEY D, et al. Audio augmentation for speech recognition [C]// Proceedings of the INTERSPEECH 2015. [S.l.]: International Speech Communication Association, 2015: 3586-3589. |
| 24 | PARK D S, CHAN W, ZHANG Y, et al. SpecAugment: a simple data augmentation method for automatic speech recognition [C]// Proceedings of the INTERSPEECH 2019. [S.l.]: International Speech Communication Association, 2019: 2613-2617. |
| 25 | MICIKEVICIUS P, NARANG S, ALBEN J, et al. Mixed precision training [EB/OL]. [2023-06-18]. |
| 26 | JAIN M, SCHUBERT K, MAHADEOKAR J, et al. RNN-T for latency controlled ASR with improved beam search [EB/OL]. [2023-10-21]. |
| 27 | WAIBEL A, HANAZAWA T, HINTON G, et al. Phoneme recognition using time-delay neural networks [M]// CHAUVIN Y, RUMELHART D E. Backpropagation: theory, architectures, and applications. New York: Psychology Press, 1995: 35-61. |