End-to-end speech recognition method based on prosodic features

doi:10.11772/j.issn.1001-9081.2022010009

Journal of Computer Applications, 2023, Vol. 43, Issue 2: 380-384. DOI: 10.11772/j.issn.1001-9081.2022010009

• Artificial intelligence •

Cong LIU¹, Genshun WAN¹, Jianqing GAO¹, Zhonghua FU²

¹ AI Institute, iFLYTEK Company Limited, Hefei, Anhui 230088, China
² Xi'an iFLYTEK Hyper-brain Information Technology Company Limited, Xi'an, Shaanxi 710000, China

Received: 2022-01-06 Revised: 2022-04-06 Accepted: 2022-04-11 Online: 2022-05-24 Published: 2023-02-10
Contact: Genshun WAN
About author: LIU Cong, born in 1984 in Tongling, Anhui, Ph.D., senior engineer, CCF member. His research interests include speech recognition and face recognition.
GAO Jianqing, born in 1983 in Huainan, Anhui, Ph.D., senior engineer, CCF member. His research interests include speech recognition and speech information processing.
FU Zhonghua, born in 1977 in Shiyan, Hubei, Ph.D., associate professor, CCF member. His research interests include hearing and audio, and speech information processing.
Supported by:
Scientific and Technological Innovation 2030 - Major Project of New Generation Artificial Intelligence (2020AAA0103600)

Abstract

In a traditional speech recognition system, the optimal decoding path is chosen by a data-driven language model and is therefore constrained by the training data. As a result, in some scenarios a correctly recognized pronunciation is still mapped to the wrong characters. To use the prosodic information in speech to raise the language-model probability of the correct character combination, an end-to-end speech recognition method based on prosodic features was proposed. Built on the attention-based encoder-decoder speech recognition framework, the method first uses the coefficient distribution of the attention mechanism to extract prosodic features such as pronunciation interval and pronunciation energy. The prosodic features are then combined with the decoder, which significantly improves recognition accuracy in cases of identical or similar pronunciation and of semantic ambiguity. Experimental results show that, compared with the baseline end-to-end speech recognition method, the proposed method achieves relative accuracy improvements of 5.2% and 5.0% on the 1 000 h and 10 000 h speech recognition tasks respectively, and improves the intelligibility of the recognition results.
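The abstract says that prosodic features such as pronunciation interval and pronunciation energy are read off the attention coefficient distribution, but this page does not give the formulas. The following is a minimal sketch of one way such features could be computed, not the authors' definition. It assumes a hypothetical attention matrix `attn[i][t]` (weight of output token i on encoder frame t) and a hypothetical per-frame energy list `frame_energy`; the function name and the feature definitions (attention-weighted center, gap between adjacent centers, attention-weighted energy) are illustrative assumptions.

```python
def prosodic_features(attn, frame_energy):
    """Illustrative per-token features from attention weights (assumed definitions).

    attn[i][t]      -- attention weight of output token i on encoder frame t
    frame_energy[t] -- energy of encoder frame t
    Returns one dict per token with:
      center -- attention-weighted mean frame index of the token
      gap    -- distance from the previous token's center (0.0 for the first token)
      energy -- attention-weighted mean frame energy
    """
    features = []
    prev_center = None
    for weights in attn:
        total = sum(weights)  # assumed non-zero; attention rows normally sum to 1
        probs = [w / total for w in weights]
        center = sum(t * p for t, p in enumerate(probs))
        energy = sum(p * e for p, e in zip(probs, frame_energy))
        gap = 0.0 if prev_center is None else center - prev_center
        features.append({"center": center, "gap": gap, "energy": energy})
        prev_center = center
    return features


# Toy input: two output tokens attending to six encoder frames.
attn = [
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.5],
]
frame_energy = [1.0, 3.0, 0.0, 0.0, 2.0, 4.0]
feats = prosodic_features(attn, frame_energy)
for i, f in enumerate(feats):
    print(i, f)
```

In this toy input the second token is centered four frames after the first, so its `gap` is large; a pause of that kind between two characters is the sort of cue that could help a decoder prefer one segmentation of homophones over another.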

Key words: speech recognition, end-to-end, semantic ambiguity, attention mechanism, prosodic feature

CLC Number: TP391.4

Cong LIU, Genshun WAN, Jianqing GAO, Zhonghua FU. End-to-end speech recognition method based on prosodic features[J]. Journal of Computer Applications, 2023, 43(2): 380-384.

Figures/Tables 7

Fig.1 Encoder-decoder speech recognition framework based on attention mechanism

Fig.2 Time domain distribution representation based on attention coefficient

Fig.3 Encoder-decoder speech recognition framework based on prosodic features

Tab. 1 Effect of ED speech recognition based on prosodic features

Configuration	Added prosodic features of different dimensions				Accuracy/%
Configuration	$F i p$	$F i l$	$F i g$	$F i e$	Accuracy/%
ED baseline					78.66
ED improved scheme	√				79.45
		√			79.03
			√		79.87
				√	79.18
			√	√	80.05

Tab. 2 Effect of ED speech recognition based on rescoring

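Configuration	Added prosodic features		Rescore		Accuracy/%
Configuration	$F i g$	$F i e$	Conventional scheme	Improved scheme	Accuracy/%
ED baseline					78.66
ED baseline			√		79.41
ED improved scheme				√	79.78
	√	√			80.05
	√	√	√		80.42
	√	√		√	80.49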

Tab. 3 Effect of ED speech recognition on big data

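Configuration	Added prosodic features		Rescore		Accuracy/%
Configuration	$F i g$	$F i e$	Conventional scheme	Improved scheme	Accuracy/%
ED baseline					89.12
ED baseline			√		89.42
ED improved scheme	√	√			89.69
	√	√	√		89.89
	√	√		√	89.95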
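
The abstract reports relative improvements of 5.2% and 5.0%, while Tab. 2 and Tab. 3 list absolute accuracies. One reading that appears consistent with the tabulated numbers is a relative reduction in error rate (100 minus accuracy), measured against the ED baseline with conventional rescoring. This is an assumption, not a definition stated on this page, as is the pairing of Tab. 2 with the 1 000 h task and Tab. 3 with the 10 000 h task. The sketch below checks the arithmetic; the function name `relative_error_reduction` is illustrative.

```python
def relative_error_reduction(baseline_acc, improved_acc):
    """Relative reduction in error rate, in percent.

    Both arguments are accuracies in percent; error rate is taken as 100 - accuracy.
    """
    baseline_err = 100.0 - baseline_acc
    improved_err = 100.0 - improved_acc
    return 100.0 * (baseline_err - improved_err) / baseline_err


# Tab. 2: ED baseline with conventional rescore (79.41)
# vs. prosodic features with improved rescore (80.49).
rer_tab2 = relative_error_reduction(79.41, 80.49)

# Tab. 3: ED baseline with conventional rescore (89.42)
# vs. prosodic features with improved rescore (89.95).
rer_tab3 = relative_error_reduction(89.42, 89.95)

print(f"Tab. 2: {rer_tab2:.2f}%  Tab. 3: {rer_tab3:.2f}%")
```

Under these assumptions the two values come out at about 5.25% and 5.01%, in line with the 5.2% and 5.0% quoted in the abstract. Measured against the baseline without rescoring (78.66 and 89.12), the same formula gives larger reductions.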

Tab. 4 Examples of change in speech recognition results

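Example	Recognition result

Reference: 因五毛钱产生的纠纷 ("a dispute arising over five mao")
Baseline method: 鹦鹉毛钱产生的纠纷 (因五 misrecognized as the near-homophone 鹦鹉, "parrot")
Improved method: 因五毛钱产生的纠纷

Reference: 往内拨弦接着无名指抬起 ("pluck the string inward, then lift the ring finger")
Baseline method: 往内拨衔接着无名指抬起 (弦接 misrecognized as the homophone 衔接, "to link up")
Improved method: 往内拨弦接着无名指抬起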
