End-to-end speech recognition method based on prosodic features

doi:10.11772/j.issn.1001-9081.2022010009

Journal of Computer Applications, 2023, Vol. 43, Issue (2): 380-384. DOI: 10.11772/j.issn.1001-9081.2022010009

Special topic: Artificial intelligence

Cong LIU¹, Genshun WAN¹, Jianqing GAO¹, Zhonghua FU²

¹ AI Institute, iFLYTEK Company Limited, Hefei, Anhui 230088, China
² Xi'an iFLYTEK Hyper-brain Information Technology Company Limited, Xi'an, Shaanxi 710000, China

Received: 2022-01-06 Revised: 2022-04-06 Accepted: 2022-04-11 Online: 2022-05-24 Published: 2023-02-10
Corresponding author: Genshun WAN
About the authors:
LIU Cong, born in 1984 in Tongling, Anhui, male, Ph.D., senior engineer, CCF member. His research interests include speech recognition and face recognition.
GAO Jianqing, born in 1983 in Huainan, Anhui, male, Ph.D., senior engineer, CCF member. His research interests include speech recognition and speech information processing.
FU Zhonghua, born in 1977 in Shiyan, Hubei, male, Ph.D., associate professor, CCF member. His research interests include hearing and audio, and speech signal processing.
Supported by:
Scientific and Technological Innovation 2030 Major Project of "New Generation Artificial Intelligence" (2020AAA0103600)

Abstract

Traditional speech recognition systems are data-driven and rely on a language model to choose the optimal decoding path. As a result, in some scenarios the decoded result has the right pronunciation but the wrong characters. To address this problem, an end-to-end speech recognition method assisted by prosodic features was proposed, which uses the prosodic information in speech to raise the probability of the correct Chinese character combinations in the language model. Built on an attention-based encoder-decoder speech recognition framework, the method first extracts prosodic features such as pronunciation interval and pronunciation energy from the coefficient distribution of the attention mechanism. The prosodic features are then combined with the decoder, which significantly improves recognition accuracy for inputs with identical or similar pronunciation and for semantically ambiguous cases. Experimental results show that, compared with the baseline end-to-end speech recognition method, the proposed method achieves relative accuracy improvements of 5.2% and 5.0% on the 1 000 h and 10 000 h speech recognition tasks, respectively, and further improves the intelligibility of the recognition results.

Key words: speech recognition, end-to-end, semantic ambiguity, attention mechanism, prosodic feature
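
The abstract says that pronunciation interval and pronunciation energy are read off the attention coefficient distribution, but this page does not give the formulas. The following is therefore only an illustrative sketch of one way such features could be derived, not the authors' definition. The function name `prosody_from_attention`, the use of the attention-weighted mean frame position as a token's time stamp, the attention-weighted frame energy, and the 10 ms frame shift are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def prosody_from_attention(
    attn: Sequence[Sequence[float]],
    frame_energy: Sequence[float],
    frame_shift_s: float = 0.01,  # assumed 10 ms encoder frame shift
) -> Tuple[List[float], List[float]]:
    """Hypothetical prosodic features per output token.

    attn[i][t] is the attention weight of output token i on encoder frame t.
    Returns (intervals, energies):
      intervals[i]: time in seconds between the attention centers of
                    token i-1 and token i (0.0 for the first token)
      energies[i]:  attention-weighted average of the frame energies
    """
    centers: List[float] = []
    energies: List[float] = []
    for row in attn:
        total = sum(row)
        if total <= 0.0:
            raise ValueError("each attention row needs positive total weight")
        # Expected frame index under the attention distribution, in seconds.
        center = sum(t * w for t, w in enumerate(row)) / total
        centers.append(center * frame_shift_s)
        # Energy of the frames the token attends to, weighted by attention.
        energies.append(sum(w * e for w, e in zip(row, frame_energy)) / total)
    intervals = [0.0] + [b - a for a, b in zip(centers, centers[1:])]
    return intervals, energies


# Toy input: two tokens attending to frames 0-1 and frames 4-5.
attn = [[0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.5, 0.5]]
frame_energy = [1.0, 1.0, 0.0, 0.0, 3.0, 3.0]
intervals, energies = prosody_from_attention(attn, frame_energy)
```

In this toy case the second token's attention center lies four frames after the first, so its interval is about 0.04 s, and the two energies are 1.0 and 3.0. How the real system feeds such features to the decoder is not specified on this page.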

CLC number: TP391.4

Cite this article: Cong LIU, Genshun WAN, Jianqing GAO, Zhonghua FU. End-to-end speech recognition method based on prosodic features[J]. Journal of Computer Applications, 2023, 43(2): 380-384.

Figures/Tables 7

Fig. 1 Encoder-decoder speech recognition framework based on attention mechanism

Fig. 2 Time domain distribution representation based on attention coefficient

Fig. 3 Encoder-decoder speech recognition framework based on prosodic features

Tab. 1 Effect of encoder-decoder (ED) speech recognition based on prosodic features

Configuration	Added prosodic features of different dimensions				Accuracy/%
	$F i p$	$F i l$	$F i g$	$F i e$	
ED baseline					78.66
ED improved scheme	√				79.45
		√			79.03
			√		79.87
				√	79.18
			√	√	80.05

Tab. 2 Effect of ED speech recognition based on second-pass rescoring

Configuration	Added prosodic features		Rescore		Accuracy/%
	$F i g$	$F i e$	Conventional scheme	Improved scheme	
ED baseline					78.66
ED baseline			√		79.41
ED improved scheme				√	79.78
	√	√			80.05
	√	√	√		80.42
	√	√		√	80.49

Tab. 3 Effect of ED speech recognition on big data

Configuration	Added prosodic features		Rescore		Accuracy/%
	$F i g$	$F i e$	Conventional scheme	Improved scheme	
ED baseline					89.12
ED baseline			√		89.42
ED improved scheme	√	√			89.69
	√	√	√		89.89
	√	√		√	89.95
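
The 5.2% and 5.0% figures quoted in the abstract can be reproduced from the accuracies in Tab. 2 and Tab. 3 if they are read as relative reductions in error rate (100 minus accuracy) between the baseline with conventional rescoring and the improved scheme with improved rescoring. That reading, and the assumption that Tab. 2 corresponds to the 1 000 h task, are inferences from the numbers and are not stated on this page. The helper name `relative_error_reduction` is hypothetical.

```python
def relative_error_reduction(acc_ref: float, acc_new: float) -> float:
    """Relative reduction in error rate, with both accuracies given in percent."""
    err_ref = 100.0 - acc_ref
    err_new = 100.0 - acc_new
    return (err_ref - err_new) / err_ref


# Tab. 2 (assumed 1 000 h task): baseline + conventional rescoring
# versus improved scheme + improved rescoring.
small = relative_error_reduction(79.41, 80.49)

# Tab. 3 (big data, assumed 10 000 h task): same pair of rows.
large = relative_error_reduction(89.42, 89.95)

print(f"{small:.1%}, {large:.1%}")  # prints 5.2%, 5.0%
```

By contrast, the same accuracies give only about a 1% relative gain when measured on accuracy itself, which is why the error-rate reading seems the more plausible one.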

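Tab. 4 Examples of change in speech recognition results

Example 1
Reference: 因五毛钱产生的纠纷 ("a dispute that arose over five mao")
Baseline method: 鹦鹉毛钱产生的纠纷 (因五, yīn wǔ, misrecognized as the similar-sounding 鹦鹉, yīngwǔ, "parrot")
Improved method: 因五毛钱产生的纠纷

Example 2
Reference: 往内拨弦接着无名指抬起 ("pluck the string inward, then lift the ring finger")
Baseline method: 往内拨衔接着无名指抬起 (弦接, xián jiē, misrecognized as the homophone 衔接, xiánjiē, "to connect")
Improved method: 往内拨弦接着无名指抬起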
