Journal of Computer Applications, 2023, Vol. 43, Issue (2): 380-384. DOI: 10.11772/j.issn.1001-9081.2022010009

• Artificial intelligence •

End-to-end speech recognition method based on prosodic features

Cong LIU1, Genshun WAN1, Jianqing GAO1, Zhonghua FU2

  1. AI Institute, iFLYTEK Company Limited, Hefei Anhui 230088, China
  2. Xi’an iFLYTEK Hyper-brain Information Technology Company Limited, Xi’an Shaanxi 710000, China
  • Received: 2022-01-06 Revised: 2022-04-06 Accepted: 2022-04-11 Online: 2022-05-24 Published: 2023-02-10
  • Contact: Genshun WAN
  • About author: LIU Cong, born in 1984, a native of Tongling, Anhui, Ph. D., senior engineer, CCF member. His research interests include speech recognition and face recognition.
    GAO Jianqing, born in 1983, Ph. D., senior engineer. His research interests include speech recognition and speech information processing.
    FU Zhonghua, born in 1977, Ph. D., associate professor. His research interests include hearing and audio, and speech information processing.
  • Supported by:
    Scientific and Technological Innovation 2030 — Major Project of New Generation Artificial Intelligence(2020AAA0103600)




In traditional speech recognition systems, the optimal decoding path is determined by a language model that is constrained by its training data. As a result, in some scenarios the pronunciation is recognized correctly but the output characters are wrong. To use the prosodic information in speech to raise the language-model probability of the correct character combination, an end-to-end speech recognition method based on prosodic features was proposed. Building on the attention-based encoder-decoder speech recognition framework, prosodic features such as pronunciation interval and pronunciation energy were first extracted from the distribution of the attention coefficients. These prosodic features were then combined with the decoder, which significantly improved recognition accuracy in cases of identical or similar pronunciation and semantic ambiguity. Experimental results show that, compared with the baseline end-to-end speech recognition method, the proposed method achieves relative accuracy improvements of 5.2% and 5.0% on 1 000 h and 10 000 h speech recognition tasks, respectively, and improves the intelligibility of the recognition results.

Key words: speech recognition, end-to-end, semantic ambiguity, attention mechanism, prosodic feature
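
The abstract does not give the formulas used to derive prosodic features from the attention coefficients, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that each decoding step yields a normalized attention distribution over encoder frames, that "pronunciation interval" can be approximated by the time between the attention centers of consecutive tokens, and that "pronunciation energy" can be approximated by the attention-weighted frame energy. The function name `prosodic_features`, the `frame_shift_s` value, and the toy data are all hypothetical.

```python
from typing import List, Tuple


def prosodic_features(
    attention: List[List[float]],
    frame_energy: List[float],
    frame_shift_s: float = 0.04,  # assumed encoder frame shift in seconds
) -> List[Tuple[float, float]]:
    """Return one (interval_seconds, energy) pair per decoded token.

    attention[t][i] is the weight that decoding step t places on encoder
    frame i; each row is assumed to sum to 1.
    """
    features = []
    previous_center = None
    for weights in attention:
        # Expected frame index under this step's attention distribution.
        center = sum(i * w for i, w in enumerate(weights))
        # Attention-weighted average of the per-frame energies.
        energy = sum(w * e for w, e in zip(weights, frame_energy))
        # Interval: time elapsed since the previous token's attention center
        # (defined as 0.0 for the first token).
        if previous_center is None:
            interval = 0.0
        else:
            interval = (center - previous_center) * frame_shift_s
        features.append((interval, energy))
        previous_center = center
    return features


# Toy example: 3 decoded tokens attending over 6 encoder frames.
attention = [
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]
frame_energy = [1.0, 3.0, 2.0, 2.0, 0.0, 4.0]
features = prosodic_features(attention, frame_energy)
```

In a full system, per-token features of this kind would be fed to the decoder alongside its usual inputs. How the paper performs that combination is not stated in the abstract and is not modeled here.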


