Journal of Computer Applications ›› 2014, Vol. 34 ›› Issue (6): 1694-1698. DOI: 10.11772/j.issn.1001-9081.2014.06.1694

• Virtual reality and digital media •

A multi-stream Tandem feature method for mispronunciation detection

YUAN Hua1,2, CAI Meng1,2, ZHAO Junhong3,4,5, ZHANG Weiqiang1,2, LIU Jia1,2

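  1. Department of Electronic Engineering, Tsinghua University, Beijing 100084, China;
    2. Tsinghua National Laboratory for Information Science and Technology (Tsinghua University), Beijing 100084, China;
    3. Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China;
    4. University of Chinese Academy of Sciences, Beijing 100190, China;
    5. State Key Laboratory of Transducer Technology (Chinese Academy of Sciences), Beijing 100190, China.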
  • Received: 2013-12-16 Revised: 2014-01-21 Online: 2014-06-01 Published: 2014-07-02
  • Corresponding author: YUAN Hua
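  • About the authors: YUAN Hua (born 1985), female, from Xishui, Hubei, Ph.D. candidate; research interest: mispronunciation detection. CAI Meng (born 1987), male, from Cangzhou, Hebei, Ph.D. candidate; research interest: automatic speech recognition. ZHAO Junhong (born 1987), female, from Heze, Shandong, Ph.D. candidate; research interest: speech synthesis. ZHANG Weiqiang (born 1979), male, from Xiongxian, Hebei, assistant researcher, Ph.D.; research interest: pattern recognition. LIU Jia (born 1954), male, from Fuzhou, Fujian, professor, Ph.D.; research interest: speech signal processing.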
  • Supported by:

    the National Natural Science Foundation of China

Multi-stream based Tandem feature method for mispronunciation detection

YUAN Hua1,2, CAI Meng1,2, ZHAO Junhong3,4,5, ZHANG Weiqiang1,2, LIU Jia1,2

  1. Department of Electronic Engineering, Tsinghua University, Beijing 100084, China;
    2. Tsinghua National Laboratory for Information Science and Technology (Tsinghua University), Beijing 100084, China;
    3. Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China;
    4. University of Chinese Academy of Sciences, Beijing 100190, China;
    5. State Key Laboratory of Transducer Technology (Chinese Academy of Sciences), Beijing 100190, China.
  • Received: 2013-12-16 Revised: 2014-01-21 Online: 2014-06-01 Published: 2014-07-02
  • Contact: YUAN Hua

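Abstract:

To address the limited availability of annotated pronunciation data in mispronunciation detection, this paper proposes using other data within the Tandem system framework to improve the discriminability of the features. Taking English spoken by Chinese learners as the object of study, three relatively easy-to-obtain data types were selected as auxiliary data: unlabeled (uncorrected) pronunciation data, native Mandarin data, and native English data. Experimental results show that all of these data types effectively improve system performance, with the unlabeled data performing best. The paper also compares how system performance is affected by different context-frame lengths, by shallow versus deep neural networks, represented by the Multi-Layer Perceptron (MLP) and the Deep Neural Network (DNN) respectively, and by different Tandem feature structures. Finally, a multi-stream fusion strategy is used to further improve performance: the Tandem feature that merges the DNN-based unlabeled data stream with the native English data stream performs best, improving recognition accuracy by 7.96% and mispronunciation-type diagnostic accuracy by 14.71% over the baseline system.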

Abstract:

To cope with the scarcity of labeled pronunciation data in mispronunciation detection, auxiliary data were used within the Tandem system framework to improve the discriminability of the features. With English spoken by Chinese learners as the object of study, three relatively easy-to-obtain types of auxiliary data were selected: unlabeled data, native Mandarin data, and native English data. The experiments show that all three types of data effectively improve system performance, with the unlabeled data performing best. The effects on system performance of different frame-context lengths, of shallow and deep neural networks, represented by the Multi-Layer Perceptron (MLP) and the Deep Neural Network (DNN) respectively, and of different Tandem feature structures were also compared. Finally, a strategy of merging multiple data streams was used to further improve system performance; the best performance was achieved by combining the DNN-based unlabeled data stream with the native English data stream. Compared with the baseline system, recognition accuracy increased by 7.96% and the diagnostic accuracy of mispronunciation type increased by 14.71%.
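
As a rough illustration of the multi-stream Tandem idea summarized in the abstract, the Python sketch below splices each acoustic frame with its neighbouring frames, passes the spliced vector through one network per data stream, and appends the log posteriors of every stream to the original frame. It is not the authors' implementation: the function names (`splice`, `make_toy_net`, `tandem_features`), the feature dimension of 13, the 10 output classes, and the context of 4 frames are all assumptions made for illustration, and the "networks" are random linear layers standing in for trained MLPs or DNNs. Tandem pipelines often also decorrelate or reduce the posterior features (for example with PCA) before concatenation; that step is omitted here.

```python
import math
import random


def splice(frames, context):
    """Stack each frame with `context` neighbours on each side (edge frames repeat)."""
    n = len(frames)
    spliced = []
    for t in range(n):
        window = []
        for k in range(t - context, t + context + 1):
            window.extend(frames[min(max(k, 0), n - 1)])
        spliced.append(window)
    return spliced


def make_toy_net(in_dim, n_classes, seed):
    """Hypothetical stand-in for a trained MLP/DNN: random linear layer + softmax."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(n_classes)]

    def posteriors(x):
        scores = [sum(w * v for w, v in zip(row, x)) for row in weights]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        return [e / total for e in exps]

    return posteriors


def tandem_features(frames, nets, context):
    """Append the log posteriors of every stream's network to each base frame."""
    features = []
    for base, x in zip(frames, splice(frames, context)):
        feat = list(base)
        for net in nets:  # one network per auxiliary data stream
            feat.extend(math.log(p) for p in net(x))
        features.append(feat)
    return features


# Toy usage: 5 frames of 13-dimensional features, two streams of 10 classes each.
rng = random.Random(0)
frames = [[rng.gauss(0.0, 1.0) for _ in range(13)] for _ in range(5)]
context = 4
in_dim = 13 * (2 * context + 1)
nets = [make_toy_net(in_dim, 10, seed=1), make_toy_net(in_dim, 10, seed=2)]
feats = tandem_features(frames, nets, context)
print(len(feats), len(feats[0]))
```

In this sketch each output vector has 13 + 2 × 10 = 33 dimensions. Merging streams by simple concatenation is only one possible fusion choice; the abstract does not specify the exact merging structure.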

CLC Number: