Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (11): 3145-3150. DOI: 10.11772/j.issn.1001-9081.2020122039

• Artificial Intelligence •

Neural machine translation corpus expansion method based on language similarity mining

Can LI1,2,3, Yating YANG1,2,3, Yupeng MA1,2,3, Rui DONG1,2,3

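  1. Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Urumqi, Xinjiang 830000, China
  2. University of Chinese Academy of Sciences, Beijing 100049, China
  3. Xinjiang Laboratory of Minority Speech and Language Information Processing (Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences), Urumqi, Xinjiang 830000, China
  • Received: 2020-12-28 Revised: 2021-05-13 Accepted: 2021-05-19 Online: 2021-03-29 Published: 2021-11-10
  • Corresponding author: Yupeng MA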
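  • About the authors: LI Can, born in 1995 in Jingzhou, Hubei, M. S. candidate. His research interests include natural language processing and machine translation.
    YANG Yating, born in 1984 in Changji, Xinjiang, Ph. D., research fellow, CCF member. Her research interests include natural language processing and machine translation.
    MA Yupeng, born in 1979 in Fukang, Xinjiang, Ph. D., research fellow. His research interests include internet of things and big data analysis.
    DONG Rui, born in 1985 in Weihai, Shandong, Ph. D., associate research fellow, CCF member. His research interests include natural language processing and machine translation.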
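  • Supported by:
    the National Natural Science Foundation of China (U1703133); the National Key Research and Development Program of China (2017YFC0822505-04); the A Type Project of the Chinese Academy of Sciences “Light of West China” Talent Training Plan (2017-XBQNXZ-A-005); the Chinese Academy of Sciences Youth Innovation Promotion Association Program (2017472); the Xinjiang High-level Talent Introduction Program (XinRenSheHan [2017]699)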

Abstract:

To address the persistent scarcity of annotated data in machine translation tasks for low-resource languages, a neural machine translation corpus expansion method based on language similarity mining was proposed. Firstly, Uyghur and Kazakh were treated as a similar language pair and their corpora were mixed. Then, the mixed corpus was processed in three ways, namely Byte Pair Encoding (BPE), syllable segmentation, and BPE based on syllable segmentation, to mine the similarity between Kazakh and Uyghur in depth. Finally, the “Begin-Middle-End (BME)” sequence tagging method was introduced to tag the segmented syllables in the corpus, in order to eliminate some of the ambiguities caused by syllable input. Experimental results on the CWMT2015 Uyghur-Chinese and Kazakh-Chinese parallel corpora show that, compared with an ordinary model trained without special corpus processing and an ordinary model trained on a BPE-processed corpus, the proposed method increases the Bilingual Evaluation Understudy (BLEU) score by 9.66 and 4.55 respectively on Uyghur-Chinese translation, and by 9.44 and 4.36 respectively on Kazakh-Chinese translation. The proposed scheme achieves cross-lingual neural machine translation from Uyghur and Kazakh to Chinese, improves the translation quality of Uyghur-Chinese and Kazakh-Chinese machine translation, and can be applied to corpus processing of Uyghur and Kazakh.
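The corpus mixing and BME tagging steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `mix_corpora` and `bme_tag`, the `_B`/`_M`/`_E` suffix format, the choice to tag one-syllable words as `B`, and the romanized, hand-segmented toy words are all assumptions made for illustration. Syllable segmentation itself is assumed to have been done already.

```python
from typing import List, Tuple

Pair = Tuple[str, str]  # (source sentence, Chinese target sentence)


def mix_corpora(uyghur: List[Pair], kazakh: List[Pair]) -> List[Pair]:
    """Concatenate two parallel corpora that share the same target language."""
    return list(uyghur) + list(kazakh)


def bme_tag(words: List[List[str]], sep: str = "_") -> List[str]:
    """Attach a Begin/Middle/End position tag to each syllable.

    `words` is a sentence given as a list of words, each word being a list
    of already-segmented syllables. The tag records where the syllable sits
    inside its word, so identical syllables in different positions become
    distinct tokens.
    """
    tagged: List[str] = []
    for syllables in words:
        last = len(syllables) - 1
        for i, syllable in enumerate(syllables):
            if i == 0:
                tag = "B"  # assumption: one-syllable words are also tagged B
            elif i == last:
                tag = "E"
            else:
                tag = "M"
            tagged.append(f"{syllable}{sep}{tag}")
    return tagged


# Toy, romanized, hand-segmented input (illustrative only).
sentence = [["ki", "tab"], ["mek", "tep", "ler"], ["su"]]
tokens = bme_tag(sentence)
print(" ".join(tokens))
```

With this tagging, a syllable at the start of a word and the same syllable at the end of a word are fed to the translation model as different tokens, which is one way position tags can reduce the ambiguity of raw syllable input.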

Key words: similar language, corpus expansion, machine translation, Byte Pair Encoding (BPE), syllable segmentation, Byte Pair Encoding (BPE) based on syllable segmentation, “Begin-Middle-End (BME)” sequence tagging method

CLC Number: