Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (6): 1652-1658.DOI: 10.11772/j.issn.1001-9081.2020071017

Special Issue: Artificial Intelligence

• Artificial intelligence •

Chinese-Vietnamese pseudo-parallel corpus generation based on monolingual language model

JIA Chengxun1,2, LAI Hua1,2, YU Zhengtao1,2, WEN Yonghua1,2, YU Zhiqiang1,2   

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming Yunnan 650504, China;
  2. Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming Yunnan 650500, China
  • Received: 2020-07-13 Revised: 2021-01-27 Online: 2021-06-10 Published: 2021-06-23
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61672271, 61732005, 61761026, 61762056, 61866020) and the National Key Research and Development Program of China (2019QY1801).


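  • Corresponding author: YU Zhengtao
  • About the authors: JIA Chengxun (贾承勋), born in 1994, male, from Chifeng, Inner Mongolia, M.S.; his research interests include machine translation and natural language processing. LAI Hua (赖华), born in 1966, male, from Qinzhou, Guangxi, associate professor, M.S., CCF member; his research interests include intelligent information processing. YU Zhengtao (余正涛), born in 1970, male (Mongolian ethnicity), from Qujing, Yunnan, professor, Ph.D., CCF member; his research interests include natural language processing and machine translation. WEN Yonghua (文永华), born in 1979, male (Bai ethnicity), from Dali, Yunnan, Ph.D. candidate, CCF member; his research interests include machine translation. YU Zhiqiang (于志强), born in 1983, male, from Tongliao, Inner Mongolia, Ph.D. candidate; his research interests include machine translation.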

Abstract: Neural machine translation achieves good results on resource-rich languages but performs poorly on low-resource language pairs such as Chinese-Vietnamese because of data scarcity. At present, one of the most effective ways to alleviate this problem is to generate pseudo-parallel data from existing resources. Considering the availability of monolingual data, and building on the back-translation method, a language model trained on a large amount of monolingual data was first fused with the neural machine translation model. Then, language characteristics were incorporated through the language model during back-translation, so that more standardized and higher-quality pseudo-parallel data were generated. Finally, the generated corpus was added to the original small-scale corpus to train the final translation model. Experimental results on Chinese-Vietnamese translation tasks show that, compared with ordinary back-translation, the pseudo-parallel data generated with the fused language model improved the BiLingual Evaluation Understudy (BLEU) score of Chinese-Vietnamese neural machine translation by 1.41 percentage points.

Key words: Chinese-Vietnamese neural machine translation, data augmentation, pseudo-parallel data, monolingual data, language model
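
The abstract does not state how the language model is fused with the translation model. As an illustration only, the sketch below assumes shallow fusion, a common approach in which the two models' next-token log-probabilities are combined with a weight at decoding time. The function names, the toy distributions, and the weight value are hypothetical and are not taken from the paper.

```python
import math

def fuse_scores(tm_logp, lm_logp, lm_weight=0.3, floor=-20.0):
    """Combine next-token log-probabilities from a translation model (TM)
    and a language model (LM): score = log p_TM + lm_weight * log p_LM.
    Tokens missing from the LM distribution get a low floor score.
    This is an assumed shallow-fusion scheme, not the paper's exact method."""
    return {tok: lp + lm_weight * lm_logp.get(tok, floor)
            for tok, lp in tm_logp.items()}

def pick_next_token(tm_logp, lm_logp, lm_weight=0.3):
    """Greedy choice of the next token under the fused score."""
    fused = fuse_scores(tm_logp, lm_logp, lm_weight)
    return max(fused, key=fused.get)

# Toy next-token distributions (hypothetical values).
tm = {"tok_a": math.log(0.5), "tok_b": math.log(0.4), "tok_c": math.log(0.1)}
lm = {"tok_a": math.log(0.1), "tok_b": math.log(0.8), "tok_c": math.log(0.1)}

print(pick_next_token(tm, lm, lm_weight=0.0))  # TM alone prefers tok_a
print(pick_next_token(tm, lm, lm_weight=0.5))  # the LM shifts the choice to tok_b
```

In a back-translation setting, a fused score of this kind would be applied at every decoding step of the reverse translation model, so that the generated side of each pseudo-parallel pair is also judged fluent by the monolingual language model.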

