Chinese-Vietnamese pseudo-parallel corpus generation based on monolingual language model

Journal of Computer Applications, 2021, Vol. 41, Issue 6: 1652-1658. DOI: 10.11772/j.issn.1001-9081.2020071017

Special Issue: Artificial intelligence

JIA Chengxun^1,2, LAI Hua^1,2, YU Zhengtao^1,2, WEN Yonghua^1,2, YU Zhiqiang^1,2

1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, Yunnan 650504, China;
2. Yunnan Key Laboratory of Artificial Intelligence (Kunming University of Science and Technology), Kunming, Yunnan 650500, China

Received: 2020-07-13 Revised: 2021-01-27 Online: 2021-06-23 Published: 2021-06-10
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61672271, 61732005, 61761026, 61762056, 61866020) and the National Key Research and Development Program of China (2019QY1801).

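Corresponding author: YU Zhengtao
About the authors: JIA Chengxun (born 1994), male, from Chifeng, Inner Mongolia, M.S., research interests: machine translation, natural language processing; LAI Hua (born 1966), male, from Qinzhou, Guangxi, associate professor, M.S., CCF member, research interests: intelligent information processing; YU Zhengtao (born 1970), male (Mongolian ethnicity), from Qujing, Yunnan, professor, Ph.D., CCF member, research interests: natural language processing, machine translation; WEN Yonghua (born 1979), male (Bai ethnicity), from Dali, Yunnan, Ph.D. candidate, CCF member, research interests: machine translation; YU Zhiqiang (born 1983), male, from Tongliao, Inner Mongolia, Ph.D. candidate, research interests: machine translation.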

Abstract

Abstract: Neural machine translation achieves good results on resource-rich languages but performs poorly on low-resource language pairs such as Chinese-Vietnamese because of data scarcity. At present, one of the most effective ways to alleviate this problem is to generate pseudo-parallel data from existing resources. Considering the availability of monolingual data, and building on the back-translation method, a language model trained on a large amount of monolingual data was first fused with the neural machine translation model. Then, during back-translation, language features were incorporated through the language model to generate more standardized, higher-quality pseudo-parallel data. Finally, the generated corpus was added to the original small-scale corpus to train the final translation model. Experimental results on Chinese-Vietnamese translation tasks show that, compared with the ordinary back-translation method, the pseudo-parallel data generated with the fused language model improves the BiLingual Evaluation Understudy (BLEU) score of Chinese-Vietnamese neural machine translation by 1.41 percentage points.

Key words: Chinese-Vietnamese neural machine translation, data augmentation, pseudo-parallel data, monolingual data, language model
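
The pipeline summarized in the abstract (fuse a language model with the translation model, back-translate monolingual text, then add the pseudo-parallel pairs to the original corpus) can be illustrated with a small sketch. The abstract does not state which fusion scheme the authors use, so the log-linear score below (translation log-probability plus a weighted language-model log-probability) is an assumption, as are all names (`fused_score`, `pick_source`, `augment`, `toy_back_translate`), the weight `0.3`, and the toy scores. It is not the authors' implementation.

```python
from typing import Callable, List, Tuple

# (candidate source sentence, translation-model log-prob, language-model log-prob)
Candidate = Tuple[str, float, float]
Pair = Tuple[str, str]


def fused_score(tm_logprob: float, lm_logprob: float, lm_weight: float) -> float:
    # Assumed log-linear fusion: log p_TM(src | tgt) + lm_weight * log p_LM(src).
    return tm_logprob + lm_weight * lm_logprob


def pick_source(candidates: List[Candidate], lm_weight: float) -> str:
    # Keep the back-translation candidate with the highest fused score.
    best = max(candidates, key=lambda c: fused_score(c[1], c[2], lm_weight))
    return best[0]


def augment(
    original: List[Pair],
    monolingual_targets: List[str],
    back_translate: Callable[[str], List[Candidate]],
    lm_weight: float = 0.3,
) -> List[Pair]:
    # Back-translate each monolingual target sentence, then append the
    # resulting pseudo-parallel pairs to the original small corpus.
    pseudo = [(pick_source(back_translate(t), lm_weight), t) for t in monolingual_targets]
    return list(original) + pseudo


def toy_back_translate(target: str) -> List[Candidate]:
    # Hypothetical stand-in for a real back-translation model: it returns
    # two candidates with made-up scores, one literal and one more fluent.
    return [
        (f"literal<{target}>", -1.0, -5.0),
        (f"fluent<{target}>", -1.2, -2.0),
    ]


corpus = augment([("src0", "tgt0")], ["tgt1"], toy_back_translate, lm_weight=0.3)
print(corpus)
```

With the language-model weight set to zero, the sketch falls back to plain back-translation and picks the literal candidate; with a positive weight the more fluent candidate wins, which is the intuition behind using the language model to produce more standardized pseudo-parallel data.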

CLC Number: TP391

JIA Chengxun, LAI Hua, YU Zhengtao, WEN Yonghua, YU Zhiqiang. Chinese-Vietnamese pseudo-parallel corpus generation based on monolingual language model[J]. Journal of Computer Applications, 2021, 41(6): 1652-1658.

