Reordering table reconstruction model for Chinese-Uyghur machine translation
PAN Yirong1,2,3, LI Xiao1,3, YANG Yating1,3, MI Chenggang1,3, DONG Rui1,3
1. Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi Xinjiang 830011, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China; 3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi Xinjiang 830011, China
Abstract: Focusing on the context independence and data sparsity problems of lexicalized reordering models in machine translation, a reordering table reconstruction model based on semantic content was proposed to predict reordering orientations and probabilities. Firstly, a continuous distributed representation approach was employed to acquire feature vectors of the reordering rules. Secondly, Recurrent Neural Networks (RNN) were utilized to predict the reordering orientation and probability of each reordering rule represented as a dense vector. Finally, the original reordering table was filtered and reconstructed with a more reasonable reordering probability distribution, which improves the accuracy of the reordering information in the reordering model and reduces the size of the reordering table to speed up the subsequent decoding process. The experimental results show that the reordering table reconstruction model provides a gain of 0.39 BLEU points on the Chinese-to-Uyghur machine translation task.
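To make the three-step pipeline described above concrete, the following is a minimal Python/PyTorch sketch, not the authors' implementation: it assumes a Moses-style lexicalized reordering table whose rules pair a source and a target phrase with orientation probabilities (monotone, swap, discontinuous), substitutes a GRU for the unspecified RNN variant, and uses hypothetical names, hyperparameters, and filtering threshold.

import torch
import torch.nn as nn

ORIENTATIONS = 3  # monotone, swap, discontinuous

class RuleOrientationRNN(nn.Module):
    """Encode a reordering rule (token id sequence) with a GRU and predict
    a probability distribution over the three reordering orientations."""
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # continuous distributed representation
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, ORIENTATIONS)

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        _, h = self.rnn(self.emb(token_ids))           # h: (1, batch, hidden_dim)
        return torch.softmax(self.out(h.squeeze(0)), dim=-1)

def reconstruct_table(rules, model, vocab, keep_threshold=0.5):
    """Re-score every (source, target, old_probs) rule with the trained model
    and keep only rules whose predicted distribution is confident enough."""
    new_table = []
    with torch.no_grad():
        for src, tgt, _old_probs in rules:
            ids = torch.tensor([[vocab.get(w, 0) for w in (src + " " + tgt).split()]])
            probs = model(ids).squeeze(0)              # predicted orientation distribution
            if probs.max().item() >= keep_threshold:   # drop near-uniform, uninformative rules
                new_table.append((src, tgt, probs.tolist()))
    return new_table

Filtering out rules whose predicted orientation distribution stays close to uniform is one plausible way to realize the "filter and reconstruct" step while shrinking the table; the exact criterion used in the paper is not specified in the abstract.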
PAN Yirong, LI Xiao, YANG Yating, MI Chenggang, DONG Rui. Reordering table reconstruction model for Chinese-Uyghur machine translation. Journal of Computer Applications, 2018, 38(5): 1283-1288.