Journal of Computer Applications (计算机应用) ›› 2010, Vol. 30 ›› Issue (8): 2038-2041.

• Artificial Intelligence •

Mongolian part-of-speech tagging approach based on conditional random field

Ying Yulong (应玉龙)1, Li Miao (李淼)2, Wudabala (乌达巴拉)2, Zhu Hai (朱海)2

  1. Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei
    2.
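  • Received: 2010-02-03 Revised: 2010-04-18 Online: 2010-07-30 Published: 2010-08-01
  • Corresponding author: Ying Yulong (应玉龙)
  • Supported by:
    Important Direction Project of the Knowledge Innovation Program of the Chinese Academy of Sciences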

Abstract: To preserve the rich syntactic and semantic information carried by Mongolian affixes and to reduce the size of the Mongolian dictionary, Mongolian part-of-speech (POS) tagging needs to tag both stems and affixes. To address this, a Mongolian POS tagging approach based on conditional random fields (CRF) is proposed. Exploiting the CRF model's ability to incorporate arbitrary features, the approach makes full use of Mongolian context information and adds new statistical features that capture the mutual influence between morphemes. In a closed test on a POS-tagged Mongolian corpus of 38,000 sentences provided by Inner Mongolia University, the approach reached a tagging accuracy of 96.65%, outperforming a tagging model based on the hidden Markov model (HMM).

Key words: stem, affix, conditional random field (CRF), part-of-speech tagging, morpheme
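
To make the tagging setup more concrete, the following is a minimal sketch of how context features over a sequence of stems and affixes might be assembled before being handed to a CRF toolkit. It is not the authors' feature set: the `Morpheme` type, the `morpheme_features` and `sequence_features` functions, the feature names, the one-position context window, and the placeholder morphemes are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Morpheme:
    """One tagging unit: a stem or an affix (hypothetical type)."""
    text: str
    is_stem: bool


def morpheme_features(seq: List[Morpheme], i: int) -> Dict[str, str]:
    """Build a feature dict for position i using a +/-1 context window.

    The feature names are illustrative assumptions, not the paper's templates.
    """
    cur = seq[i]
    feats = {
        "cur.text": cur.text,
        "cur.kind": "stem" if cur.is_stem else "affix",
    }
    # Left context, or a boundary marker at the start of the sequence.
    if i > 0:
        prev = seq[i - 1]
        feats["prev.text"] = prev.text
        # Conjunction feature: exposes how neighbouring morphemes co-occur,
        # e.g. an affix together with the stem it attaches to.
        feats["prev+cur"] = prev.text + "|" + cur.text
    else:
        feats["BOS"] = "1"
    # Right context, or a boundary marker at the end of the sequence.
    if i < len(seq) - 1:
        feats["next.text"] = seq[i + 1].text
    else:
        feats["EOS"] = "1"
    return feats


def sequence_features(seq: List[Morpheme]) -> List[Dict[str, str]]:
    """Feature dicts for every position in one morpheme sequence."""
    return [morpheme_features(seq, i) for i in range(len(seq))]


# Placeholder morphemes (not real Mongolian data): two words, each stem + affix.
example = [
    Morpheme("stemA", True),
    Morpheme("-af1", False),
    Morpheme("stemB", True),
    Morpheme("-af2", False),
]
features = sequence_features(example)
```

Each position, whether stem or affix, receives its own feature dict and would receive its own POS label during training. The conjunction feature is one simple way to let a model see interactions between adjacent morphemes; the statistical features actually used in the paper are not specified in the abstract.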