Journal of Computer Applications ›› 2014, Vol. 34 ›› Issue (8): 2197-2201. DOI: 10.11772/j.issn.1001-9081.2014.08.2197

• Papers from the 5th China Conference on Data Mining (CCDM 2014) •

Rule-based tagging method of Chinese ambiguity words

LI Huadong, JIA Zhen, YIN Hongfeng, YANG Yan

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu, Sichuan 610031, China
  • Received: 2014-04-30 Revised: 2014-05-06 Online: 2014-08-01 Published: 2014-08-10
  • Contact: JIA Zhen
  • About the authors: LI Huadong (born 1988), male, from Huanggang, Hubei, master's student; research interests: natural language processing, data mining. JIA Zhen (born 1975), female, from Kaifeng, Henan, lecturer; research interests: information extraction, knowledge engineering. YIN Hongfeng (born 1963), male, from Xiayi, Henan, professor; research interests: semantic search, big data. YANG Yan (born 1964), female, from Chengdu, Sichuan, professor, CCF member; research interests: data mining, computational intelligence, ensemble learning.
  • Supported by:

    National Natural Science Foundation of China

Abstract:

To address the low accuracy of current part-of-speech (POS) tagging for Chinese ambiguity words (words that can belong to more than one part of speech), a tagging method that combines rules with statistical models is proposed. First, three statistical models, the Hidden Markov Model (HMM), Maximum Entropy (ME), and Conditional Random Field (CRF), are used to tag the ambiguity words. Then, an improved mutual information algorithm is applied to acquire POS tagging rules: the rules are obtained by computing the correlation between the target word and the word units that precede and follow it. Finally, the acquired rules are combined with the statistical-model-based tagging algorithms to tag Chinese ambiguity words. Experimental results show that adding the rule algorithm improves the average POS tagging accuracy by about 5%.
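As a rough illustration of the rule-acquisition and combination steps described in the abstract, the following Python sketch scores how strongly a neighboring word is associated with a tag of the target word, keeps the strong associations as rules, and lets a matching rule override the tag proposed by a statistical tagger. It uses plain pointwise mutual information, not the improved variant proposed in the paper, whose details the abstract does not give. The toy samples, the tags `n` and `v`, the thresholds, the rule-over-model priority, and the function names `pmi_rules` and `tag_with_rules` are all hypothetical.

```python
import math
from collections import Counter


def pmi_rules(samples, threshold=0.5, min_count=2):
    """Derive context rules for one target word (illustrative only).

    samples: list of (prev_word, next_word, tag) observations of the target word.
    Returns {(position, context_word): tag} for context/tag pairs whose
    pointwise mutual information is at least `threshold`.
    """
    n = len(samples)
    tag_counts = Counter(tag for _, _, tag in samples)
    ctx_counts = Counter()
    joint_counts = Counter()
    for prev_w, next_w, tag in samples:
        for ctx in (("prev", prev_w), ("next", next_w)):
            ctx_counts[ctx] += 1
            joint_counts[(ctx, tag)] += 1

    best = {}  # ctx -> (tag, pmi); keep the highest-scoring tag per context
    for (ctx, tag), c in joint_counts.items():
        if c < min_count:  # skip pairs seen too rarely to trust
            continue
        pmi = math.log2((c / n) / ((ctx_counts[ctx] / n) * (tag_counts[tag] / n)))
        if pmi >= threshold and pmi > best.get(ctx, (None, float("-inf")))[1]:
            best[ctx] = (tag, pmi)
    return {ctx: tag for ctx, (tag, _) in best.items()}


def tag_with_rules(prev_w, next_w, statistical_tag, rules):
    """Return the rule's tag if a context rule matches, else the model's tag."""
    for ctx in (("prev", prev_w), ("next", next_w)):
        if ctx in rules:
            return rules[ctx]
    return statistical_tag


# Made-up observations of one ambiguous target word (noun "n" or verb "v").
samples = [
    ("进行", "。", "n"),
    ("进行", "工作", "n"),
    ("正在", "问题", "v"),
    ("正在", "方法", "v"),
    ("科学", "表明", "n"),
    ("我们", "问题", "v"),
]
rules = pmi_rules(samples)
```

In this sketch the statistical tagger is represented only by its output tag (`statistical_tag`), so any of the three models named in the abstract could stand behind it; how the paper actually arbitrates between rules and model output is not stated in the abstract.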

CLC Number: