Rule-based tagging method of Chinese ambiguity words (基于规则的汉语兼类词标注方法)

Journal of Computer Applications, 2014, Vol. 34, Issue (8): 2197-2201. DOI: 10.11772/j.issn.1001-9081.2014.08.2197

Papers from the 5th China Conference on Data Mining (CCDM 2014)

LI Huadong (李华栋), JIA Zhen (贾真), YIN Hongfeng (尹红风), YANG Yan (杨燕)

School of Information Science and Technology, Southwest Jiaotong University, Chengdu, Sichuan 610031, China

Received: 2014-04-30 Revised: 2014-05-06 Online: 2014-08-01 Published: 2014-08-10
Corresponding author: JIA Zhen
About the authors: LI Huadong (born 1988), male, from Huanggang, Hubei, master's student; research interests: natural language processing, data mining. JIA Zhen (born 1975), female, from Kaifeng, Henan, lecturer; research interests: information extraction, knowledge engineering. YIN Hongfeng (born 1963), male, from Xiayi, Henan, professor; research interests: semantic search, big data. YANG Yan (born 1964), female, from Chengdu, Sichuan, professor, CCF member; research interests: data mining, computational intelligence, ensemble learning.
Funding: National Natural Science Foundation of China

Abstract:

To address the low accuracy of current part-of-speech (POS) tagging for Chinese ambiguity words (兼类词, words that can belong to more than one POS category), a tagging method that combines rules with statistical models is proposed. First, three statistical models, the Hidden Markov Model (HMM), the Maximum Entropy (ME) model, and the Conditional Random Field (CRF), are used to tag ambiguity words. Then, an improved mutual information algorithm is applied to acquire POS tagging rules: the rules are obtained by computing the correlation between the target word and the word units before and after it. Finally, the acquired rules are combined with the statistical-model-based POS tagging algorithms to tag ambiguity words. Experimental results show that, after the rule algorithm is added, the average POS tagging accuracy improves by about 5%.
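The abstract does not give the formula for the improved mutual information algorithm, so the following is only a minimal sketch of the general idea, using standard pointwise mutual information (PMI) as a stand-in. It scores how strongly a word immediately before or after the target word is associated with one of the target's tags, keeps the strong associations as rules, and lets a matching rule override the tag proposed by a statistical tagger. The function names, thresholds, and the toy corpus are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def learn_rules(corpus, target, min_pmi=0.5, min_count=2):
    """Learn {(offset, neighbor_word): tag} rules for one ambiguous word.

    corpus: list of sentences, each a list of (word, tag) pairs.
    Plain PMI is used here; the paper's "improved" variant is not specified.
    """
    tag_counts, ctx_counts, joint_counts = Counter(), Counter(), Counter()
    total = 0
    for sent in corpus:
        for i, (word, tag) in enumerate(sent):
            if word != target:
                continue
            for offset in (-1, 1):  # word unit before / after the target
                j = i + offset
                if 0 <= j < len(sent):
                    ctx = (offset, sent[j][0])
                    tag_counts[tag] += 1
                    ctx_counts[ctx] += 1
                    joint_counts[(ctx, tag)] += 1
                    total += 1

    rules, best = {}, {}
    for (ctx, tag), n in joint_counts.items():
        if n < min_count:
            continue  # too rare to trust
        # PMI = log2( p(ctx, tag) / (p(ctx) * p(tag)) )
        pmi = math.log2(n * total / (ctx_counts[ctx] * tag_counts[tag]))
        if pmi >= min_pmi and pmi > best.get(ctx, float("-inf")):
            best[ctx] = pmi
            rules[ctx] = tag
    return rules


def apply_rules(words, model_tags, target, rules):
    """Override the statistical model's tag for `target` when a rule matches."""
    tags = list(model_tags)
    for i, word in enumerate(words):
        if word != target:
            continue
        for offset in (-1, 1):
            j = i + offset
            if 0 <= j < len(words) and (offset, words[j]) in rules:
                tags[i] = rules[(offset, words[j])]
                break
    return tags


# Invented toy corpus: "研究" is tagged as a verb (v) or a noun (n).
corpus = [
    [("我们", "r"), ("研究", "v"), ("问题", "n")],
    [("我们", "r"), ("研究", "v"), ("方法", "n")],
    [("科学", "n"), ("研究", "n"), ("很", "d"), ("重要", "a")],
    [("科学", "n"), ("研究", "n"), ("需要", "v"), ("时间", "n")],
]
rules = learn_rules(corpus, "研究")

# Suppose a statistical tagger (HMM, ME, or CRF) proposed "v" for "研究" here.
corrected = apply_rules(["科学", "研究", "进步"], ["n", "v", "v"], "研究", rules)
print(rules)
print(corrected)
```

In this sketch a rule simply replaces the model's tag. How the paper actually weighs rules against the statistical models is not stated in the abstract, so this override strategy is an assumption.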

CLC number (Chinese Library Classification):

TP391.1

Cite this article: LI Huadong, JIA Zhen, YIN Hongfeng, YANG Yan. Rule-based tagging method of Chinese ambiguity words[J]. Journal of Computer Applications, 2014, 34(8): 2197-2201.

