Journal of Computer Applications ›› 2013, Vol. 33 ›› Issue (03): 780-783. DOI: 10.3724/SP.J.1087.2013.00780

• Artificial Intelligence

Unsupervised-learning-based method for resolving word segmentation ambiguity in specialized domains

XIU Chi1*, SONG Rou1,2

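  1. College of Computer Science, Beijing University of Technology, Beijing 100022, China;
    2. College of Information Science, Beijing Language and Culture University, Beijing 100083, China
  • Received: 2012-09-26  Revised: 2012-10-31  Online: 2013-03-01  Published: 2013-03-01
  • Corresponding author: XIU Chi
  • About the authors: XIU Chi (born 1984), female, from Yingkou, Liaoning, Ph.D. candidate; main research interests: Chinese word segmentation, statistical machine translation. SONG Rou (born 1947), male, from Suzhou, Jiangsu, professor and doctoral supervisor, M.S.; main research interests: intelligent software tools, Chinese word segmentation, computer-aided Chinese proofreading, generalized topic theory.
  • Supported by:

    National Natural Science Foundation of China (60872121).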

Disambiguation of domain word segmentation based on unsupervised learning

XIU Chi1*, SONG Rou1,2

  1. College of Computer Science, Beijing University of Technology, Beijing 100022, China;
    2. College of Information Science, Beijing Language and Culture University, Beijing 100083, China
  • Received: 2012-09-26  Revised: 2012-10-31  Online: 2013-03-01  Published: 2013-03-01
  • Contact: Chi XIU

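Abstract: In Chinese natural language processing, word segmentation in specialized domains is far more difficult than in the general domain. For segmentation ambiguity in specialized domains in particular, no effective solution has been found. To address this problem, a disambiguation method for domain-specific word segmentation based on unsupervised learning is proposed. The string frequency, mutual information, and boundary entropy computed from the test corpus itself serve as the evaluation criteria for segmentation ambiguity, and these three kinds of information are used both independently and in combination to resolve it. Experimental results show that the method can effectively resolve segmentation ambiguity in specialized domains and clearly improves segmentation performance.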

Keywords: domain-specific word segmentation, segmentation ambiguity, string frequency, mutual information, boundary entropy

Abstract: In Chinese natural language processing, domain word segmentation is much more difficult than general word segmentation. Segmentation ambiguity in particular has lacked an effective solution. To address this problem, an unsupervised learning method for resolving domain segmentation ambiguity is proposed. String frequency, mutual information, and boundary entropy are selected as the evaluation criteria for segmentation ambiguity, and these three kinds of information are used both individually and in combination to solve the problem. The experimental results suggest that the proposed method can resolve domain segmentation ambiguity efficiently and effectively.

Key words: domain word segmentation, segmentation ambiguity, string frequency, mutual information, boundary entropy
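
The three statistics named in the abstract can be illustrated with a short sketch. The abstract does not give the paper's exact formulas or how the statistics are combined, so the code below uses common textbook definitions as an assumption: raw substring counts for string frequency, pointwise mutual information between two adjacent strings, and the entropy of the characters that neighbour a string for boundary entropy. The function names and the use of the text length as the probability denominator are illustrative choices, not the authors' implementation.

```python
import math
from collections import Counter


def substring_counts(text, max_len):
    """Count every substring of length 1..max_len in the raw text."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def mutual_information(counts, total_chars, left, right):
    """Pointwise mutual information (in bits) between two adjacent strings.

    Assumption: probabilities are approximated as count / total_chars.
    Strings longer than the max_len used for counting have count 0 and
    therefore yield -inf.
    """
    joint = counts[left + right]
    if joint == 0 or counts[left] == 0 or counts[right] == 0:
        return float("-inf")
    p_joint = joint / total_chars
    p_left = counts[left] / total_chars
    p_right = counts[right] / total_chars
    return math.log2(p_joint / (p_left * p_right))


def boundary_entropy(text, s, side="right"):
    """Entropy (in bits) of the characters adjacent to s on one side."""
    neighbours = Counter()
    start = text.find(s)
    while start != -1:
        pos = start + len(s) if side == "right" else start - 1
        if 0 <= pos < len(text):
            neighbours[text[pos]] += 1
        start = text.find(s, start + 1)
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return sum(c / total * math.log2(total / c) for c in neighbours.values())


if __name__ == "__main__":
    # Toy corpus (made up for illustration) containing the overlapping
    # candidates "结合" and "合成".
    corpus = "结合成分子,结合成功,合成材料,合成分子"
    counts = substring_counts(corpus, 4)
    for candidate in ("结合", "合成"):
        print(candidate,
              counts[candidate],
              boundary_entropy(corpus, candidate, "left"),
              boundary_entropy(corpus, candidate, "right"))
    print(mutual_information(counts, len(corpus), "结", "合"))
```

In this kind of sketch, a candidate word that is frequent, has high internal mutual information, and has high entropy on both boundaries would be favoured when two segmentations of an ambiguous string compete; how the paper weighs the three signals is not stated in the abstract.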

CLC number: