Journal of Computer Applications ›› 2015, Vol. 35 ›› Issue (4): 996-1000. DOI: 10.11772/j.issn.1001-9081.2015.04.0996

• Artificial Intelligence •

Application of improved point-wise mutual information in term extraction

DU Liping1, LI Xiaoge1, ZHOU Yuanzhe1, SHAO Chunchang2

  1. College of Computer Science and Technology, Xi'an University of Posts and Telecommunications, Xi'an Shaanxi 710121, China;
  2. College of Science, Minzu University of China, Beijing 100081, China
  • Received: 2014-10-30 Revised: 2015-01-13 Online: 2015-04-10 Published: 2015-04-08
  • Corresponding author: LI Xiaoge
  • About the authors: DU Liping (born 1987), female, from Baoji, Shaanxi, master's student; research interests: natural language processing, text data mining. LI Xiaoge (born 1962), male, from Hangzhou, Zhejiang, professor; research interests: natural language processing, data mining, machine learning. ZHOU Yuanzhe (born 1974), male, from Xi'an, Shaanxi, lecturer, M.S.; research interests: natural language processing, machine learning. SHAO Chunchang (born 1987), male, from Zibo, Shandong, master's student; research interests: natural language processing, data mining, machine learning.
  • Supported by: the National Natural Science Foundation of China (61373116); the Graduate Innovation Fund of Xi'an University of Posts and Telecommunications (ZL2013-31).

Abstract:

The traditional Point-wise Mutual Information (PMI) method overestimates the association strength between two low-frequency strings that always occur together. This work aims to determine which values of the parameter k allow the improved PMI method (PMIk) to overcome this shortcoming, to solve the problem that some terms cannot be extracted from a segmented corpus because of word segmentation errors, and to improve the portability of term extraction systems. To these ends, an algorithm is proposed that combines PMIk with two basic filtering rules to extract terms from an unsegmented corpus. Firstly, PMIk is used to compute the association strength between two adjacent characters, which determines the 2-gram seeds to be extended. Secondly, PMIk is used to compute the association strength between a 2-gram seed and the character to its left and the character to its right, which determines whether the 2-gram can be extended to a 3-gram; iterating this extension yields multi-gram term candidates. Finally, the two basic filtering rules remove garbage strings from the term candidates to give the final result. Theoretical analysis shows that PMIk overcomes the shortcoming of PMI when k≥3 (k∈N+). Experiments on a 1 GB SINA finance blog corpus and a 300 MB Baidu Tieba corpus verify the theoretical analysis; PMIk achieves higher precision than PMI, and the algorithm has good portability.
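The threshold on k can be illustrated with a short calculation. The abstract does not state the formula, so this sketch assumes the commonly used definition of PMIk (written PMI^k below), in which the joint probability is raised to the power k. The symbols n (number of co-occurrences) and N (corpus size) are introduced here for illustration only.

```latex
% Assumed definitions (not quoted from the paper):
\begin{align*}
\mathrm{PMI}(x,y)     &= \log\frac{p(x,y)}{p(x)\,p(y)}, &
\mathrm{PMI}^{k}(x,y) &= \log\frac{p(x,y)^{k}}{p(x)\,p(y)}.
\end{align*}
% Extreme case: x and y each occur n times and always together,
% so p(x) = p(y) = p(x,y) = n/N.
\begin{align*}
\mathrm{PMI}(x,y)     &= \log\frac{n/N}{(n/N)^{2}} = \log\frac{N}{n}, \\
\mathrm{PMI}^{k}(x,y) &= \log\frac{(n/N)^{k}}{(n/N)^{2}} = (k-2)\,\log\frac{n}{N}.
\end{align*}
```

Under these assumptions, PMI grows as n shrinks, so the rarest always-together pairs receive the highest scores. For k = 2 the score is 0 whatever n is, and for k≥3 the score increases with n, so frequent pairs are no longer outranked by rare ones. This is consistent with the threshold stated in the abstract, although the paper's own analysis may proceed differently.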

Key words: term extraction, technical term, knowledge acquisition, Point-wise Mutual Information (PMI)
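The seed-and-extend procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`ngram_counts`, `pmi_k`, `extract_candidates`), the use of a single threshold for every n-gram length, the `max_n` cap, and the estimate of probabilities as count divided by text length are all assumptions. The two filtering rules are not specified in the abstract, so the sketch stops at the candidate stage.

```python
import math
from collections import Counter


def ngram_counts(text, max_n):
    """Count every character n-gram of length 1..max_n in the raw, unsegmented text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def pmi_k(left, right, counts, total, k=3):
    """Assumed PMI^k of two adjacent strings: log(p(left+right)^k / (p(left) * p(right)))."""
    joint = counts[left + right]
    if joint == 0:
        return float("-inf")  # the two strings never occur side by side
    p_joint = joint / total
    p_left = counts[left] / total
    p_right = counts[right] / total
    return k * math.log(p_joint) - math.log(p_left) - math.log(p_right)


def extract_candidates(text, threshold, k=3, max_n=4):
    """Return term candidates found by growing 2-gram seeds one character at a time."""
    counts = ngram_counts(text, max_n)
    total = len(text)  # assumption: normalise all counts by the text length

    # Step 1: 2-gram seeds are adjacent character pairs with a high enough score.
    frontier = {
        g for g in counts
        if len(g) == 2 and pmi_k(g[0], g[1], counts, total, k) >= threshold
    }
    candidates = set(frontier)

    # Step 2: try to extend each accepted (n-1)-gram by one character on either side.
    for n in range(3, max_n + 1):
        extended = set()
        for g in counts:
            if len(g) != n:
                continue
            grows_left = g[1:] in frontier and \
                pmi_k(g[0], g[1:], counts, total, k) >= threshold
            grows_right = g[:-1] in frontier and \
                pmi_k(g[:-1], g[-1], counts, total, k) >= threshold
            if grows_left or grows_right:
                extended.add(g)
        candidates |= extended
        frontier = extended  # only newly accepted n-grams are extended further
    return candidates


if __name__ == "__main__":
    # Toy ASCII text standing in for an unsegmented corpus; the threshold is arbitrary.
    toy_text = "abc" * 5 + "defghij"
    print(sorted(extract_candidates(toy_text, threshold=-2.0)))
```

In practice the input would be raw Chinese text, and the threshold and `max_n` would need tuning on the target corpus. Counting all n-grams in memory, as done here, is only workable for small texts; a corpus of the size mentioned in the abstract would need a more economical counting scheme.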

CLC Number: