Application of improved point-wise mutual information in term extraction

Journal of Computer Applications, 2015, Vol. 35, Issue (4): 996-1000. DOI: 10.11772/j.issn.1001-9081.2015.04.0996

DU Liping¹, LI Xiaoge¹, ZHOU Yuanzhe¹, SHAO Chunchang²

1. College of Computer Science and Technology, Xi'an University of Posts and Telecommunications, Xi'an Shaanxi 710121, China;
2. College of Science, Minzu University of China, Beijing 100081, China

Received: 2014-10-30 Revised: 2015-01-13 Online: 2015-04-10 Published: 2015-04-08
Corresponding author: LI Xiaoge
About the authors: DU Liping (born 1987), female, from Baoji, Shaanxi, M.S. candidate; research interests: natural language processing, text data mining. LI Xiaoge (born 1962), male, from Hangzhou, Zhejiang, professor; research interests: natural language processing, data mining, machine learning. ZHOU Yuanzhe (born 1974), male, from Xi'an, Shaanxi, lecturer, M.S.; research interests: natural language processing, machine learning. SHAO Chunchang (born 1987), male, from Zibo, Shandong, M.S. candidate; research interests: natural language processing, data mining, machine learning.
Supported by:
the National Natural Science Foundation of China (61373116); the Graduate Innovation Fund of Xi'an University of Posts and Telecommunications (ZL2013-31).

Abstract:

The traditional Point-wise Mutual Information (PMI) method overestimates the bonding strength between two low-frequency strings that always occur together. To determine the values of the parameter k for which the improved method, PMI^k, overcomes this shortcoming, to solve the problem that some terms cannot be extracted from a segmented corpus because of segmentation errors, and to improve the portability of term extraction systems, an algorithm was proposed that combines PMI^k with two basic filtering rules to extract terms from an unsegmented corpus. Firstly, the bonding strength between two adjacent characters was computed by PMI^k to determine the 2-gram seeds to be extended. Secondly, the bonding strength between each 2-gram seed and the character on its left and the character on its right was computed by PMI^k to decide whether the seed could be extended to a 3-gram, and multi-gram term candidates were obtained by iterating this expansion. Finally, garbage strings among the term candidates were filtered out using the two basic filtering rules to obtain the final terms. The theoretical analysis shows that PMI^k overcomes the shortcoming of PMI when k≥3 (k∈N₊). Experiments on a 1 GB SINA finance blog corpus and a 300 MB Baidu Tieba corpus verify the theoretical analysis; PMI^k achieves higher precision than PMI, and the algorithm shows good portability.
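
The scoring and expansion steps in the abstract can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. It assumes the commonly used definition PMI^k(x, y) = log2(p(x, y)^k / (p(x) p(y))), which the abstract does not spell out. Under that definition, two strings that always co-occur with probability p score -log2 p under PMI, which grows as p shrinks, but (k-2) log2 p under PMI^k, which for k≥3 shrinks as p shrinks; this is one way to read the k≥3 result. The function names, the threshold, the length limit, and the probability estimate are all hypothetical, and the two filtering rules are omitted because the abstract does not state them.

```python
import math
from collections import Counter


def pmi_k(p_xy, p_x, p_y, k=3):
    """Assumed definition: log2(p(x,y)^k / (p(x) * p(y))); k=1 is plain PMI."""
    return k * math.log2(p_xy) - math.log2(p_x) - math.log2(p_y)


def ngram_counts(text, max_n):
    """Count every character n-gram of length 1..max_n in unsegmented text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def extract_candidates(text, k=3, threshold=-8.0, max_n=4):
    """Grow 2-character seeds into longer candidates (hypothetical parameters)."""
    counts = ngram_counts(text, max_n)
    total = len(text)  # rough normaliser shared by all n-gram lengths (assumption)

    def prob(s):
        return counts[s] / total

    def bond(left, right):
        # Bonding strength between two adjacent strings, scored with PMI^k.
        joined = left + right
        if counts[joined] == 0:
            return float("-inf")
        return pmi_k(prob(joined), prob(left), prob(right), k)

    # Step 1: 2-gram seeds are adjacent character pairs with a strong bond.
    frontier = {g for g in counts
                if len(g) == 2 and bond(g[0], g[1]) >= threshold}
    candidates = set(frontier)

    # Step 2: try to attach one character to the left or right of each
    # string kept in the previous round, one length at a time.
    for n in range(3, max_n + 1):
        grown = set()
        for g in [s for s in counts if len(s) == n]:
            left_ok = g[1:] in frontier and bond(g[0], g[1:]) >= threshold
            right_ok = g[:-1] in frontier and bond(g[:-1], g[-1]) >= threshold
            if left_ok or right_ok:
                grown.add(g)
        candidates |= grown
        frontier = grown
    return candidates
```

Calling `extract_candidates(corpus_text, k=3)` returns every string of 2 to `max_n` characters that survived the expansion. The sketch keeps a seed even after it has been extended, which is a simplification; a real system would then apply the filtering rules to discard garbage strings.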

Key words: term extraction, technical term, knowledge acquisition, Point-wise Mutual Information (PMI)

CLC number: TP391.1

DU Liping, LI Xiaoge, ZHOU Yuanzhe, SHAO Chunchang. Application of improved point-wise mutual information in term extraction[J]. Journal of Computer Applications, 2015, 35(4): 996-1000.

