Journal of Computer Applications ›› 2013, Vol. 33 ›› Issue (09): 2546-2549. DOI: 10.11772/j.issn.1001-9081.2013.09.2546

• Artificial intelligence •

Improved algorithm of thematic term extraction based on increment term-set frequency from Chinese document

LIU Xinglin

  1. School of Computer Science, Wuyi University, Jiangmen, Guangdong 529020, China
  • Received: 2013-03-20 Revised: 2013-04-24 Published: 2013-09-01 Online: 2013-10-18
  • Contact: LIU Xinglin
  • About the author: LIU Xinglin (born 1976), male, from Nanxiong, Guangdong; lecturer, Ph.D. Main research interests: intelligent computing, data mining, and text knowledge acquisition.
  • Supported by:

    the National Natural Science Foundation of China; the Natural Science Foundation of Guangdong Province; the Science and Technology Planning Project of Guangdong Province; the Open Project Fund of the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences; the Doctoral Start-up Fund of Wuyi University

Abstract: The thematic term extraction algorithm based on incremental term-set frequency cannot extract compound words. To solve this problem, a text preprocessing step, compound-word recognition, was added to the original algorithm. Compound words in the text were recognized using part-of-speech detection and a word co-occurrence directed graph, and the word segmentation results were corrected accordingly. When generating the candidate set of thematic terms, each occurrence of a word was examined and given a weight according to the position where it appeared; the weights of the same word were then accumulated into a total weight, and the candidate set was generated in descending order of weight. During extraction, each candidate in the set was examined in turn and its increment to the weight of the thematic term set was calculated. If the increment was less than a given threshold, the algorithm stopped; otherwise, the candidate was added to the thematic term set. The experimental results show that the algorithm performs well: the extracted thematic terms reflect the main content of a document more aptly, and thematic term satisfaction is 5 percentage points higher than with the original algorithm.

Key words: thematic term, word co-occurrence directed graph, word position weight, term-set frequency, knowledge acquisition
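
The candidate-ranking and extraction steps described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract gives no position weights, no formula for the weight of a term set, and no threshold value. The names `POSITION_WEIGHTS`, `rank_candidates`, `term_set_frequency`, and `extract_thematic_terms` are hypothetical, and the weight of a term set is assumed here to be the share of sentences that contain at least one term of the set. Compound-word recognition is omitted; the input is assumed to be already segmented and corrected.

```python
from collections import defaultdict

# Hypothetical position weights; the abstract does not state the actual values.
POSITION_WEIGHTS = {"title": 3.0, "first_paragraph": 2.0, "body": 1.0}


def rank_candidates(occurrences):
    """Sum position weights per word and return words sorted by total weight.

    `occurrences` is an iterable of (word, position_label) pairs.
    """
    totals = defaultdict(float)
    for word, position in occurrences:
        totals[word] += POSITION_WEIGHTS.get(position, 1.0)
    # Highest weight first; ties are broken alphabetically for determinism.
    return sorted(totals, key=lambda w: (-totals[w], w))


def term_set_frequency(terms, sentences):
    """Assumed definition: share of sentences containing at least one term."""
    if not sentences or not terms:
        return 0.0
    covered = sum(1 for s in sentences if any(t in s for t in terms))
    return covered / len(sentences)


def extract_thematic_terms(candidates, sentences, threshold=0.1):
    """Add candidates in rank order until the increment drops below threshold."""
    selected = []
    current = 0.0
    for candidate in candidates:
        new = term_set_frequency(selected + [candidate], sentences)
        if new - current < threshold:
            break  # increment too small: extraction ends, as in the abstract
        selected.append(candidate)
        current = new
    return selected


if __name__ == "__main__":
    occurrences = [
        ("algorithm", "title"), ("term", "title"), ("term", "body"),
        ("algorithm", "body"), ("algorithm", "body"), ("graph", "body"),
    ]
    # Each sentence is the set of words it contains after segmentation.
    sentences = [
        {"algorithm", "term"}, {"algorithm"}, {"term", "graph"},
        {"other"}, {"misc"},
    ]
    ranked = rank_candidates(occurrences)
    print(ranked)
    print(extract_thematic_terms(ranked, sentences, threshold=0.1))
```

In this toy input, "graph" appears only in a sentence that the earlier terms already cover, so its increment is zero and extraction stops before adding it. Stopping at the first small increment, rather than skipping that candidate and continuing, follows the wording of the abstract.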

CLC Number: