Journal of Computer Applications, 2015, Vol. 35, Issue (12): 3506-3510. DOI: 10.11772/j.issn.1001-9081.2015.12.3506

• Artificial Intelligence •

Improvement of term frequency-inverse document frequency algorithm based on Document Triage

LI Zhenjun, ZHOU Zhurong

  1. College of Computer and Information Science, Southwest University, Chongqing 400715, China
  • Received: 2015-06-05 Revised: 2015-07-24 Online: 2015-12-10 Published: 2015-12-10
  • Corresponding author: ZHOU Zhurong (born 1970), male, from Chongqing, associate professor, Ph.D.; main research interests: semantic web, service-oriented computing
  • About the authors: LI Zhenjun (born 1991), male, from Chongqing, master's student, CCF member; main research interest: text mining.

Abstract: The Term Frequency-Inverse Document Frequency (TF-IDF) algorithm does not account for the importance of a feature term within the document itself when computing term weights. To address this problem, users' reading behavior is used to improve TF-IDF. Document Triage is introduced into TF-IDF: the Interest Profile Manager (IPM) collects data about users' behavior while reading, and a score is computed for each document from these data. Because the content that users annotate is often important in the text or reflects the users' interests, annotated terms are given greater weight. Document scores and users' annotation information are introduced into TF-IDF as factors, yielding an improved term weighting algorithm named Document Triage-Term Frequency-Inverse Document Frequency (DT-TF-IDF). Experimental results show that the recall, the precision, and their harmonic mean are all higher for DT-TF-IDF than for the traditional TF-IDF algorithm. DT-TF-IDF is therefore more effective than traditional TF-IDF and improves the accuracy of text similarity calculation.
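
The abstract does not state the DT-TF-IDF formula, so the following Python sketch only illustrates the general idea under stated assumptions: classic TF-IDF weights are multiplied by a per-document score and by a boost for annotated terms. The names `dt_tf_idf`, `doc_scores`, `annotated`, and `boost`, and the multiplicative form itself, are hypothetical and are not taken from the paper. The `f_measure` helper is the standard harmonic mean of precision and recall used as the evaluation metric.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Classic TF-IDF: tf = count / document length, idf = log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def dt_tf_idf(docs, doc_scores, annotated, boost=2.0):
    """Hypothetical DT-style weighting (NOT the paper's formula).

    doc_scores[i]: assumed reading-behavior score of document i.
    annotated[i]:  assumed set of terms the user annotated in document i.
    boost:         assumed multiplier (> 1) applied to annotated terms.
    """
    weighted = []
    for base, score, marks in zip(tf_idf(docs), doc_scores, annotated):
        weighted.append({
            term: w * score * (boost if term in marks else 1.0)
            for term, w in base.items()
        })
    return weighted


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy corpus: each document is a list of tokens.
docs = [
    ["semantic", "web", "service", "web"],
    ["text", "mining", "web"],
    ["text", "similarity", "mining"],
]
plain = tf_idf(docs)
boosted = dt_tf_idf(
    docs,
    doc_scores=[1.0, 1.5, 0.5],
    annotated=[{"semantic"}, set(), set()],
)
```

In this sketch an annotated term in a document with score 1.0 ends up with `boost` times its plain TF-IDF weight, while unannotated terms are scaled only by the document score. How the document score is actually derived from IPM data is left out, since the abstract does not describe it.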

Key words: Term Frequency-Inverse Document Frequency (TF-IDF), Document Triage, annotation, weighting

CLC number: