Journal of Computer Applications ›› 2005, Vol. 25 ›› Issue (07): 1570-1572. DOI: 10.3724/SP.J.1087.2005.01570

• Database Technology •

Multilevel document clustering algorithm based on association rules

SONG Jiang-chun, SHEN Jun-yi, SONG Qing-bao

  1. School of Electronics and Information Engineering, Xi'an Jiaotong University
  • Received: 2005-02-03 Revised: 2005-04-01 Online: 2005-07-01 Published: 2005-07-01
  • About the authors: SONG Jiang-chun (born 1962), male, from Chengdu, Sichuan; engineer and Ph.D. candidate; research interests: databases and data mining. SHEN Jun-yi (born 1939), male, from Yangzhou, Jiangsu; professor and doctoral supervisor; research interests: database theory, data mining, and workflow. SONG Qing-bao (born 1966), male, from Hua County, Shaanxi; associate professor, Ph.D.; research interests: data warehousing and data mining.
  • Supported by:

    National Natural Science Foundation of China (60173058)

Abstract:

A new multilevel document clustering algorithm based on association rules is proposed. It uses a new document feature extraction method to construct a topic feature vector and a keyword feature vector for each document. First, the algorithm forms initial document clusters in the topic feature vector space using a fast frequent itemset algorithm. Then, in a new feature vector space built from topic keywords, it refines the initial clusters according to inter-cluster distance and link intensity to obtain the final clustering. Because clustering is performed at two levels, both the efficiency and the precision of the algorithm improve substantially. The new feature extraction method also addresses the problem that the document feature vector becomes too high-dimensional when documents contain too many keywords.
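The two-level procedure described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the algorithm's details, so everything here is an assumption made for illustration: the function names are hypothetical, a brute-force itemset enumeration stands in for the "fast" frequent itemset algorithm, inter-cluster distance is taken as one minus the cosine similarity of cluster centroids, and link intensity is taken as the fraction of cross-cluster document pairs whose keyword vectors are similar. It is not the authors' implementation.

```python
from itertools import combinations
from math import sqrt

def frequent_itemsets(docs, min_support, max_size=2):
    """Brute-force enumeration (assumption: stands in for a fast algorithm)."""
    found = {}
    for size in range(1, max_size + 1):
        counts = {}
        for terms in docs.values():
            for combo in combinations(sorted(terms), size):
                counts[combo] = counts.get(combo, 0) + 1
        found.update({frozenset(c): n for c, n in counts.items() if n >= min_support})
    return found

def initial_clusters(docs, min_support):
    """Level 1: group documents by the largest frequent topic itemset they contain."""
    itemsets = frequent_itemsets(docs, min_support)
    clusters = {}
    for doc_id, terms in docs.items():
        covering = [s for s in itemsets if s <= terms]
        if not covering:
            # Documents with no frequent itemset share one leftover group.
            clusters.setdefault(frozenset(), []).append(doc_id)
            continue
        best = max(covering, key=lambda s: (len(s), itemsets[s], sorted(s)))
        clusters.setdefault(best, []).append(doc_id)
    return clusters

def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(doc_ids, vecs):
    total = {}
    for d in doc_ids:
        for k, w in vecs[d].items():
            total[k] = total.get(k, 0.0) + w
    return {k: w / len(doc_ids) for k, w in total.items()}

def link_intensity(a, b, vecs, sim_threshold):
    """Assumed definition: share of cross-cluster pairs that are similar."""
    pairs = [(x, y) for x in a for y in b]
    linked = sum(1 for x, y in pairs if cosine(vecs[x], vecs[y]) >= sim_threshold)
    return linked / len(pairs)

def refine(clusters, vecs, max_distance=0.6, min_link=0.5, sim_threshold=0.4):
    """Level 2: merge initial clusters that are close and well linked in keyword space."""
    groups = [sorted(ids) for ids in clusters.values()]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(groups)), 2):
            dist = 1.0 - cosine(centroid(groups[i], vecs), centroid(groups[j], vecs))
            if dist <= max_distance and link_intensity(groups[i], groups[j], vecs, sim_threshold) >= min_link:
                groups[i] = sorted(groups[i] + groups[j])
                del groups[j]
                merged = True
                break  # restart the scan because the group list changed
    return groups

# Toy data (hypothetical): topic term sets and keyword weight vectors per document.
topics = {
    "d1": {"database", "mining"}, "d2": {"database", "mining"},
    "d3": {"database", "index"},
    "d4": {"network", "routing"}, "d5": {"network", "routing"},
}
keywords = {
    "d1": {"sql": 1.0, "apriori": 1.0}, "d2": {"sql": 1.0, "apriori": 1.0},
    "d3": {"sql": 1.0, "btree": 1.0},
    "d4": {"tcp": 1.0, "ospf": 1.0}, "d5": {"tcp": 1.0, "ospf": 1.0},
}
initial = initial_clusters(topics, min_support=2)
final = refine(initial, keywords)
print(final)
```

On this toy data the first level yields three groups, and the second level merges `d3` into the `d1`/`d2` group because their keyword vectors overlap, while the networking documents stay separate. The thresholds are arbitrary illustrative values, not parameters taken from the paper.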

Key words: document mining, document clustering, association rule, document topic feature vector, document keyword feature vector

CLC Number: