A Multilevel Document Clustering Algorithm Based on Association Rules

doi:10.3724/SP.J.1087.2005.01570

Journal of Computer Applications ›› 2005, Vol. 25 ›› Issue (07): 1570-1572. DOI: 10.3724/SP.J.1087.2005.01570


SONG Jiang-chun, SHEN Jun-yi, SONG Qing-bao

School of Electronics and Information Engineering, Xi'an Jiaotong University

Received: 2005-02-03 Revised: 2005-04-01 Online: 2005-07-01 Published: 2005-07-01
About the authors: SONG Jiang-chun (born 1962), male, from Chengdu, Sichuan; engineer and Ph.D. candidate; main research interests: databases and data mining. SHEN Jun-yi (born 1939), male, from Yangzhou, Jiangsu; professor and doctoral supervisor; main research interests: database theory, data mining, and workflow. SONG Qing-bao (born 1966), male, from Huaxian, Shaanxi; associate professor, Ph.D.; main research interests: data warehousing and data mining.
Supported by:
National Natural Science Foundation of China (60173058)

Abstract:

This paper proposes a new multilevel document clustering algorithm based on association rules. The algorithm uses a new document feature extraction method to construct a topic feature vector and a keyword feature vector for each document. First, in the topic feature vector space, it forms initial document clusters using a fast frequent-itemset mining algorithm. Then, in a new feature vector space built from topic keywords, it refines the initial clusters according to inter-cluster distance and link intensity to obtain the final clustering. The authors report that the two-level approach greatly improves both the efficiency and the precision of the algorithm, and that the new feature extraction method also addresses the problem of document feature vectors becoming excessively high-dimensional when documents contain many keywords.

Key words: document mining, document clustering, association rule, document topic feature vector, document keyword feature vector
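The two-level procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the frequent-itemset algorithm, the distance measure, or the definition of link intensity, so the sketch assumes brute-force counting of topic-term pairs for the first level and greedy merging by cosine distance between keyword centroids for the second level, and it omits link intensity entirely. All function names, parameters (`min_support`, `max_distance`), and sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
import math


def frequent_itemsets(topics, min_support, size=2):
    """Brute-force count of topic-term itemsets (assumed stand-in for a fast miner)."""
    counts = defaultdict(int)
    for terms in topics.values():
        for combo in combinations(sorted(terms), size):
            counts[combo] += 1
    return sorted(c for c, n in counts.items() if n >= min_support)


def initial_clusters(topics, itemsets):
    """Level 1: group documents by the first frequent itemset their topics contain."""
    clusters = defaultdict(set)
    singletons = []
    for doc, terms in topics.items():
        match = next((s for s in itemsets if set(s) <= terms), None)
        if match is None:
            singletons.append({doc})  # no frequent itemset: keep as its own cluster
        else:
            clusters[match].add(doc)
    return list(clusters.values()) + singletons


def centroid(cluster, keywords):
    """Mean keyword vector of a cluster, stored as a sparse dict."""
    total = defaultdict(float)
    for doc in cluster:
        for term, weight in keywords[doc].items():
            total[term] += weight
    return {t: w / len(cluster) for t, w in total.items()}


def cosine_distance(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return 1.0 - dot / norm if norm else 1.0


def refine(clusters, keywords, max_distance):
    """Level 2: greedily merge the closest pair of clusters in keyword space."""
    clusters = [set(c) for c in clusters]
    while len(clusters) > 1:
        dist, i, j = min(
            (cosine_distance(centroid(clusters[a], keywords), centroid(clusters[b], keywords)), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        if dist > max_distance:
            break
        clusters[i] |= clusters[j]  # i < j, so deleting j keeps index i valid
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Hypothetical sample data: topic terms and keyword frequencies per document.
topics = {
    "d1": {"clustering", "mining"}, "d2": {"clustering", "mining"},
    "d3": {"database", "index"}, "d4": {"database", "index"},
    "d5": {"itemset", "rule"}, "d6": {"itemset", "rule"},
}
keywords = {
    "d1": {"cluster": 2, "document": 1}, "d2": {"cluster": 1, "document": 2},
    "d3": {"sql": 2, "index": 1}, "d4": {"sql": 1, "index": 2},
    "d5": {"cluster": 1, "document": 1, "rule": 1},
    "d6": {"cluster": 2, "document": 1, "rule": 1},
}

itemsets = frequent_itemsets(topics, min_support=2)
first_level = initial_clusters(topics, itemsets)
final = refine(first_level, keywords, max_distance=0.3)
print(final)  # [['d1', 'd2', 'd5', 'd6'], ['d3', 'd4']]
```

In this toy example the first level yields three clusters from the short topic vectors, and the second level merges two of them because their keyword centroids are close. This mirrors the division of labor the abstract describes: a cheap pass in a low-dimensional topic space, followed by refinement in the keyword space.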

CLC Number: TP311.11

Cite this article: SONG Jiang-chun, SHEN Jun-yi, SONG Qing-bao. Multilevel document clustering algorithm based on association rules[J]. Journal of Computer Applications, 2005, 25(07): 1570-1572.

