Indirect spectral clustering towards large text datasets

doi:10.3724/SP.J.1087.2012.03274

Journal of Computer Applications, 2012, Vol. 32, Issue (12): 3274-3277.

HOU Hai-xia¹, YUAN Min-min², LIU Chun-xia³

1. Department of Computer Engineering, Taiyuan University, Taiyuan, Shanxi 030032, China
2. Department of Information Engineering, Shanxi Conservancy Technical College, Yuncheng, Shanxi 044000, China
3. College of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan, Shanxi 030024, China

Received: 2012-07-11 Revised: 2012-09-03 Online: 2012-12-29 Published: 2012-12-01
Corresponding author: HOU Hai-xia
About the authors: HOU Hai-xia (1978-), female, born in Yangquan, Shanxi, lecturer, M.S.; research interests: software engineering and algorithm analysis. YUAN Min-min (1977-), male, born in Yuncheng, Shanxi, lecturer, M.S.; research interests: algorithm analysis and computer programming. LIU Chun-xia (1977-), female, born in Datong, Shanxi, lecturer, M.S.; research interests: intelligent computer control and system optimization.
Funding: Shanxi Province Youth Science and Technology Research Fund

Abstract

To alleviate the computational bottleneck of spectral clustering, this paper proposes a fast ensemble algorithm called indirect spectral clustering. The algorithm first uses K-Means to over-cluster the dataset into many overclusters, then treats each overcluster as a basic object, and finally runs standard spectral clustering at the overcluster level to obtain the overall clustering. When this idea is applied to large text datasets, the similarity between overcluster centers can be measured with the commonly used cosine distance. Experiments on the 20 Newsgroups text dataset show that the proposed algorithm is on average 14.72% more accurate than the K-Means algorithm and only 0.88% less accurate than normalized-cut spectral clustering, while on average it needs less than 1/16 of the computation time of normalized-cut spectral clustering (reported as a 16.8-fold saving). As the dataset grows and normalized-cut spectral clustering runs into its computational bottleneck, the proposed algorithm can still quickly give a near-optimal solution.

Key words: spectral clustering, text clustering, large datasets
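
The three steps named in the abstract (over-cluster with K-Means, treat each overcluster as one object, run spectral clustering on the overcluster centers with cosine similarity) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation. All names (`cosine`, `kmeans`, `normalized_cut_split`, `indirect_spectral_bipartition`, `n_over`) are hypothetical. Two simplifications are assumed: the spectral step is reduced to a two-way normalized-cut split computed by power iteration on the second eigenvector, and K-Means is initialized deterministically from evenly spaced points. Documents are assumed to be nonnegative, nonzero term vectors, with at least two overclusters.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def kmeans(points, k, iters=20):
    """Plain Lloyd K-Means; returns (label per point, list of centers)."""
    # Assumption: evenly spaced points as initial centers, to stay deterministic.
    step = max(1, len(points) // k)
    centers = [list(points[i * step]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty overcluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def normalized_cut_split(W, iters=200, seed=0):
    """Two-way split from the second eigenvector of D^-1/2 W D^-1/2."""
    n = len(W)
    s = [math.sqrt(sum(row)) for row in W]  # square roots of the degrees
    M = [[W[i][j] / (s[i] * s[j]) for j in range(n)] for i in range(n)]
    total = math.sqrt(sum(x * x for x in s))
    top = [x / total for x in s]  # leading eigenvector of M (eigenvalue 1)
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        # Iterate with (M + I) / 2 so every eigenvalue is nonnegative.
        y = [0.5 * (sum(M[i][j] * v[j] for j in range(n)) + v[i]) for i in range(n)]
        proj = sum(a * b for a, b in zip(y, top))
        y = [a - proj * b for a, b in zip(y, top)]  # deflate the leading eigenvector
        norm = math.sqrt(sum(a * a for a in y))
        v = [a / norm for a in y]
    return [0 if v[i] / s[i] < 0 else 1 for i in range(n)]


def indirect_spectral_bipartition(docs, n_over):
    """Over-cluster with K-Means, split the centers, map labels back to docs."""
    over_labels, centers = kmeans(docs, n_over)
    W = [[cosine(a, b) for b in centers] for a in centers]
    center_side = normalized_cut_split(W)
    return [center_side[c] for c in over_labels]
```

The point of the construction is that the affinity matrix and its eigenvector are computed over `n_over` overcluster centers rather than over all documents, so the costly spectral step no longer scales with the dataset size. A full k-way version would keep several eigenvectors and cluster the rows of the resulting embedding instead of thresholding a single vector at zero.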

HOU Hai-xia, YUAN Min-min, LIU Chun-xia. Indirect spectral clustering towards large text datasets[J]. Journal of Computer Applications, 2012, 32(12): 3274-3277.
