计算机应用 ›› 2018, Vol. 38 ›› Issue (11): 3132-3138.DOI: 10.11772/j.issn.1001-9081.2018041338

• 第七届中国数据挖掘会议(CCDM 2018) • 上一篇    下一篇

基于模拟退火与贪心策略的平衡聚类算法

唐海波1, 林煜明1, 李优2, 蔡国永1   

  1. 1. 广西可信软件重点实验室(桂林电子科技大学), 广西 桂林 541004;
    2. 广西自动检测技术与仪器重点实验室(桂林电子科技大学), 广西 桂林 541004
  • 收稿日期:2018-04-29 修回日期:2018-06-21 出版日期:2018-11-10 发布日期:2018-11-10
  • 通讯作者: 林煜明
  • 作者简介:唐海波(1993-),男,江苏淮安人,硕士研究生,主要研究方向:分布式数据管理;林煜明(1978-),男,广西合浦人,副研究员,博士,CCF会员,主要研究方向:观点挖掘、海量数据管理;李优(1980-),女,安徽亳州人,副教授,硕士,主要研究方向:文本挖掘;蔡国永(1971-),男,广西河池人,教授,博士,CCF高级会员,主要研究方向:社会媒体挖掘、机器学习、可信软件。
  • 基金资助:
    国家自然科学基金资助项目(61562014,61763007,U1711263);广西自然科学基金资助项目(2015GXNSFAA139303);广西可信软件重点实验室研究课题;广西自动检测技术与仪器重点实验室主任基金立项项目(YQ17111)。

Balanced clustering based on simulated annealing and greedy strategy

TANG Haibo1, LIN Yuming1, LI You2, CAI Guoyong1   

  1. 1. Guangxi Key Laboratory of Trusted Software(Guilin University of Electronic Technology), Guilin Gunangxi 541004, China;
    2. Guangxi Key Laboratory of Automatic Testing and Instrument(Guilin University of Electronic Technology), Guilin Gunangxi 541004, China
  • Received:2018-04-29 Revised:2018-06-21 Online:2018-11-10 Published:2018-11-10
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61562014, 61763007, U1711263), the Natural Science Foundation of Guangxi (2015GXNSFAA139303), the Project of Guangxi Key Laboratory of Trusted Software, the Director Fund Project of Guangxi Key Laboratory of Automatic Testing and Instrument (YQ17111).

摘要: 针对现实应用通常要求聚类的结果相对平衡的问题,提出了一种基于模拟退火与贪心策略的平衡聚类算法(BCSG),该算法包括基于模拟退火的初始点选择算法(SACI)与基于贪心策略的平衡聚类算法(BCGS)2个步骤,以提高平衡聚类算法的聚类效果与时间性能。首先基于模拟退火在数据集中快速定位出K个合适的数据点作为平衡聚类初始点,然后每个中心点分阶段贪婪地将距离其最近的数据点加入簇中直至达到簇规模上限。在6个UCI真实数据集与2个公开图像数据集上进行的聚类对比实验结果表明:在簇数目较大时相比Fuzzy C-Means聚类结果平衡度最高提升了50%以上;聚类结果的准确率相比Balanced K-Means、BCLS两个表现较好的算法平均提高了8个百分点;算法时间复杂度也更低,在较大规模的数据集上运行时间比Balanced K-Means最高减少了近40%。实验结果表明BCSG具有更佳的聚类效果和时间性能。

关键词: 平衡聚类, 贪心算法, 模拟退火, 近似算法, 数据挖掘

Abstract: Concerning the problem that clustering results are usually required to be balanced in practical applications, a Balanced Clustering algorithm based on Simulated annealing and Greedy strategy (BCSG) was proposed. The algorithm includes two steps:Simulated Annealing Clustering Initialization (SACI) and Balanced Clustering based on Greedy Strategy (BCGS) to improve clustering effectiveness with less time cost. First of all, K suitable data points of data set were located based on simulated annealing as the initial point of balanced clustering, and the nearest data points to each center point were added into the cluster where it belongs in stages greedily until the cluster size reach the upper limit. A series of experiments carried on six UCI real datasets and two public image datasets show that the balance degree can be increased by more than 50 percentage points compared with Fuzzy C-Means when the number of clusters is large, and the accuracy of clustering result is increased by 8 percentage points compared with Balanced K-Means and BCLS (Balanced Clustering with Least Square regression) which have good balanced clustering performance. Meanwhile, the time complexity of the BCSG is also lower, the running time is decreased by nearly 40 percentage points on large datasets compared with Balanced K-Means. BCSG has better clustering effectiveness with less time cost than other balanced clustering algorithms.

Key words: balanced clustering, greedy algorithm, simulated annealing, approximate algorithm, data mining

中图分类号: