Journal of Computer Applications ›› 2013, Vol. 33 ›› Issue (08): 2204-2207.

• Database Technology •

Enhanced clustering ensemble algorithm based on characteristics of data sets

HOU Yong1,2, ZHENG Xuefeng1

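  1. School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China;
    2. College of Humanities and Science, Shandong Vocational College of Economics and Business, Weifang, Shandong 261011, China
  • Received: 2013-02-04 Revised: 2013-03-12 Published: 2013-08-01 Online: 2013-09-11
  • Corresponding author: HOU Yong
  • About the authors: HOU Yong (1978-), male, born in Penglai, Shandong; lecturer and Ph.D. candidate; main research interests: data mining, network security, and machine learning;
    ZHENG Xuefeng (1951-), male, born in Fuzhou, Fujian; professor; main research interest: network security.
  • Supported by:

    Shandong Province Enterprise Training and Staff Education Research Project; Weifang Social Science Planning Key Project; Shandong Province Higher Education Humanities and Social Sciences Research Program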

Enhanced clustering ensemble algorithm based on characteristics of data sets

HOU Yong1,2, ZHENG Xuefeng2

  1. College of Humanities and Science, Shandong Vocational College of Economics and Business, Weifang, Shandong 261011, China;
    2. School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
  • Received: 2013-02-04 Revised: 2013-03-12 Online: 2013-09-11 Published: 2013-08-01
  • Contact: HOU Yong

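Abstract: Currently popular clustering ensemble algorithms cannot provide an appropriate processing scheme according to the differing characteristics of different data sets. To address this, a new enhanced clustering ensemble algorithm based on the characteristics of data sets is proposed, consisting of base clusterer generation, base clusterer selection, and a consensus function. Based on the characteristics of the data set, the algorithm uses a heuristic method to select suitable base clusterers, builds the final set of base clusterers, and produces the final clustering result. In the experiments, three benchmark data sets, ecoli, leukaemia and Vehicle, were clustered, and the clustering errors of the proposed algorithm were 0.014, 0.489 and 0.479, respectively. Compared with the Bagging based Structure Ensemble Approach (BSEA), Hybrid Cluster Ensemble (HCE) and Cluster-Oriented Ensemble Classifier (COEC) algorithms, the clustering error of the proposed algorithm was always the lowest; and as candidate base clusterers were added, its Normalized Mutual Information (NMI) values were always higher than those of the compared algorithms. The experimental results show that, compared with these clustering ensemble algorithms, the proposed algorithm has the highest clustering precision and the strongest scalability.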

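Key words: base clusterer, consensus function, clustering ensemble algorithm, clustering error, adaptivity, Normalized Mutual Information (NMI)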

Abstract: Popular clustering ensemble algorithms cannot provide an appropriate processing scheme tailored to the differing characteristics of different data sets. To overcome this defect, a new algorithm, the Enhanced Clustering Ensemble algorithm based on Characteristics of Data sets (ECECD), was proposed. ECECD consists of three parts: generation of base clusterings, selection of base clusterings, and a consensus function. Based on the characteristics of the data set, it selects a particular range of ensemble members to form the final ensemble and produces the final clustering. Three benchmark data sets, ecoli, leukaemia and Vehicle, were clustered in the experiments, and the clustering errors of the proposed algorithm were 0.014, 0.489 and 0.361 respectively, always the lowest when compared with the Bagging based Structure Ensemble Approach (BSEA), Hybrid Cluster Ensemble (HCE) and Cluster-Oriented Ensemble Classifier (COEC) algorithms. The Normalized Mutual Information (NMI) values of the proposed algorithm were also always higher than those of these algorithms as the number of candidate base clusterings increased. The experimental results show that, compared with these clustering ensemble algorithms, the proposed algorithm has the highest clustering precision and the strongest scalability.
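The three-part structure named in the abstract (generation, selection, consensus) can be illustrated with a minimal sketch. It is not ECECD itself: the abstract does not specify the selection heuristic or the consensus function, so the sketch assumes a placeholder selection rule (keep the base clusterings with the highest total NMI to the others) and a simple co-association consensus (link two points when more than half of the selected clusterings put them together). The generation stage is omitted and replaced by hand-written label lists. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log, sqrt


def nmi(a, b):
    """Normalized mutual information of two labelings (geometric-mean normalization)."""
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = sum(c / n * log(c * n / (ca[x] * cb[y])) for (x, y), c in cab.items())
    ha = -sum(c / n * log(c / n) for c in ca.values())
    hb = -sum(c / n * log(c / n) for c in cb.values())
    if ha == 0 or hb == 0:
        # Degenerate case: at least one labeling has a single cluster.
        return 1.0 if ha == hb else 0.0
    return mi / sqrt(ha * hb)


def select(partitions, m):
    """Placeholder selection (an assumption, not the paper's heuristic):
    keep the m partitions that agree most with the rest, measured by total NMI."""
    def score(i):
        return sum(nmi(partitions[i], p) for j, p in enumerate(partitions) if j != i)
    order = sorted(range(len(partitions)), key=score, reverse=True)
    return [partitions[i] for i in order[:m]]


def consensus(partitions, threshold=0.5):
    """Co-association consensus: merge points that share a cluster in more than
    `threshold` of the partitions, then return the connected components as labels."""
    n = len(partitions[0])
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        together = sum(p[i] == p[j] for p in partitions) / len(partitions)
        if together > threshold:
            parent[find(i)] = find(j)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Hand-written base clusterings of six points (stand-ins for the generation stage).
base = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same partition as the first, labels swapped
    [0, 0, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1],  # noisy clustering that disagrees with the others
]
final = consensus(select(base, 3))
print(final)
```

In this toy run the noisy fourth clustering has the lowest agreement score and is dropped before the consensus step, which is the kind of effect a selection stage is meant to have.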

Key words: base clustering, consensus function, clustering ensemble algorithm, clustering error, adaptivity, Normalized Mutual Information (NMI)
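The abstract reports clustering error without defining it. One common definition, assumed here and not confirmed by the source, is the fraction of points that remain misassigned after the predicted clusters are mapped one-to-one onto the true classes in the best possible way. The sketch below implements that definition by brute force over label mappings, which is only practical for a small number of clusters; the function name is hypothetical.

```python
from collections import Counter
from itertools import permutations


def clustering_error(true_labels, pred_labels):
    """Misassignment rate under the best one-to-one cluster-to-class mapping.

    Assumed definition; brute force over mappings, so use only for few clusters.
    """
    n = len(true_labels)
    classes = sorted(set(true_labels))
    clusters = sorted(set(pred_labels))
    overlap = Counter(zip(pred_labels, true_labels))
    # Pad with None so surplus clusters can stay unmatched (they contribute 0).
    targets = classes + [None] * max(0, len(clusters) - len(classes))
    best = 0
    for perm in permutations(targets, len(clusters)):
        matched = sum(overlap[(c, t)] for c, t in zip(clusters, perm))
        best = max(best, matched)
    return 1 - best / n


truth = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 0]  # cluster ids are arbitrary; one point is misplaced
print(clustering_error(truth, pred))
```

For larger numbers of clusters, the same matching is usually solved with the Hungarian algorithm rather than by enumerating permutations.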

CLC Number: