Journal of Computer Applications, 2018, Vol. 38, Issue (2): 458-463. DOI: 10.11772/j.issn.1001-9081.2017071749


Clustering ensemble algorithms based on improved genetic algorithm in cloud computing

XU Zhanyang, ZHENG Kezhang   

  1. School of Computer and Software, Nanjing University of Information Science & Technology, Nanjing Jiangsu 210044, China
  • Received: 2017-07-18 Revised: 2017-09-10 Online: 2018-02-10 Published: 2018-02-10
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61572259).

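  • Corresponding author: ZHENG Kezhang (郑克长)
  • About the authors: XU Zhanyang (徐占洋, born 1975), male, from Guanyun, Jiangsu, associate professor, Ph.D.; main research interests: wireless ad hoc networks and big data mining. ZHENG Kezhang (郑克长, born 1989), male, from Nanjing, Jiangsu, master's student; main research interests: big data mining and data mining.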

Abstract: Unsupervised clustering lacks prior information about data classes, the accuracy of base clusterings depends on the clustering algorithm used, and general clustering ensemble algorithms have high space complexity. To address these problems, a Clustering Ensemble algorithm based on Improved Genetic Algorithm (CEIGA) was proposed. Because traditional clustering ensemble algorithms cannot meet the time requirements of large-scale data processing, a Parallel Clustering Ensemble algorithm based on Improved Genetic Algorithm (PCEIGA), running on Hadoop in a cloud computing environment, was also proposed. Firstly, the base clustering partitions produced by the base clustering generation mechanism were encoded, after cluster label conversion, as the initial population of the improved Genetic Algorithm (GA). Secondly, the selection operator of GA was improved to ensure the diversity of the base clusterings. Crossover and mutation operations were then applied to the chromosomes chosen by the improved selection operator, and the next-generation population was obtained with an elitist strategy to ensure the accuracy of the base clusterings. By iterating in this way, the final clustering ensemble result reached the global optimum and the accuracy of the algorithm was improved. To improve running efficiency, two MapReduce processes were designed and one Combine process was added to reduce communication among nodes. Finally, CEIGA, PCEIGA and four advanced clustering ensemble algorithms were compared on UCI data sets. The experimental results show that CEIGA performs best among the compared clustering ensemble algorithms, and that PCEIGA significantly reduces running time and improves efficiency without decreasing the accuracy of the clustering results.

Key words: cloud computing, Genetic Algorithm (GA), clustering ensemble, selection operator, parallel
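
The pipeline outlined in the abstract (cluster label conversion, encoding base partitions as the initial population, selection, crossover, mutation, elitism) can be illustrated with a minimal sketch. This is not the authors' CEIGA: the abstract does not specify the improved selection operator or the fitness function, so the sketch assumes a plain binary tournament and a hypothetical fitness equal to the mean agreement with the aligned base partitions. The names `relabel`, `fitness` and `evolve`, the greedy label matching, and all parameter values are illustrative assumptions.

```python
import random
from collections import Counter


def relabel(partition, reference, k):
    """Greedily rename the labels of `partition` to best overlap `reference`.

    Assumes both partitions use labels from range(k).
    """
    overlap = Counter(zip(partition, reference))
    mapping, used = {}, set()
    # Visit (source, target) label pairs from largest overlap to smallest.
    for (src, dst), _ in overlap.most_common():
        if src not in mapping and dst not in used:
            mapping[src] = dst
            used.add(dst)
    # Labels left unmatched take the remaining free target labels.
    free = [c for c in range(k) if c not in used]
    for src in sorted(set(partition)):
        if src not in mapping:
            mapping[src] = free.pop(0)
    return [mapping[c] for c in partition]


def fitness(chrom, bases):
    """Assumed fitness: mean fraction of objects labelled as in each base partition."""
    agree = sum(sum(a == b for a, b in zip(chrom, base)) for base in bases)
    return agree / (len(chrom) * len(bases))


def evolve(bases, k, generations=50, p_mut=0.02, seed=0):
    """Generic elitist GA over label vectors (needs >= 2 partitions, >= 2 objects)."""
    rng = random.Random(seed)
    # Cluster label conversion: align every base partition to the first one.
    aligned = [list(bases[0])] + [relabel(b, bases[0], k) for b in bases[1:]]
    population = [list(p) for p in aligned]  # one chromosome per base partition
    n = len(population[0])

    def score(chrom):
        return fitness(chrom, aligned)

    for _ in range(generations):
        # Elitism: the best chromosome survives unchanged.
        children = [list(max(population, key=score))]
        while len(children) < len(population):
            # Plain binary tournament selection (NOT the paper's improved operator).
            p1 = max(rng.sample(population, 2), key=score)
            p2 = max(rng.sample(population, 2), key=score)
            cut = rng.randrange(1, n)  # single-point crossover
            child = p1[:cut] + p2[cut:]
            # Mutation: reassign a gene to a random cluster with probability p_mut.
            child = [rng.randrange(k) if rng.random() < p_mut else g for g in child]
            children.append(child)
        population = children
    return max(population, key=score)


if __name__ == "__main__":
    base_partitions = [
        [0, 0, 0, 1, 1, 1],
        [1, 1, 1, 0, 0, 0],  # same grouping as the first, labels swapped
        [0, 0, 1, 1, 1, 1],
        [0, 0, 0, 1, 1, 0],
    ]
    print(evolve(base_partitions, k=2))
```

Because the elite chromosome is copied into every new generation, the best fitness in the population never decreases; whether the search reaches a global optimum depends on the fitness function and operators, which this sketch only approximates.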

