Parallel optimization sampling clustering K-means algorithm for big data processing

doi:10.11772/j.issn.1001-9081.2016.02.0311

Journal of Computer Applications, 2016, Vol. 36, Issue 2: 311-315. DOI: 10.11772/j.issn.1001-9081.2016.02.0311

The 3rd CCF Big Data Academic Conference (CCF BigData 2015)

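ZHOU Runwu, LI Zhiyong, CHEN Shaomiao, CHEN Jing, LI Renfa

College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, China

Received: 2015-08-29 Revised: 2015-09-14 Online: 2016-02-10 Published: 2016-02-03
Corresponding author: LI Zhiyong (born 1971), male, from Changsha, Hunan, professor, Ph.D.; research interests: cloud computing, big data analysis, pattern recognition, machine vision.
Author biographies: ZHOU Runwu (born 1992), male, from Xianning, Hubei, Ph.D. candidate; research interests: artificial intelligence, data mining. CHEN Shaomiao (born 1989), male, from Shaoyang, Hunan, Ph.D. candidate; research interests: multi-objective optimization, intelligent computing. CHEN Jing (born 1992), male, from Huaihua, Hunan, M.S. candidate; research interests: cloud computing, big data learning. LI Renfa (born 1957), male, from Chenzhou, Hunan, professor, doctoral supervisor, Ph.D.; research interests: high-performance embedded computing.
Funding:
Supported by the National Natural Science Foundation of China (61173107) and the National High Technology Research and Development Program (863 Program) of China (2012AA01A301-01).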

Abstract

Abstract: To address the low clustering accuracy and slow convergence of the K-means clustering algorithm in big data environments, an improved K-means algorithm based on optimized sampling clustering, named OSCK (Optimization Sampling Clustering K-means), is proposed. First, multiple samples are drawn from the massive data set by probability sampling. Second, based on the Euclidean distance similarity principle of optimal cluster centers, the clustering results of the samples are modeled and evaluated, and the suboptimal sample clustering results are removed. Finally, the retained clustering results are integrated by weighting to obtain the final k cluster centers, which are then used as the cluster centers of the full data set. Theoretical analysis and experimental results show that, for massive data analysis, OSCK achieves better clustering accuracy than the comparison algorithms and has strong robustness and scalability.

Key words: big data, K-means, probability sampling, Euclidean distance, clustering accuracy
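The three steps named in the abstract (sample, evaluate and filter, weighted merge) can be illustrated with a minimal Python sketch. The abstract does not give the sampling distribution, the evaluation model, or the weights, so everything concrete here is an assumption made for illustration: simple random sampling without replacement, a score equal to the summed Euclidean distance of a run's centers to the mean of the aligned centers across runs, a `keep` fraction of 0.5, and cluster sizes as merge weights. The names `osck_sketch`, `kmeans`, `align`, and `euclid` are hypothetical and do not come from the paper.

```python
import math
import random


def euclid(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's K-means on one sample; returns (centers, cluster sizes)."""
    centers = rng.sample(points, k)  # random initial centers (assumption)
    sizes = [0] * k
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclid(p, centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        sizes = [len(g) for g in groups]
    return centers, sizes


def align(centers, sizes, reference):
    """Reorder one run so index i holds the unused center nearest reference[i]."""
    free = list(range(len(centers)))
    order = []
    for ref in reference:
        j = min(free, key=lambda i: euclid(centers[i], ref))
        free.remove(j)
        order.append(j)
    return [centers[j] for j in order], [sizes[j] for j in order]


def osck_sketch(data, k, n_samples=8, sample_size=200, keep=0.5, seed=0):
    """Sample, cluster each sample, drop the worst runs, merge by weighting."""
    rng = random.Random(seed)
    size = min(sample_size, len(data))
    # Step 1: cluster several samples (independent runs; sequential here).
    runs = [kmeans(rng.sample(data, size), k, rng) for _ in range(n_samples)]
    reference = runs[0][0]
    runs = [align(c, s, reference) for c, s in runs]
    dims = range(len(reference[0]))
    # Step 2 (assumed evaluation model): score each run by its distance to
    # the mean of the aligned centers, then keep the best-scoring fraction.
    consensus = [
        tuple(sum(c[i][d] for c, _ in runs) / len(runs) for d in dims)
        for i in range(k)
    ]

    def score(run):
        return sum(euclid(a, b) for a, b in zip(run[0], consensus))

    kept = sorted(runs, key=score)[: max(1, int(keep * len(runs)))]
    # Step 3 (assumed weights): average kept centers, weighted by cluster size.
    final = []
    for i in range(k):
        weights = [max(s[i], 1) for _, s in kept]
        total = sum(weights)
        final.append(tuple(
            sum(w * c[i][d] for w, (c, _) in zip(weights, kept)) / total
            for d in dims
        ))
    return final
```

Calling `osck_sketch(data, k=3)` on a list of equal-length numeric tuples returns `k` center tuples. Because each sample is clustered independently, the first step is the part that could be distributed across workers; this sketch runs it in a single process and makes no claim about how the paper parallelizes it.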

CLC number: TP391

ZHOU Runwu, LI Zhiyong, CHEN Shaomiao, CHEN Jing, LI Renfa. Parallel optimization sampling clustering K-means algorithm for big data processing[J]. Journal of Computer Applications, 2016, 36(2): 311-315.

