Parallel optimization sampling clustering K-means algorithm for big data processing

doi:10.11772/j.issn.1001-9081.2016.02.0311

Journal of Computer Applications, 2016, Vol. 36, Issue (2): 311-315. DOI: 10.11772/j.issn.1001-9081.2016.02.0311


ZHOU Runwu, LI Zhiyong, CHEN Shaomiao, CHEN Jing, LI Renfa

College of Computer Science and Electronic Engineering, Hunan University, Changsha Hunan 410082, China

Received: 2015-08-29; Revised: 2015-09-14; Online: 2016-02-10; Published: 2016-02-03

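College of Information Science and Engineering, Hunan University, Changsha 410082

Corresponding author: LI Zhiyong (born 1971), male, from Changsha, Hunan; professor, Ph.D.; research interests: cloud computing, big data analysis, pattern recognition, machine vision.
About the authors: ZHOU Runwu (born 1992), male, from Xianning, Hubei; Ph.D. candidate; research interests: artificial intelligence, data mining. CHEN Shaomiao (born 1989), male, from Shaoyang, Hunan; Ph.D. candidate; research interests: multi-objective optimization, intelligent computing. CHEN Jing (born 1992), male, from Huaihua, Hunan; master's student; research interests: cloud computing, big data learning. LI Renfa (born 1957), male, from Chenzhou, Hunan; professor, doctoral supervisor, Ph.D.; research interests: high-performance embedded computing.
Supported by:
National Natural Science Foundation of China (61173107); National 863 Program of China (2012AA01A301-01).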

Abstract

Abstract: To address the low accuracy and slow convergence of the K-means clustering algorithm in big data settings, an improved K-means algorithm based on optimized sample clustering, named OSCK (Optimization Sampling Clustering K-means), is proposed. First, multiple samples are drawn from the mass data by probability sampling. Second, based on the Euclidean-distance similarity principle of optimal cluster centers, the sample clustering results are modeled and evaluated, and the sub-optimal results are removed. Finally, the remaining clustering results are combined by weighted integration to obtain the final k cluster centers, which are then used as the cluster centers of the full big data set. Theoretical analysis and experimental results show that, for mass data analysis, OSCK achieves better clustering accuracy than the comparison algorithms and has strong robustness and scalability.

Key words: big data, K-means, probability sampling, Euclidean distance, clustering accuracy
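
The three stages described in the abstract can be illustrated with a minimal serial sketch. It is not the authors' implementation: the abstract does not give the sampling distribution, the evaluation model, the weighting scheme, or the parallel (e.g. MapReduce) layout, so the sketch assumes uniform sampling without replacement, mean squared Euclidean distance as the quality score, keeping the better half of the sample results, inverse-error weights, and greedy matching of centers between results. All function names and parameters (kmeans, align, osck_sketch, n_samples, sample_size, keep) are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's K-means; returns (centers, mean squared distance)."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        updated = [mean_point(g) if g else centers[j] for j, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    error = sum(min(sq_dist(p, c) for c in centers) for p in points) / len(points)
    return centers, error


def align(reference, centers):
    """Greedily reorder `centers` so entry j is matched to reference[j]."""
    remaining = list(centers)
    ordered = []
    for ref in reference:
        best = min(remaining, key=lambda c: sq_dist(c, ref))
        remaining.remove(best)
        ordered.append(best)
    return ordered


def osck_sketch(data, k, n_samples=8, sample_size=500, keep=0.5, seed=0):
    """Sample, cluster each sample, drop the worst results, merge the rest."""
    rng = random.Random(seed)
    size = min(sample_size, len(data))
    # Stage 1: draw several uniform random samples (assumed scheme) and
    # cluster each one independently.
    results = [kmeans(rng.sample(data, size), k, rng) for _ in range(n_samples)]
    # Stage 2: rank results by mean squared Euclidean distance (assumed score)
    # and discard the sub-optimal ones.
    results.sort(key=lambda r: r[1])
    kept = results[: max(1, int(n_samples * keep))]
    # Stage 3: weighted average of matched centers; lower error, higher weight.
    weights = [1.0 / (err + 1e-12) for _, err in kept]
    total = sum(weights)
    reference = kept[0][0]
    aligned = [align(reference, centers) for centers, _ in kept]
    dim = len(reference[0])
    return [
        tuple(
            sum(w * cs[j][d] for w, cs in zip(weights, aligned)) / total
            for d in range(dim)
        )
        for j in range(k)
    ]
```

Here data is a list of equal-length numeric tuples, and osck_sketch(data, k) returns k centers as tuples. Each per-sample K-means run is independent of the others, which is the part that a parallel version could distribute across workers.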


CLC Number: TP391

ZHOU Runwu, LI Zhiyong, CHEN Shaomiao, CHEN Jing, LI Renfa. Parallel optimization sampling clustering K-means algorithm for big data processing[J]. Journal of Computer Applications, 2016, 36(2): 311-315.

