Probability model-based algorithm for non-uniform data clustering

Journal of Computer Applications, 2018, Vol. 38, Issue (10): 2844-2849. DOI: 10.11772/j.issn.1001-9081.2018020375

YANG Tianpeng¹, CHEN Lifei¹,²

1. College of Mathematics and Informatics, Fujian Normal University, Fuzhou Fujian 350117, China;
2. Digital Fujian Internet-of-Things Laboratory of Environmental Monitoring, Fujian Normal University, Fuzhou Fujian 350117, China

Received: 2018-02-12; Revised: 2018-03-31; Online: 2018-10-10; Published: 2018-10-13
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61672157) and the Innovation Team Project of Fujian Normal University (IRTL1704).

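Corresponding author: CHEN Lifei
About the authors: YANG Tianpeng (born 1991), male, from Shiyan, Hubei, M.S. candidate; main research interest: data mining. CHEN Lifei (born 1972), male, from Changle, Fujian, professor, Ph.D.; main research interests: statistical machine learning, data mining, and pattern recognition.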

Abstract

Abstract: To address the "uniform effect" of traditional K-means-type algorithms, a new probability model-based algorithm was proposed for clustering non-uniform data. First, a Gaussian mixture distribution model was proposed to describe the clusters hidden in non-uniform data; the model allows a dataset to contain clusters of different densities and sizes at the same time. Second, the objective function for non-uniform data clustering was derived from the model, and an EM (Expectation Maximization)-type clustering algorithm was defined to optimize it. Theoretical analysis shows that the new algorithm is able to perform soft subspace clustering on non-uniform data. Finally, experimental results on synthetic and real datasets demonstrate that the proposed algorithm improves clustering accuracy by 5% to 50% compared with existing K-means-type algorithms and under-sampling-based algorithms.

Key words: clustering, probability model, non-uniform data, uniform effect
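
The "uniform effect" named in the abstract can be illustrated with a small sketch. The code below is not the authors' algorithm: it does not reproduce the paper's objective function or its soft subspace behaviour. It only compares plain K-means with a textbook one-dimensional Gaussian mixture fitted by EM, to show why per-cluster mixing weights and variances help when clusters differ in size and density. The synthetic data (450 sparse points and 50 dense points), the initial centres, and every function name are assumptions made for illustration.

```python
import math
import random


def kmeans_1d(xs, centers, iters=100):
    """Plain K-means (Lloyd's iterations) on 1-D data."""
    centers = list(centers)
    labels = [0] * len(xs)
    for _ in range(iters):
        # Assign each point to its nearest centre.
        labels = [min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
                  for x in xs]
        # Move each centre to the mean of its members.
        for j in range(len(centers)):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers, labels


def gmm_em_1d(xs, means, iters=300, min_var=1e-6):
    """EM for a 1-D Gaussian mixture with per-cluster weight and variance."""
    n, k = len(xs), len(means)
    means = list(means)
    grand_mean = sum(xs) / n
    global_var = sum((x - grand_mean) ** 2 for x in xs) / n
    variances = [global_var] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each cluster for each point,
        # computed in log space for numerical stability.
        resp = []
        for x in xs:
            log_p = [math.log(weights[j])
                     - 0.5 * math.log(2.0 * math.pi * variances[j])
                     - (x - means[j]) ** 2 / (2.0 * variances[j])
                     for j in range(k)]
            top = max(log_p)
            p = [math.exp(v - top) for v in log_p]
            total = sum(p)
            resp.append([v / total for v in p])
        # M-step: re-estimate weight (size), mean, and variance (density).
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = max(
                min_var,
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj)
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels


def run_demo(seed=0):
    """Compare both methods on one large sparse and one small dense cluster."""
    rng = random.Random(seed)
    big = [rng.gauss(0.0, 3.0) for _ in range(450)]     # large, sparse
    small = [rng.gauss(10.0, 0.3) for _ in range(50)]   # small, dense
    xs = big + small
    truth = [0] * len(big) + [1] * len(small)
    init = [0.0, 10.0]  # same hypothetical starting centres for both methods

    def accuracy(labels):
        return sum(a == b for a, b in zip(labels, truth)) / len(truth)

    _, km_labels = kmeans_1d(xs, init)
    weights, _, _, gmm_labels = gmm_em_1d(xs, init)
    return accuracy(km_labels), accuracy(gmm_labels), weights


if __name__ == "__main__":
    acc_km, acc_gmm, weights = run_demo()
    print("K-means accuracy:", round(acc_km, 3))
    print("GMM-EM accuracy: ", round(acc_gmm, 3))
    print("estimated mixing weights:", [round(w, 3) for w in weights])
```

In this setup K-means tends to pull the right tail of the large cluster into the small one, because it compares squared distances only. The mixture model can instead assign a small weight and a small variance to the dense cluster, which is the kind of flexibility the abstract attributes to its Gaussian mixture formulation.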

CLC Number: TP311

YANG Tianpeng, CHEN Lifei. Probability model-based algorithm for non-uniform data clustering[J]. Journal of Computer Applications, 2018, 38(10): 2844-2849.
