Probability model-based algorithm for non-uniform data clustering

Journal of Computer Applications, 2018, Vol. 38, Issue (10): 2844-2849. DOI: 10.11772/j.issn.1001-9081.2018020375

YANG Tianpeng¹, CHEN Lifei¹,²

1. College of Mathematics and Informatics, Fujian Normal University, Fuzhou, Fujian 350117, China;
2. Digital Fujian Internet-of-Things Laboratory of Environmental Monitoring, Fujian Normal University, Fuzhou, Fujian 350117, China

Received: 2018-02-12  Revised: 2018-03-31  Online: 2018-10-10  Published: 2018-10-13
Corresponding author: CHEN Lifei
About the authors: YANG Tianpeng (born 1991), male, from Shiyan, Hubei, is a master's student whose main research interest is data mining. CHEN Lifei (born 1972), male, from Changle, Fujian, is a professor with a Ph.D. whose main research interests are statistical machine learning, data mining, and pattern recognition.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61672157) and the Innovation Team Project of Fujian Normal University (IRTL1704).

Abstract

To address the "uniform effect" of traditional K-means-type algorithms, a probability model-based clustering algorithm is proposed for non-uniform data. First, a Gaussian mixture distribution model is proposed to describe the clusters in non-uniform data; the model allows a dataset to contain clusters that differ in both density and size. Second, the objective function for non-uniform data clustering is derived from the model, and an Expectation Maximization (EM)-type clustering algorithm is defined to optimize it. Theoretical analysis shows that the proposed algorithm can perform soft subspace clustering on non-uniform data. Finally, experimental results on synthetic and real datasets demonstrate that the proposed algorithm achieves high clustering accuracy, improving accuracy by 5% to 50% over existing K-means-type algorithms and under-sampling-based algorithms.

Key words: clustering, probability model, non-uniform data, uniform effect
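The abstract describes fitting a Gaussian mixture with an EM-type procedure so that clusters of different size and density can coexist. The following is a minimal sketch of that general idea only: it is standard textbook EM for a mixture with diagonal covariances, not the objective function or update rules derived in the paper. The names `fit_diag_gmm` and `make_demo_data`, the synthetic data, and the choice to seed one component in each cluster are all assumptions made for illustration.

```python
import math
import random


def fit_diag_gmm(points, init_means, n_iter=50, var_floor=1e-6):
    """Fit a Gaussian mixture with diagonal covariances by plain EM.

    Illustrative textbook EM (hypothetical helper, not the paper's algorithm).
    Returns (weights, means, variances), one entry per component.
    """
    n, d, k = len(points), len(points[0]), len(init_means)
    means = [list(m) for m in init_means]
    variances = [[1.0] * d for _ in range(k)]
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point,
        # computed in log space for numerical stability.
        resp = []
        for x in points:
            logp = []
            for j in range(k):
                ll = math.log(weights[j])
                for t in range(d):
                    var = variances[j][t]
                    ll -= 0.5 * (math.log(2 * math.pi * var)
                                 + (x[t] - means[j][t]) ** 2 / var)
                logp.append(ll)
            top = max(logp)
            unnorm = [math.exp(v - top) for v in logp]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: re-estimate mixing weight, mean and per-dimension variance.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12  # guard against empty components
            weights[j] = nj / (n + k * 1e-12)
            for t in range(d):
                mu = sum(r[j] * x[t] for r, x in zip(resp, points)) / nj
                var = sum(r[j] * (x[t] - mu) ** 2
                          for r, x in zip(resp, points)) / nj
                means[j][t] = mu
                variances[j][t] = max(var, var_floor)
    return weights, means, variances


def make_demo_data(seed=0):
    """Synthetic non-uniform data: a large sparse cluster and a small dense one."""
    rng = random.Random(seed)
    big = [(rng.gauss(0, 2.0), rng.gauss(0, 2.0)) for _ in range(300)]
    small = [(rng.gauss(10, 0.3), rng.gauss(10, 0.3)) for _ in range(30)]
    return big + small


data = make_demo_data()
# Assumption for the demo: seed one component in each cluster (first and last point).
weights, means, variances = fit_diag_gmm(data, [data[0], data[-1]])
print("weights:", [round(w, 3) for w in weights])
print("variances:", [[round(v, 3) for v in row] for row in variances])
```

Because each component carries its own mixing weight and its own per-dimension variances, the fitted model can represent a small dense cluster next to a large sparse one, whereas a plain K-means objective implicitly treats clusters as similar in size and spread. The per-dimension variances are also the kind of quantity a soft subspace method can use to weight attributes, although how the paper does this is not specified in the abstract.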

CLC Number: TP311

YANG Tianpeng, CHEN Lifei. Probability model-based algorithm for non-uniform data clustering[J]. Journal of Computer Applications, 2018, 38(10): 2844-2849.


