Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (10): 2844-2849. DOI: 10.11772/j.issn.1001-9081.2018020375

Probability model-based algorithm for non-uniform data clustering

YANG Tianpeng¹, CHEN Lifei¹,²

  1. College of Mathematics and Informatics, Fujian Normal University, Fuzhou, Fujian 350117, China;
  2. Digital Fujian Internet-of-Things Laboratory of Environmental Monitoring, Fujian Normal University, Fuzhou, Fujian 350117, China
  • Received: 2018-02-12  Revised: 2018-03-31  Online: 2018-10-10  Published: 2018-10-13
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61672157) and the Innovation Team Project of Fujian Normal University (IRTL1704).

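  • Corresponding author: CHEN Lifei
  • About the authors: YANG Tianpeng (born 1991), male, from Shiyan, Hubei, M.S. candidate; main research interest: data mining. CHEN Lifei (born 1972), male, from Changle, Fujian, professor, Ph.D.; main research interests: statistical machine learning, data mining, and pattern recognition.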

Abstract: To address the "uniform effect" of the traditional K-means algorithm, that is, its tendency to produce clusters of similar size even when the true clusters differ, a new probability model-based algorithm was proposed for non-uniform data clustering. Firstly, a Gaussian mixture distribution model was proposed to describe the clusters hidden within non-uniform data, allowing a dataset to contain clusters of different densities and sizes at the same time. Secondly, the objective function for non-uniform data clustering was derived from the model, and an EM (Expectation Maximization)-type clustering algorithm was defined to optimize it. Theoretical analysis shows that the new algorithm is able to perform soft subspace clustering on non-uniform data. Finally, experimental results on synthetic and real datasets demonstrate that the proposed algorithm improves clustering accuracy by 5% to 50% compared with existing K-means-type algorithms and under-sampling algorithms.

Key words: clustering, probability model, non-uniform data, uniform effect
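
The contrast drawn in the abstract, between K-means and a mixture model that allows clusters of different sizes and densities, can be illustrated with a minimal sketch. This is not the algorithm proposed in the paper: it is a standard two-component, one-dimensional Gaussian mixture fitted by EM, with no subspace (feature) weighting. The synthetic data, the function names (`make_data`, `kmeans_1d`, `gmm_em_1d`, `accuracy`), and all parameter values are assumptions made for illustration only.

```python
import math
import random


def make_data(seed=0):
    """Hypothetical 1-D data: a large, spread-out cluster and a small, tight one."""
    rng = random.Random(seed)
    big = [rng.gauss(0.0, 3.0) for _ in range(900)]
    small = [rng.gauss(10.0, 0.5) for _ in range(100)]
    return big + small, [0] * 900 + [1] * 100


def kmeans_1d(xs, iters=100):
    """Plain 2-means in 1-D, initialised at the data extremes."""
    centers = [min(xs), max(xs)]
    labels = []
    for _ in range(iters):
        # Assign each point to the nearer centre, then recompute the centres.
        labels = [0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1 for x in xs]
        for k in (0, 1):
            members = [x for x, lab in zip(xs, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gmm_em_1d(xs, init_labels, iters=100, var_floor=1e-6):
    """Standard EM for a 2-component 1-D Gaussian mixture (not the paper's method).

    Each component has its own mixing weight (cluster size) and variance
    (cluster density), initialised from a hard labelling.
    """
    n = len(xs)
    weights, means, variances = [], [], []
    for k in (0, 1):
        members = [x for x, lab in zip(xs, init_labels) if lab == k]
        m = sum(members) / len(members)
        weights.append(len(members) / n)
        means.append(m)
        variances.append(max(sum((x - m) ** 2 for x in members) / len(members), var_floor))
    log_liks, resp = [], []
    for _ in range(iters):
        # E-step: posterior probability of each component for each point.
        resp, ll = [], 0.0
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in (0, 1)]
            total = p[0] + p[1]
            ll += math.log(total)
            resp.append((p[0] / total, p[1] / total))
        log_liks.append(ll)
        # M-step: re-estimate weight, mean and variance of each component.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, var_floor)
            weights[k] = nk / n
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return weights, means, variances, labels, log_liks


def accuracy(pred, truth):
    """Clustering accuracy for two clusters, invariant to label swapping."""
    acc = sum(p == t for p, t in zip(pred, truth)) / len(truth)
    return max(acc, 1.0 - acc)


xs, truth = make_data()
km_labels = kmeans_1d(xs)
weights, means, variances, em_labels, log_liks = gmm_em_1d(xs, km_labels)
print("K-means accuracy:", accuracy(km_labels, truth))
print("GMM-EM accuracy: ", accuracy(em_labels, truth))
print("mixing weights:  ", weights)
```

On data of this kind, K-means tends to place its boundary inside the tail of the large cluster, because it implicitly assumes clusters of equal weight and spread. The mixture model can in principle recover unequal mixing weights and variances, although EM only guarantees convergence to a local optimum, so the outcome depends on initialisation.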

