Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (10): 2844-2849. DOI: 10.11772/j.issn.1001-9081.2018020375

• Data Science and Technology •

Probability model-based algorithm for non-uniform data clustering

YANG Tianpeng1, CHEN Lifei1,2

  1. College of Mathematics and Informatics, Fujian Normal University, Fuzhou Fujian 350117, China;
  2. Digital Fujian Internet-of-Things Laboratory of Environmental Monitoring, Fujian Normal University, Fuzhou Fujian 350117, China
  • Received:2018-02-12 Revised:2018-03-31 Online:2018-10-10 Published:2018-10-13
  • Corresponding author: CHEN Lifei
  • About the authors: YANG Tianpeng (born 1991), male, from Shiyan, Hubei, master's student; main research interest: data mining. CHEN Lifei (born 1972), male, from Changle, Fujian, professor, Ph.D.; main research interests: statistical machine learning, data mining, and pattern recognition.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61672157) and the Innovation Team Project of Fujian Normal University (IRTL1704).

Abstract: To address the "uniform effect" of traditional K-means-type algorithms, a new probability model-based algorithm was proposed for non-uniform data clustering. Firstly, a Gaussian mixture distribution model was proposed to describe the clusters hidden within non-uniform data, allowing a dataset to contain clusters that differ in both density and size. Secondly, the objective optimization function for non-uniform data clustering was derived from the model, and an Expectation Maximization (EM)-type clustering algorithm was defined to optimize it. Theoretical analysis shows that the new algorithm is able to perform soft subspace clustering on non-uniform data. Finally, experimental results on synthetic and real datasets demonstrate that the proposed algorithm achieves high clustering accuracy, improving accuracy by 5% to 50% over existing K-means-type algorithms and under-sampling-based algorithms.

Key words: clustering, probability model, non-uniform data, uniform effect

CLC Number:
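To make the setting described in the abstract concrete, the sketch below fits a plain Gaussian mixture with diagonal covariances by EM to one large, diffuse cluster and one small, dense cluster. It is a generic textbook stand-in and not the model, objective function, or algorithm of the paper, none of which the abstract spells out. The function names `log_gauss_diag` and `fit_diag_gmm`, the deterministic initialisation, the variance floor, and the synthetic data are all assumptions made for illustration.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at point x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def fit_diag_gmm(points, k, iters=50, var_floor=1e-6):
    """Fit a k-component diagonal Gaussian mixture with plain EM.

    Returns (weights, means, variances, responsibilities, log-likelihood trace).
    The responsibilities come from the last E-step that was run.
    """
    n, d = len(points), len(points[0])
    # Assumed initialisation: means spread evenly over the input order,
    # every component starts with the global per-dimension variance.
    means = [list(points[(j * (n - 1)) // max(k - 1, 1)]) for j in range(k)]
    center = [sum(p[t] for p in points) / n for t in range(d)]
    global_var = [
        max(sum((p[t] - center[t]) ** 2 for p in points) / n, var_floor)
        for t in range(d)
    ]
    variances = [list(global_var) for _ in range(k)]
    weights = [1.0 / k] * k
    resp, trace = [], []
    for _ in range(iters):
        # E-step: soft assignments, computed with log-sum-exp for stability.
        resp, loglik = [], 0.0
        for x in points:
            logs = [
                math.log(weights[j]) + log_gauss_diag(x, means[j], variances[j])
                for j in range(k)
            ]
            top = max(logs)
            norm = top + math.log(sum(math.exp(v - top) for v in logs))
            resp.append([math.exp(v - norm) for v in logs])
            loglik += norm
        trace.append(loglik)
        # M-step: re-estimate cluster size (weight), centre, and spread.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = [
                sum(r[j] * x[t] for r, x in zip(resp, points)) / nj
                for t in range(d)
            ]
            variances[j] = [
                max(
                    sum(r[j] * (x[t] - means[j][t]) ** 2
                        for r, x in zip(resp, points)) / nj,
                    var_floor,
                )
                for t in range(d)
            ]
    return weights, means, variances, resp, trace


# Synthetic non-uniform data (hypothetical): 300 diffuse points, 30 dense points.
rng = random.Random(0)
big = [(rng.gauss(0.0, 2.0), rng.gauss(0.0, 2.0)) for _ in range(300)]
small = [(rng.gauss(6.0, 0.5), rng.gauss(6.0, 0.5)) for _ in range(30)]
weights, means, variances, resp, trace = fit_diag_gmm(big + small, k=2)
print("mixing weights:", [round(w, 3) for w in weights])
print("per-dimension variances:", [[round(v, 3) for v in vs] for vs in variances])
```

In this sketch the per-component mixing weights let the fitted clusters differ in size, and the per-dimension variances let them differ in density, which is the kind of flexibility a distance-only K-means objective lacks. How the paper itself weights dimensions for soft subspace clustering is not stated in the abstract, so that part is not reproduced here.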