Journal of Computer Applications (计算机应用), 2017, Vol. 37, Issue (4): 1026-1031. DOI: 10.11772/j.issn.1001-9081.2017.04.1026

• Data Science and Technology •

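Bayesian clustering algorithm for categorical data

ZHU Jie1, CHEN Lifei2

  1. Southwest China Institute of Electronic Technology, Chengdu 610036;
  2. School of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350117
  • Received: 2016-09-12  Revised: 2016-12-23  Online: 2017-04-10  Published: 2017-04-19
  • Corresponding author: CHEN Lifei
  • About the authors: ZHU Jie (born 1971), male, from Yuyao, Zhejiang; senior engineer; main research interests: pattern recognition and target recognition. CHEN Lifei (born 1972), male, from Changle, Fujian; professor, Ph.D.; main research interests: statistical machine learning, data mining, and pattern recognition.
  • Supported by:
    National Natural Science Foundation of China (61175123); Natural Science Foundation of Fujian Province (2015J01238).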

Bayesian clustering algorithm for categorical data

ZHU Jie1, CHEN Lifei2   

  1. Southwest China Institute of Electronic Technology, Chengdu, Sichuan 610036, China;
  2. School of Mathematics and Computer Science, Fujian Normal University, Fuzhou, Fujian 350117, China
  • Received: 2016-09-12  Revised: 2016-12-23  Online: 2017-04-10  Published: 2017-04-19
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61175123) and the Natural Science Foundation of Fujian Province (2015J01238).

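Abstract: To address the difficulty of defining a distance function between objects in categorical data clustering, a categorical data clustering algorithm based on Bayesian probability estimation is proposed. First, an attribute-weighted probability model is proposed, in which each categorical attribute is assigned a weight reflecting its importance. Second, through a transformation based on Bayes' formula, a clustering objective function based on maximum likelihood estimation is defined, and a partition-based clustering algorithm is proposed; the algorithm no longer depends on distances between objects, but instead clusters according to the weighted likelihood between objects and the partitions of the dataset. Third, an expression for computing the attribute weights is derived, leading to the conclusion that the weight of a categorical attribute is inversely proportional to the information entropy of its symbol distribution. Experiments were conducted on real-world and synthetic datasets. The results show that, compared with existing distance-based clustering algorithms, the proposed algorithm improves clustering accuracy, with improvements of 5% to 48% on bioinformatics data in particular, and yields attribute-weighting results that are meaningful in practice.

Keywords: data clustering, categorical attribute, attribute weighting, Bayesian clustering, probability model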

Abstract: To address the difficulty of defining a meaningful distance measure for categorical data clustering, a new categorical data clustering algorithm based on Bayesian probability estimation was proposed. First, a probability model with automatic attribute weighting was proposed, in which each categorical attribute is assigned an individual weight indicating its importance for clustering. Second, a clustering objective function was derived using maximum likelihood estimation and a transformation based on Bayes' formula, and a partitioning algorithm was proposed to optimize it; the algorithm groups data according to the weighted likelihood between objects and clusters rather than pairwise distances. Third, an expression for estimating the attribute weights was derived, indicating that the weight of an attribute should be inversely proportional to the entropy of its category distribution. Experiments were conducted on several real datasets and a synthetic dataset. The results show that the proposed algorithm yields higher clustering accuracy than existing distance-based algorithms, achieving improvements of 5% to 48% on the bioinformatics data, and produces meaningful attribute-weighting results for the categorical attributes.

Key words: data clustering, categorical attribute, attribute weighting, Bayesian clustering, probability model
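
The abstract states that the weight of a categorical attribute is inversely proportional to the entropy of its category distribution. The sketch below illustrates that relationship only; it is not the weight expression derived in the paper, which is not given on this page. The function names, the normalization to a sum of 1, the smoothing constant `eps`, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter


def category_entropy(values):
    """Shannon entropy (natural log) of the empirical category distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def inverse_entropy_weights(rows, eps=1e-9):
    """Hypothetical weighting: one weight per attribute, proportional to
    1 / (entropy + eps) and normalized to sum to 1.

    `eps` is an assumed smoothing term that avoids division by zero when an
    attribute takes a single category in the group.
    """
    columns = list(zip(*rows))  # one tuple of values per attribute
    raw = [1.0 / (category_entropy(col) + eps) for col in columns]
    total = sum(raw)
    return [r / total for r in raw]


# Toy group of objects with two categorical attributes (made-up data).
# Attribute 0 is concentrated on "A"; attribute 1 is evenly split.
rows = [
    ("A", "x"),
    ("A", "y"),
    ("A", "x"),
    ("C", "y"),
]
weights = inverse_entropy_weights(rows)
print(weights)
```

In this toy group the first attribute has the lower entropy, so it receives the larger weight, which matches the intuition that an attribute whose categories are concentrated within a cluster is more informative for that cluster.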

CLC number: