Bayesian clustering algorithm for categorical data

doi:10.11772/j.issn.1001-9081.2017.04.1026

Journal of Computer Applications, 2017, Vol. 37, Issue (4): 1026-1031. DOI: 10.11772/j.issn.1001-9081.2017.04.1026

ZHU Jie¹, CHEN Lifei²

1. Southwest China Institute of Electronic Technology, Chengdu Sichuan 610036, China;
2. School of Mathematics and Computer Science, Fujian Normal University, Fuzhou Fujian 350117, China

Received: 2016-09-12 Revised: 2016-12-23 Online: 2017-04-10 Published: 2017-04-19
Corresponding author: CHEN Lifei
About the authors: ZHU Jie (born 1971), male, from Yuyao, Zhejiang, senior engineer; main research interests: pattern recognition and target recognition. CHEN Lifei (born 1972), male, from Changle, Fujian, professor, Ph.D.; main research interests: statistical machine learning, data mining, and pattern recognition.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61175123) and the Natural Science Foundation of Fujian Province (2015J01238).

Abstract

Defining a meaningful distance function between objects is difficult in categorical data clustering. To address this problem, a categorical data clustering algorithm based on Bayesian probability estimation was proposed. Firstly, an attribute-weighted probability model was proposed, in which each categorical attribute is assigned an individual weight reflecting its importance for clustering. Secondly, a clustering objective function based on maximum likelihood estimation was derived through a transformation with Bayes' formula, and a partitioning algorithm was proposed to optimize it; the algorithm groups objects according to the weighted likelihood between objects and clusters rather than the pairwise distances between objects. Thirdly, an expression for computing the attribute weights was derived, showing that the weight of a categorical attribute is inversely proportional to the entropy of its category distribution. Experiments were conducted on real-world and synthetic datasets. The results show that the proposed algorithm achieves higher clustering accuracy than existing distance-based algorithms, with improvements of 5% to 48% on bioinformatics data, and that it yields meaningful attribute weights.

Key words: data clustering, categorical attribute, attribute weighting, Bayesian clustering, probability model
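
The three steps in the abstract (weighted probability model, likelihood-based partitioning, entropy-based weights) can be illustrated with a small Python sketch. This is not the authors' implementation: the paper's exact objective function and weight expression are not given on this page, so the Laplace smoothing, the `eps` constant, the weight formula (proportional to the reciprocal of the mean within-cluster entropy, normalized to sum to 1), and all function names below are assumptions chosen only to show the general idea.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in nats) of a Counter of category frequencies."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def attribute_weights(data, labels, k, eps=1e-3):
    """ASSUMED weighting: w_d is proportional to 1 / (mean within-cluster entropy + eps)."""
    inverse = []
    for d in range(len(data[0])):
        h = 0.0
        for c in range(k):
            members = [row[d] for row, lab in zip(data, labels) if lab == c]
            if members:  # empty clusters contribute nothing
                h += len(members) / len(data) * entropy(Counter(members))
        inverse.append(1.0 / (h + eps))
    total = sum(inverse)
    return [v / total for v in inverse]


def weighted_log_likelihood(row, cluster_rows, weights, n_values, n_total, k):
    """log prior + sum_d w_d * log p(x_d | cluster), both Laplace-smoothed (assumption)."""
    size = len(cluster_rows)
    score = math.log((size + 1) / (n_total + k))
    for d, value in enumerate(row):
        count = sum(1 for r in cluster_rows if r[d] == value)
        score += weights[d] * math.log((count + 1) / (size + n_values[d]))
    return score


def cluster(data, k, init_labels, max_iter=20):
    """Partition data by weighted likelihood; no pairwise distance is computed."""
    labels = list(init_labels)
    n_values = [len({row[d] for row in data}) for d in range(len(data[0]))]
    weights = attribute_weights(data, labels, k)
    for _ in range(max_iter):
        groups = [[row for row, lab in zip(data, labels) if lab == c] for c in range(k)]
        new_labels = [
            max(
                range(k),
                key=lambda c: weighted_log_likelihood(
                    row, groups[c], weights, n_values, len(data), k
                ),
            )
            for row in data
        ]
        weights = attribute_weights(data, new_labels, k)
        if new_labels == labels:  # assignments are stable: stop
            break
        labels = new_labels
    return labels, weights


if __name__ == "__main__":
    # Toy data: attribute 0 separates the groups, attribute 1 is noise.
    toy = [("a", "x"), ("a", "y"), ("a", "x"), ("a", "y"),
           ("b", "x"), ("b", "y"), ("b", "x"), ("b", "y")]
    print(cluster(toy, k=2, init_labels=[0, 0, 0, 1, 1, 1, 1, 0]))
```

In this sketch an attribute whose categories are nearly constant within each cluster has low entropy and therefore receives a large weight, which mirrors the inverse relation between weight and entropy stated in the abstract. The initial labels are passed in explicitly so that the run is deterministic; a practical version would likely use random restarts.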

CLC number:

TP274.2

ZHU Jie, CHEN Lifei. Bayesian clustering algorithm for categorical data[J]. Journal of Computer Applications, 2017, 37(4): 1026-1031.

