Soft subspace clustering algorithm for imbalanced data

doi:10.11772/j.issn.1001-9081.2017.10.2952

Journal of Computer Applications, 2017, Vol. 37, Issue (10): 2952-2957.

CHENG Lingfang¹, YANG Tianpeng², CHEN Lifei²

1. Jinshan College, Fujian Agriculture and Forestry University, Fuzhou, Fujian 350002, China;
2. School of Mathematics and Computer Science, Fujian Normal University, Fuzhou, Fujian 350117, China

Received: 2017-05-15 Revised: 2017-07-10 Online: 2017-10-10 Published: 2017-10-16
Corresponding author: CHEN Lifei, E-mail: clf@fafu.edu.cn
About the authors: CHENG Lingfang (born 1983), female, from Tengzhou, Shandong; lecturer, M.S.; research interests: machine learning, data mining. YANG Tianpeng (born 1991), male, from Shiyan, Hubei; master's student; research interests: data mining. CHEN Lifei (born 1972), male, from Changle, Fujian; professor, Ph.D.; research interests: statistical machine learning, data mining, pattern recognition.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61672157) and the Natural Science Foundation of Fujian Province (2015J01238).

Abstract

Abstract: K-means-type soft subspace clustering algorithms are affected by the uniform effect and therefore cannot cluster imbalanced data effectively. To address this problem, a new partition-based soft subspace clustering algorithm for imbalanced data is proposed. First, a bi-weighting method is proposed that assigns each attribute a feature weight and, at the same time, assigns each cluster a cluster weight reflecting its importance for clustering. Second, a new distance measure for mixed-type data is proposed to balance the differences between attributes of different types and between categorical attributes with different numbers of categories. Third, an objective function for subspace clustering of imbalanced data is defined on the basis of the bi-weighting method, and expressions for optimizing both the cluster weights and the feature weights are derived. A series of experiments on real-world data sets shows that the bi-weighting method used in the new algorithm learns more accurate soft subspaces for the clusters hidden in imbalanced data. Compared with existing K-means-type soft subspace clustering algorithms, the proposed algorithm achieves higher clustering accuracy on imbalanced data, with an improvement of nearly 50% on the bioinformatics data used in the experiments.

Key words: soft subspace clustering, imbalanced data, feature weight, cluster weight
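
The bi-weighting idea described in the abstract can be illustrated with a minimal sketch. This page does not give the paper's objective function or its distance measure, so everything below is an assumption made for illustration: the names `attribute_distance`, `biweighted_distance` and `assign` are hypothetical, numeric attributes are assumed to be pre-scaled to [0, 1], categorical attributes use simple matching, and the cluster weight is assumed to enter as a multiplicative factor. It is not the authors' formulation.

```python
from typing import List, Sequence, Union

Value = Union[float, str]


def attribute_distance(a: Value, b: Value, categorical: bool) -> float:
    """Dissimilarity of a single attribute (illustrative choice, not the paper's measure)."""
    if categorical:
        return 0.0 if a == b else 1.0  # simple matching for categorical values
    return (float(a) - float(b)) ** 2  # squared difference; values assumed scaled to [0, 1]


def biweighted_distance(
    x: Sequence[Value],
    center: Sequence[Value],
    feature_weights: Sequence[float],
    cluster_weight: float,
    categorical: Sequence[bool],
) -> float:
    """Feature-weighted distance from x to one cluster center, scaled by a cluster weight."""
    weighted = sum(
        w * attribute_distance(a, b, is_cat)
        for a, b, w, is_cat in zip(x, center, feature_weights, categorical)
    )
    # Assumption: the cluster weight acts as a multiplicative factor on the distance.
    return cluster_weight * weighted


def assign(
    x: Sequence[Value],
    centers: Sequence[Sequence[Value]],
    feature_weights: Sequence[Sequence[float]],
    cluster_weights: Sequence[float],
    categorical: Sequence[bool],
) -> int:
    """Return the index of the cluster with the smallest bi-weighted distance to x."""
    distances: List[float] = [
        biweighted_distance(x, c, fw, cw, categorical)
        for c, fw, cw in zip(centers, feature_weights, cluster_weights)
    ]
    return min(range(len(distances)), key=distances.__getitem__)


if __name__ == "__main__":
    centers = [(0.5, "b"), (0.9, "a")]
    categorical = [False, True]
    x = (0.5, "a")
    # Uniform feature weights: the categorical mismatch pushes x to cluster 1.
    print(assign(x, centers, [[0.5, 0.5], [0.5, 0.5]], [1.0, 1.0], categorical))  # prints 1
    # Cluster 0 ignores the categorical attribute (its own soft subspace): x goes to cluster 0.
    print(assign(x, centers, [[1.0, 0.0], [0.5, 0.5]], [1.0, 1.0], categorical))  # prints 0
```

In this sketch each cluster carries its own feature-weight vector, which is what makes the subspace "soft": the same object can be near one cluster and far from another depending on which attributes each cluster emphasizes. The cluster weight then shifts the assignment boundary between clusters, which is the kind of mechanism that can counteract the uniform effect on imbalanced data.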

CLC Number: TP274.2

CHENG Lingfang, YANG Tianpeng, CHEN Lifei. Soft subspace clustering algorithm for imbalanced data[J]. Journal of Computer Applications, 2017, 37(10): 2952-2957.

