Soft subspace clustering algorithm for imbalanced data

doi:10.11772/j.issn.1001-9081.2017.10.2952

Journal of Computer Applications, 2017, Vol. 37, Issue 10: 2952-2957.

CHENG Lingfang¹, YANG Tianpeng², CHEN Lifei²

1. Jinshan College, Fujian Agriculture and Forestry University, Fuzhou Fujian 350002, China;
2. School of Mathematics and Computer Science, Fujian Normal University, Fuzhou Fujian 350117, China

Received: 2017-05-15 Revised: 2017-07-10 Online: 2017-10-16 Published: 2017-10-10
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61672157) and the Natural Science Foundation of Fujian Province (2015J01238).

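Corresponding author: CHEN Lifei (陈黎飞), E-mail: clf@fafu.edu.cn
About the authors: CHENG Lingfang (程铃钫, born 1983), female, from Tengzhou, Shandong, lecturer, M.S.; research interests: machine learning, data mining. YANG Tianpeng (杨天鹏, born 1991), male, from Shiyan, Hubei, M.S. candidate; research interest: data mining. CHEN Lifei (陈黎飞, born 1972), male, from Changle, Fujian, professor, Ph.D.; research interests: statistical machine learning, data mining, pattern recognition.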
Abstract

To address the problem that current K-means-type soft subspace algorithms cannot cluster imbalanced data effectively because of the uniform effect, a new partition-based algorithm is proposed for soft subspace clustering of imbalanced data. First, a bi-weighting method is proposed, in which each attribute is assigned a feature weight and each cluster is assigned a cluster weight that measures its importance for clustering. Second, a new distance measure for mixed-type data is proposed to balance the differences between attributes of different types and between categorical attributes with different numbers of categories. Third, an objective function based on the bi-weighting method is defined for subspace clustering of imbalanced data, and the expressions for optimizing both the cluster weights and the feature weights are derived. A series of experiments on real-world data sets shows that the bi-weighting method used in the new algorithm learns more accurate soft subspaces for the clusters hidden in imbalanced data. Compared with existing K-means-type soft subspace clustering algorithms, the proposed algorithm yields higher clustering accuracy on imbalanced data, with an improvement of about 50% on the bioinformatics data used in the experiments.
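The uniform effect mentioned in the abstract can be illustrated with a small sketch. It runs plain Lloyd's K-means (not the proposed algorithm) on synthetic one-dimensional data with one large and one small group. The data, the function name `kmeans_1d`, and the initial centers are assumptions chosen for illustration only.

```python
def kmeans_1d(points, centers, iterations=100):
    """Plain Lloyd's K-means on 1-D data; returns (centers, labels)."""
    centers = list(centers)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: abs(p - centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            new_centers.append(sum(members) / len(members) if members else centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Synthetic imbalanced data (hypothetical): 100 points in the large group,
# 10 points in the small group, separated by a clear gap.
large = [0.1 * i for i in range(100)]       # 0.0 .. 9.9
small = [12.0 + 0.1 * i for i in range(10)]  # 12.0 .. 12.9
points = large + small

centers, labels = kmeans_1d(points, centers=[min(points), max(points)])

# Count how many points of the large group end up sharing a cluster
# with the small group.
small_label = labels[-1]
misassigned = sum(1 for lab in labels[:len(large)] if lab == small_label)
print("centers:", centers)
print("large-group points placed with the small group:", misassigned)
```

Although the two groups are well separated, minimizing the within-cluster sum of squares favors splitting the large group, so part of it is absorbed into the small group's cluster. This is the behavior that cluster weights are meant to counteract.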

Key words: soft subspace clustering, imbalanced data, feature weight, cluster weight
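As a rough illustration of the bi-weighting idea, the sketch below assigns a mixed-type object to the cluster with the smallest weighted distance. The abstract does not give the paper's formulas, so everything here is an assumption: the multiplicative form of the cluster weight, the simple-matching distance for categorical attributes (a stand-in for the paper's new measure), and all names and numbers.

```python
from typing import List, Sequence, Union

Value = Union[float, str]


def attribute_distance(a: Value, b: Value) -> float:
    """Per-attribute dissimilarity (assumed form, not the paper's measure):
    simple matching for categories, squared difference for numbers."""
    if isinstance(a, str) or isinstance(b, str):
        return 0.0 if a == b else 1.0
    return (a - b) ** 2


def biweighted_distance(x: Sequence[Value], center: Sequence[Value],
                        feature_weights: Sequence[float],
                        cluster_weight: float) -> float:
    """Feature-weighted distance to one cluster, scaled by its cluster weight.
    The multiplicative cluster weight is a hypothetical choice."""
    total = sum(w * attribute_distance(xj, cj)
                for xj, cj, w in zip(x, center, feature_weights))
    return cluster_weight * total


def assign(x: Sequence[Value], centers: List[Sequence[Value]],
           feature_weights: List[Sequence[float]],
           cluster_weights: Sequence[float]) -> int:
    """Return the index of the cluster with the smallest bi-weighted distance."""
    scores = [biweighted_distance(x, c, fw, cw)
              for c, fw, cw in zip(centers, feature_weights, cluster_weights)]
    return min(range(len(scores)), key=scores.__getitem__)


# Hypothetical example: one numeric attribute scaled to [0, 1], one categorical.
centers = [(0.2, "A"), (0.8, "B")]
feature_weights = [(0.9, 0.1), (0.5, 0.5)]  # one weight vector per cluster
x = (0.45, "B")

equal = assign(x, centers, feature_weights, cluster_weights=[1.0, 1.0])
skewed = assign(x, centers, feature_weights, cluster_weights=[0.3, 1.0])
print("equal cluster weights ->", equal)
print("skewed cluster weights ->", skewed)
```

With equal cluster weights the object is assigned by the feature-weighted distance alone. Changing the cluster weights shifts the decision boundary between clusters, which is the lever a bi-weighting scheme has for offsetting differences in cluster size.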

CLC Number: TP274.2

CHENG Lingfang, YANG Tianpeng, CHEN Lifei. Soft subspace clustering algorithm for imbalanced data[J]. Journal of Computer Applications, 2017, 37(10): 2952-2957.
