Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (10): 2952-2957. DOI: 10.11772/j.issn.1001-9081.2017.10.2952

• Data Science and Technology •

Soft subspace clustering algorithm for imbalanced data

CHENG Lingfang1, YANG Tianpeng2, CHEN Lifei2

  1. Jinshan College, Fujian Agriculture and Forestry University, Fuzhou Fujian 350002, China;
  2. School of Mathematics and Computer Science, Fujian Normal University, Fuzhou Fujian 350117, China
  • Received: 2017-05-15 Revised: 2017-07-10 Online: 2017-10-10 Published: 2017-10-16
  • Corresponding author: CHEN Lifei, E-mail: clf@fafu.edu.cn
  • About the authors: CHENG Lingfang (1983-), female, born in Tengzhou, Shandong, lecturer, M.S.; main research interests: machine learning, data mining. YANG Tianpeng (1991-), male, born in Shiyan, Hubei, M.S. candidate; main research interest: data mining. CHEN Lifei (1972-), male, born in Changle, Fujian, professor, Ph.D.; main research interests: statistical machine learning, data mining, pattern recognition.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61672157) and the Natural Science Foundation of Fujian Province (2015J01238).

Abstract: Existing K-means-type soft subspace clustering algorithms cannot cluster imbalanced data effectively because of the uniform effect (the tendency of K-means-type methods to produce clusters of relatively uniform size). To address this problem, a new partition-based soft subspace clustering algorithm for imbalanced data is proposed. First, a bi-weighting method is proposed that assigns each attribute a feature weight and, at the same time, assigns each cluster a cluster weight reflecting its importance. Second, a new distance measure for mixed-type data is proposed to balance the differences between attributes of different types and between categorical attributes with different numbers of categories. Third, an objective function for subspace clustering of imbalanced data is defined based on the bi-weighting method, and expressions for optimizing the cluster weights and feature weights are derived. A series of experiments on real-world data sets shows that the bi-weighting method used in the new algorithm learns more accurate soft subspaces for the clusters in imbalanced data. Compared with existing K-means-type soft subspace clustering algorithms, the proposed algorithm achieves higher clustering accuracy on imbalanced data, with an improvement of nearly 50% on the bioinformatics data used in the experiments.

Key words: soft subspace clustering, imbalanced data, feature weight, cluster weight
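
The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general bi-weighting idea under stated assumptions, not the authors' algorithm. It assumes numeric attributes are pre-scaled to [0, 1], uses a plain 0/1 mismatch for categorical attributes, and lets a larger cluster weight make a cluster more attractive by dividing the distance. All function names and both of these formula choices are hypothetical placeholders.

```python
from typing import Sequence, Union

Value = Union[float, str]


def attribute_distance(x: Value, center: Value) -> float:
    """Per-attribute dissimilarity (placeholder, not the paper's measure).

    Numeric values are assumed pre-scaled to [0, 1] and use the squared
    difference; categorical values (strings) use a simple 0/1 mismatch.
    """
    if isinstance(x, str) or isinstance(center, str):
        return 0.0 if x == center else 1.0
    return (x - center) ** 2


def biweighted_distance(
    point: Sequence[Value],
    center: Sequence[Value],
    feature_weights: Sequence[float],
    cluster_weight: float,
) -> float:
    """Feature-weighted distance to one cluster, scaled by its cluster weight.

    Assumption: dividing by the cluster weight is a hypothetical way to let
    an important (e.g. small) cluster compete with a large one.
    """
    weighted_sum = sum(
        w * attribute_distance(x, c)
        for x, c, w in zip(point, center, feature_weights)
    )
    return weighted_sum / cluster_weight


def assign(
    point: Sequence[Value],
    centers: Sequence[Sequence[Value]],
    feature_weights: Sequence[Sequence[float]],
    cluster_weights: Sequence[float],
) -> int:
    """Return the index of the cluster with the smallest bi-weighted distance."""
    distances = [
        biweighted_distance(point, centers[k], feature_weights[k], cluster_weights[k])
        for k in range(len(centers))
    ]
    return min(range(len(centers)), key=distances.__getitem__)


# Two clusters over one numeric and one categorical attribute.
centers = [(0.2, "a"), (0.8, "b")]
feature_weights = [(0.5, 0.5), (0.5, 0.5)]  # one weight vector per cluster
point = (0.45, "c")

equal = assign(point, centers, feature_weights, [1.0, 1.0])
boosted = assign(point, centers, feature_weights, [1.0, 2.0])
print(equal, boosted)
```

With equal cluster weights the point goes to the numerically closer cluster; raising the second cluster's weight shifts the assignment to it. This shows only how a per-cluster weight can counteract the pull toward uniform cluster sizes, and says nothing about how the paper actually learns the weights.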

CLC number: