Journal of Computer Applications, 2022, Vol. 42, Issue (12): 3750-3755. DOI: 10.11772/j.issn.1001-9081.2021101837

• Data Science and Technology •

基于改进的半监督聚类的不平衡分类算法

陆宇, 赵凌云, 白斌雯, 姜震

  1. 江苏大学 计算机科学与通信工程学院,江苏 镇江 212013
  • 收稿日期:2021-10-28 修回日期:2022-01-06 接受日期:2022-01-10 发布日期:2022-01-19 出版日期:2022-12-10
  • 通讯作者: 姜震
  • 作者简介:陆宇(1997—),男,江苏徐州人,硕士研究生,主要研究方向:机器学习
    赵凌云(1996—),男,江苏淮安人,硕士研究生,主要研究方向:机器学习
    白斌雯(2001—),男,山西太原人,主要研究方向:机器学习
  • 基金资助:
    国家自然科学基金资助项目(61906077);江苏大学大学生实践创新训练计划项目(202010299312X)

Imbalanced classification algorithm based on improved semi-supervised clustering

Yu LU, Lingyun ZHAO, Binwen BAI, Zhen JIANG

  1. College of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China
  • Received: 2021-10-28 Revised: 2022-01-06 Accepted: 2022-01-10 Online: 2022-01-19 Published: 2022-12-10
  • Contact: Zhen JIANG
  • About author: LU Yu, born in 1997, M. S. candidate. His research interests include machine learning.
    ZHAO Lingyun, born in 1996, M. S. candidate. His research interests include machine learning.
    BAI Binwen, born in 2001. His research interests include machine learning.
  • Supported by:
    National Natural Science Foundation of China (61906077); Practical Innovation Training Program for College Students of Jiangsu University (202010299312X)

摘要:

不平衡分类的相关算法是机器学习领域的研究热点之一,其中的过采样通过重复抽取或者人工合成来增加少数类样本,以实现数据集的再平衡。然而当前的过采样方法大部分是基于原有的样本分布进行的,难以揭示更多的数据集分布特征。为了解决以上问题,首先,提出一种改进的半监督聚类算法来挖掘数据的分布特征;其次,基于半监督聚类的结果,在属于少数类的簇中选择置信度高的无标签数据(伪标签样本)加入原始训练集,这样做除了实现数据集的再平衡外,还可以利用半监督聚类获得的分布特征来辅助不平衡分类;最后,融合半监督聚类和分类的结果来预测最终的类别标签,从而进一步提高算法的不平衡分类性能。选择G-mean和曲线下面积(AUC)作为评价指标,将所提算法与TU、CDSMOTE等7个基于过采样或欠采样的不平衡分类算法在10个公开数据集上进行了对比分析。实验结果表明,与TU、CDSMOTE相比,所提算法在AUC指标上分别平均提高了6.7%和3.9%,在G-mean指标上分别平均提高了7.6%和2.1%,且在两个评价指标上相较于所有对比算法都取得了最高的平均结果。可见所提算法能够有效地提高不平衡分类性能。

关键词: 不平衡分类, 半监督聚类, 伪标签样本, 过采样, 融合

Abstract:

Imbalanced classification is one of the research hotspots in the field of machine learning. Oversampling, one of its main approaches, increases the number of minority-class samples through repeated sampling or artificial synthesis in order to rebalance the dataset. However, most existing oversampling methods operate on the original sample distribution and therefore struggle to reveal further distribution characteristics of the dataset. To address this problem, firstly, an improved semi-supervised clustering algorithm was proposed to mine the distribution characteristics of the data. Secondly, based on the results of semi-supervised clustering, unlabeled data with high confidence (pseudo-labeled samples) were selected from the minority-class clusters and added to the original training set. In this way, in addition to rebalancing the dataset, the distribution characteristics obtained by semi-supervised clustering could be used to assist imbalanced classification. Finally, the results of semi-supervised clustering and classification were fused to predict the final class labels, which further improved the imbalanced classification performance of the algorithm. With G-mean and Area Under Curve (AUC) selected as evaluation indicators, the proposed algorithm was compared with seven oversampling-based or undersampling-based imbalanced classification algorithms, such as TU (Trainable Undersampling) and CDSMOTE (Class Decomposition Synthetic Minority Oversampling TEchnique), on 10 public datasets. Experimental results show that, compared with TU and CDSMOTE, the proposed algorithm improves the average AUC by 6.7% and 3.9% respectively, and the average G-mean by 7.6% and 2.1% respectively. The proposed algorithm also achieves the highest average result on both evaluation indicators among all the compared algorithms. These results indicate that the proposed algorithm can effectively improve imbalanced classification performance.

Key words: imbalanced classification, semi-supervised clustering, pseudo-labeled sample, oversampling, fusion
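
The three steps outlined in the abstract (cluster, add confident minority pseudo-labels, fuse) can be illustrated with a small sketch. It is not the authors' algorithm: the abstract does not specify the improved semi-supervised clustering, the confidence measure, or the fusion rule, so the code below substitutes a label-seeded k-means, a distance-ratio confidence, and a weighted average. All function names, the `threshold` and `alpha` parameters, and the toy data are assumptions made for illustration. The `g_mean` helper uses the standard definition, the square root of sensitivity times specificity.

```python
import math
from collections import Counter

def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def seeded_clusters(labeled, unlabeled, n_iter=10):
    """Stand-in for semi-supervised clustering (assumption): one cluster per
    class, centers initialised from labeled samples, labeled samples fixed."""
    classes = sorted({y for _, y in labeled})
    centers = {c: centroid([x for x, y in labeled if y == c]) for c in classes}

    def nearest(x):
        return min(classes, key=lambda c: math.dist(x, centers[c]))

    for _ in range(n_iter):
        assign = [nearest(x) for x in unlabeled]
        for c in classes:
            members = [x for x, y in labeled if y == c]
            members += [x for x, a in zip(unlabeled, assign) if a == c]
            centers[c] = centroid(members)
    return centers, [nearest(x) for x in unlabeled]

def cluster_confidence(x, centers, c):
    """Hypothetical confidence: 1 minus the share of total distance to c."""
    d = {k: math.dist(x, v) for k, v in centers.items()}
    total = sum(d.values())
    return 1.0 if total == 0 else 1.0 - d[c] / total

def add_minority_pseudo_labels(labeled, unlabeled, minority, threshold=0.8):
    """Add confident unlabeled points from the minority cluster, at most as
    many as are needed to match the largest class."""
    centers, assign = seeded_clusters(labeled, unlabeled)
    counts = Counter(y for _, y in labeled)
    need = max(counts.values()) - counts[minority]
    candidates = sorted(
        ((cluster_confidence(x, centers, minority), x)
         for x, a in zip(unlabeled, assign) if a == minority),
        reverse=True,
    )
    chosen = [(x, minority) for conf, x in candidates if conf >= threshold]
    return labeled + chosen[:need], centers

def fuse_scores(p_classifier, p_cluster, alpha=0.5):
    """Assumed fusion rule: weighted average of the two minority scores."""
    return alpha * p_classifier + (1 - alpha) * p_cluster

def g_mean(y_true, y_pred, positive):
    """Geometric mean of sensitivity and specificity for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sens * spec)

# Toy data (made up): four majority samples (class 0), one minority (class 1).
labeled = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 0), ((5, 5), 1)]
unlabeled = [(5, 6), (6, 5), (6, 6), (3, 3), (0.5, 0.5)]
augmented, centers = add_minority_pseudo_labels(labeled, unlabeled, minority=1)
```

On this toy data the point `(3, 3)` falls in the minority cluster but lies roughly midway between the two centers, so its confidence is below the threshold and it is not added. A classifier trained on `augmented` would then supply the `p_classifier` argument of `fuse_scores`.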
