Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (7): 2138-2144. DOI: 10.11772/j.issn.1001-9081.2024070919

• The 39th CCF National Conference of Computer Applications (CCF NCCA 2024) •

Fine-tuned and filtered oversampling method based on agglomerative hierarchical clustering

Zheng GU1,2,3, Xuebin CHEN1,2,3, Hongyang ZHANG1,2,3, Yuxin LI1,2,3

  1. College of Sciences, North China University of Science and Technology, Tangshan, Hebei 063210, China
  2. Hebei Provincial Key Laboratory of Data Science and Application (North China University of Science and Technology), Tangshan, Hebei 063210, China
  3. Tangshan Key Laboratory of Data Science (North China University of Science and Technology), Tangshan, Hebei 063210, China
  • Received: 2024-07-03 Revised: 2024-09-25 Accepted: 2024-09-29 Online: 2025-07-10 Published: 2025-07-10
  • Contact: Xuebin CHEN
  • About the authors: GU Zheng, born in 1999 in Langfang, Hebei, M.S. candidate, CCF member. Her research interests include data analysis.
    CHEN Xuebin, born in 1970 in Tangshan, Hebei, Ph.D., professor, CCF distinguished member. His research interests include big data security, Internet of Things security, and network security. E-mail: chxb@ncst.edu.cn
    ZHANG Hongyang, born in 1999 in Huai'an, Jiangsu, M.S. candidate, CCF member. His research interests include data security and privacy protection.
    LI Yuxin, born in 2000 in Linfen, Shanxi, M.S. candidate, CCF member. Her research interests include data analysis.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (U20A20179)

Abstract:

To address poor classification performance on imbalanced datasets, a fine-tuned and filtered oversampling method based on Agglomerative Hierarchical Clustering (AHC) was proposed; the method is applicable to multi-class imbalanced data. Firstly, the AHC algorithm was applied when clustering the imbalanced dataset, with the majority and minority classes clustered separately, so that inter-class relationships were taken into account while class overlap was effectively avoided. Secondly, to balance the dataset while preserving the characteristics of the original data, a fine-tuned oversampling algorithm was designed. Thirdly, to improve the classification accuracy of the generated samples, a label tendency evaluation and filtering method based on propensity score matching was proposed. Finally, the proposed method was validated experimentally and compared with three methods: MDO (Mahalanobis Distance-based Over-sampling technique), AND-SMOTE (Automatic Neighborhood size Determination method for Synthetic Minority Over-sampling TEchnique), and K-means SMOTE. Experimental results show that the proposed method performs well on six datasets, including Abalone, Contraceptive and Yeast, which verifies its effectiveness.

Key words: imbalanced data, multi-class classification, oversampling, Agglomerative Hierarchical Clustering (AHC), label tendency evaluation
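
The three stages named in the abstract (per-class AHC clustering, oversampling, and score-based filtering of generated samples) can be illustrated with a rough Python sketch. This is not the authors' algorithm: the abstract does not specify the fine-tuning rule or the scoring model, so the sketch assumes plain interpolation between two points of the same cluster as the generator and a logistic regression class probability as a stand-in for the propensity-style label score. For brevity it clusters only the classes being oversampled. All function names, the `n_clusters` value, and the `threshold` value are hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.linear_model import LogisticRegression


def cluster_class(X_cls, n_clusters=2):
    """Cluster one class's samples with AHC; tiny classes stay in one cluster."""
    if len(X_cls) <= n_clusters:
        return np.zeros(len(X_cls), dtype=int)
    return AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X_cls)


def oversample_class(X_cls, n_new, rng, n_clusters=2):
    """Generate n_new candidates by interpolating pairs inside one AHC cluster.

    Assumption: interpolation stands in for the paper's fine-tuned generator.
    """
    labels = cluster_class(X_cls, n_clusters)
    candidates = []
    for _ in range(n_new):
        # Drawing a label from `labels` picks clusters in proportion to size.
        members = X_cls[labels == rng.choice(labels)]
        a, b = members[rng.integers(len(members), size=2)]
        candidates.append(a + rng.random() * (b - a))
    return np.array(candidates).reshape(-1, X_cls.shape[1])


def filter_by_label_score(X, y, candidates, target, threshold=0.5):
    """Keep candidates whose estimated probability of `target` meets threshold.

    Assumption: logistic regression stands in for the paper's scoring model.
    """
    if len(candidates) == 0:
        return candidates
    model = LogisticRegression(max_iter=1000).fit(X, y)
    col = list(model.classes_).index(target)
    scores = model.predict_proba(candidates)[:, col]
    return candidates[scores >= threshold]


def balance(X, y, seed=0, n_clusters=2, threshold=0.5):
    """Oversample every non-majority class toward the largest class size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    X_out, y_out = [X], [y]
    for cls, count in zip(classes, counts):
        deficit = counts.max() - count
        if deficit == 0:
            continue
        cand = oversample_class(X[y == cls], deficit, rng, n_clusters)
        kept = filter_by_label_score(X, y, cand, cls, threshold)
        X_out.append(kept)
        y_out.append(np.full(len(kept), cls))
    return np.vstack(X_out), np.concatenate(y_out)
```

Calling `balance(X, y)` on a feature matrix and an integer label vector returns the original samples followed by the retained synthetic ones. Because filtering can discard candidates, the classes in this sketch end up at most as large as the majority class rather than exactly balanced.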

CLC Number: