Journal of Computer Applications, 2022, Vol. 42, Issue (5): 1455-1463. DOI: 10.11772/j.issn.1001-9081.2021050736

• Data Science and Technology •

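Imbalanced data classification algorithm based on ball cluster partitioning and undersampling with density peak optimization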

Xuewen LIU1, Jikui WANG1, Zhengguo YANG1, Qiang LI1,2, Jihai YI1, Bing LI1, Feiping NIE3

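  1. School of Information Engineering, Lanzhou University of Finance and Economics, Lanzhou 730020
  2. Key Laboratory of E-Business Technology and Application of Gansu Province (Lanzhou University of Finance and Economics), Lanzhou 730020
  3. Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi’an 710072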
  • Received: 2021-05-10 Revised: 2021-09-19 Accepted: 2021-10-14 Online: 2022-03-08 Published: 2022-05-10
  • Corresponding author: WANG Jikui
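  • About the authors: LIU Xuewen, born in 1996 in Ganzhou, Jiangxi, M. S. candidate. His research interests include machine learning and artificial intelligence.
    WANG Jikui, born in 1978 in Tengzhou, Shandong, Ph. D., associate professor, CCF member. His research interests include machine learning and artificial intelligence. wjkweb@163.com
    YANG Zhengguo, born in 1987 in Jishishan, Gansu, Ph. D., associate professor, CCF member. His research interests include machine learning and artificial intelligence.
    LI Qiang, born in 1973 in Lanzhou, Gansu, M. S., professor, CCF member. His research interests include machine learning and artificial intelligence.
    YI Jihai, born in 1974 in Yichun, Heilongjiang, M. S., lecturer. His research interests include machine learning and artificial intelligence.
    LI Bing, born in 1997 in Yuncheng, Shanxi, M. S. candidate. Her research interests include machine learning and artificial intelligence.
    NIE Feiping, born in 1977 in Ji'an, Jiangxi, Ph. D., professor, doctoral supervisor, CCF member. His research interests include machine learning and artificial intelligence.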
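  • Supported by:
    National Natural Science Foundation of China (61772427); Gansu Provincial Institutions of Higher Learning Innovation Ability Promotion Program (2021B-145); Natural Science Foundation of Gansu Province (17JR5RA177); Key Research and Development Program of Gansu Province (21YF5FA087)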

Imbalanced data classification algorithm based on ball cluster partitioning and undersampling with density peak optimization

Xuewen LIU1, Jikui WANG1, Zhengguo YANG1, Qiang LI1,2, Jihai YI1, Bing LI1, Feiping NIE3

  1. School of Information Engineering, Lanzhou University of Finance and Economics, Lanzhou Gansu 730020, China
  2. Key Laboratory of E-Business Technology and Application of Gansu Province (Lanzhou University of Finance and Economics), Lanzhou Gansu 730020, China
  3. Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi’an Shaanxi 710072, China
  • Received: 2021-05-10 Revised: 2021-09-19 Accepted: 2021-10-14 Online: 2022-03-08 Published: 2022-05-10
  • Contact: Jikui WANG
  • About the authors: LIU Xuewen, born in 1996, M. S. candidate. His research interests include machine learning and artificial intelligence.
    WANG Jikui, born in 1978, Ph. D., associate professor. His research interests include machine learning and artificial intelligence.
    YANG Zhengguo, born in 1987, Ph. D., associate professor. His research interests include machine learning and artificial intelligence.
    LI Qiang, born in 1973, M. S., professor. His research interests include machine learning and artificial intelligence.
    YI Jihai, born in 1974, M. S., lecturer. His research interests include machine learning and artificial intelligence.
    LI Bing, born in 1997, M. S. candidate. Her research interests include machine learning and artificial intelligence.
    NIE Feiping, born in 1977, Ph. D., professor. His research interests include machine learning and artificial intelligence.
  • Supported by:
    National Natural Science Foundation of China (61772427); Gansu Provincial Institutions of Higher Learning Innovation Ability Promotion Program (2021B-145); Natural Science Foundation of Gansu Province (17JR5RA177); Key Research and Development Program of Gansu Province (21YF5FA087)

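Abstract:

Embedding cost-sensitive and resampling methods in ensemble algorithms is an effective hybrid strategy for imbalanced data classification. To address the problem that existing hybrid methods rarely consider the intra-class and inter-class distributions of samples when computing misclassification costs and undersampling, a density-peak-optimized ball cluster partitioning and undersampling algorithm for imbalanced data classification, DPBCPUSBoost, was proposed. First, density peak information was used to define the sampling weights of majority-class samples; each majority-class ball cluster that has a "neighbor cluster" was divided into an "easily misclassified area" and a "hard-to-misclassify area", and the sampling weights of samples in the "easily misclassified area" were raised. Second, the majority-class samples were undersampled according to the sampling weights in the first iteration and according to the sample distribution weights in each later iteration; the undersampled majority-class samples and the minority-class samples then formed a temporary training set on which a weak classifier was trained. Finally, the density peak information and class distribution of the samples were combined to define a different misclassification cost for every sample, and a cost adjustment function increased the weights of samples with high misclassification costs. Experimental results on 10 KEEL datasets show that, compared with existing imbalanced data classification algorithms such as Adaptive Boosting (AdaBoost), Cost-sensitive AdaBoost (AdaCost), Random UnderSampling Boosting (RUSBoost) and UnderSampling and Cost-sensitive Boosting (USCBoost), DPBCPUSBoost achieves the best performance on more datasets in terms of Accuracy, F1-Score, Geometric Mean (G-mean) and Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC). The experimental results verify the effectiveness of the definitions of sample misclassification cost and sampling weight in DPBCPUSBoost.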
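As a reminder of why the abstract reports F1-Score and G-mean alongside Accuracy, the following minimal Python sketch computes three of the named metrics from predicted labels. It is illustrative only: the function names, the choice of label 1 as the minority (positive) class, and the toy labels are assumptions and do not come from the paper.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP, TN, treating `positive` as the minority class."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


def imbalance_metrics(y_true, y_pred, positive=1):
    """Return accuracy, F1 and G-mean; undefined ratios fall back to 0.0."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn) if tp + fn else 0.0        # minority-class recall
    specificity = tn / (tn + fp) if tn + fp else 0.0   # majority-class recall
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return {"accuracy": accuracy, "f1": f1, "g_mean": g_mean}


# Toy case (assumed data): 2 minority and 8 majority samples, and a
# classifier that always predicts the majority class.
always_majority = imbalance_metrics([1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0] * 10)
```

In this toy case the accuracy stays high while F1 and G-mean collapse to zero, which is the failure mode that class-aware metrics are meant to expose.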

Key words: imbalanced data classification, density peak, ball clustering, cost-sensitive, undersampling

Abstract:

Integrating cost-sensitive and resampling methods into ensemble algorithms is an effective hybrid strategy for imbalanced data classification. To address the problem that the misclassification cost calculation and the undersampling process in existing hybrid methods rarely consider the intra-class and inter-class distributions of samples, an imbalanced data classification algorithm based on ball cluster partitioning and undersampling with density peak optimization was proposed, named Boosting algorithm based on Ball Cluster Partitioning and UnderSampling with Density Peak optimization (DPBCPUSBoost). Firstly, the density peak information was used to define the sampling weights of majority samples; each majority ball cluster with a “neighbor cluster” was divided into an “easily misclassified area” and a “hard-to-misclassify area”, and the sampling weights of samples in the “easily misclassified area” were increased. Secondly, the majority samples were undersampled based on the sampling weights in the first iteration and based on the sample distribution weights in each subsequent iteration, and a weak classifier was trained on the temporary training set that combined the undersampled majority samples with all minority samples. Finally, the density peak information of the samples was combined with their class distribution to define a different misclassification cost for every sample, and the weights of samples with higher misclassification costs were increased by the cost adjustment function. Experimental results on 10 KEEL datasets indicate that DPBCPUSBoost achieves the highest performance on more datasets than imbalanced data classification algorithms such as Adaptive Boosting (AdaBoost), Cost-sensitive AdaBoost (AdaCost), Random UnderSampling Boosting (RUSBoost) and UnderSampling and Cost-sensitive Boosting (USCBoost), in terms of Accuracy, F1-Score, Geometric Mean (G-mean) and Area Under Curve (AUC) of Receiver Operating Characteristic (ROC). The experimental results verify that the definitions of sample misclassification cost and sampling weight in DPBCPUSBoost are effective.
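The two per-round steps described in the abstract, weighted undersampling of the majority class and cost-adjusted reweighting, might be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the abstract gives no formulas, so the weight values, the per-sample costs in [0, 1], the form of the adjustment (`exp(alpha * (1 + cost))` for misclassified samples, `exp(-alpha * (1 - cost))` otherwise) and all function names are hypothetical. The density-peak and ball-cluster computations that would produce the real weights and costs are not reproduced here.

```python
import math
import random


def weighted_undersample(indices, weights, k, rng):
    """Draw k distinct indices, favouring larger (strictly positive) weights.

    Uses the Efraimidis-Spirakis key trick: each item gets the key
    u ** (1 / w) with u uniform in (0, 1], and the k largest keys are kept.
    """
    keyed = []
    for i in indices:
        u = 1.0 - rng.random()  # rng.random() is in [0, 1), so u is in (0, 1]
        keyed.append((u ** (1.0 / weights[i]), i))
    keyed.sort(reverse=True)
    return [i for _, i in keyed[:k]]


def cost_adjusted_update(weights, y_true, y_pred, costs, alpha):
    """Return normalized sample weights after one boosting round.

    Hypothetical adjustment (an assumption, not the paper's function):
    higher-cost samples gain more weight when misclassified and lose
    less weight when classified correctly.
    """
    updated = []
    for w, t, p, c in zip(weights, y_true, y_pred, costs):
        if t != p:
            updated.append(w * math.exp(alpha * (1.0 + c)))
        else:
            updated.append(w * math.exp(-alpha * (1.0 - c)))
    total = sum(updated)
    return [w / total for w in updated]


# Toy data (assumed): 8 majority samples (label 0), 2 minority samples (label 1).
rng = random.Random(0)
y = [0] * 8 + [1] * 2
majority = [i for i, label in enumerate(y) if label == 0]
minority = [i for i, label in enumerate(y) if label == 1]

# Pretend samples 0 and 1 lie in an easily misclassified area, so their
# sampling weights are raised (the value 5.0 is arbitrary).
sampling_w = [5.0, 5.0] + [1.0] * 8
chosen = weighted_undersample(majority, sampling_w, len(minority), rng)
train_idx = chosen + minority  # balanced temporary training set

# One reweighting step for a weak classifier that predicts 0 everywhere.
dist_w = [1.0 / len(y)] * len(y)
costs = [0.1] * 8 + [0.9] * 2  # assumed per-sample misclassification costs
new_w = cost_adjusted_update(dist_w, y, [0] * len(y), costs, alpha=0.5)
```

In later rounds, `new_w` would take the place of `sampling_w` when drawing the majority subset, mirroring the switch from sampling weights to sample distribution weights that the abstract describes.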

Key words: imbalanced data classification, density peak, ball clustering, cost-sensitive, undersampling

CLC number: