Journal of Computer Applications, 2024, Vol. 44, Issue (5): 1428-1436. DOI: 10.11772/j.issn.1001-9081.2023050803

Special Issue: Artificial Intelligence; 2023 CCF Conference on Artificial Intelligence (CCFAI 2023)

• 2023 CCF Conference on Artificial Intelligence (CCFAI 2023)

Oversampling algorithm based on synthesizing minority class samples using relationship between features

Mingzhu LEI1, Hao WANG1, Rong JIA1, Lin BAI1, Xiaoying PAN1,2

  1. School of Computer Science & Technology, Xi’an University of Posts and Telecommunications, Xi’an, Shaanxi 710121, China
  2. Shaanxi Key Laboratory of Network Data Analysis and Intelligent Processing, Xi’an, Shaanxi 710121, China
  • Received: 2023-06-25 Revised: 2023-07-30 Accepted: 2023-08-02 Online: 2023-08-03 Published: 2024-05-10
  • Contact: Xiaoying PAN
  • About the authors: LEI Mingzhu, born in 1999, M. S. candidate. Her research interests include evolutionary computation and data mining.
    WANG Hao, born in 1997, M. S. candidate. His research interests include data mining and time series prediction.
    JIA Rong, born in 1996, M. S. Her research interests include data mining and ensemble learning.
    BAI Lin, born in 1980, M. S., associate professor. Her research interests include data mining and cluster analysis.
  • Supported by:
    Key Research and Development Program of Shaanxi Province (2023-YBSF-476)

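  • Author profiles: LEI Mingzhu (born 1999), female, from Xianyang, Shaanxi; M. S. candidate, CCF member. Research interests: evolutionary computation, data mining.
    WANG Hao (born 1997), male, from Ankang, Shaanxi; M. S. candidate, CCF member. Research interests: data mining, time series prediction.
    JIA Rong (born 1996), female, from Yuncheng, Shanxi; M. S. Research interests: data mining, ensemble learning.
    BAI Lin (born 1980), female, from Shangluo, Shaanxi; associate professor, M. S., CCF member. Research interests: data mining, cluster analysis.
    Primary contact: PAN Xiaoying (born 1981), female, from Lishui, Zhejiang; professor, Ph. D., CCF member. Research interests: data mining, evolutionary computation.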

Abstract:

Data imbalance is very common in real-world applications. To improve overall classification accuracy, classifiers often sacrifice the minority class, yet misclassifying minority class samples can have very serious consequences. Because traditional resampling algorithms tend to ignore the spatial distribution of the data and the relationships between the features of minority class samples, a new sampling algorithm, SABRF (Sampling Algorithm Based on Relationship between Features), was proposed to generate a new sample set. SABRF preserves the key distinguishing features of the imbalanced dataset through Pareto-based multi-objective feature selection, and captures the relationships among the key features of minority class samples with an XGBoost (eXtreme Gradient Boosting) regression model. In addition, a new sample selection strategy was proposed to evaluate the quality of the newly generated samples and retain the better ones. Experiments were conducted on six publicly available UCI datasets and one real-world dataset on thrombosis after orthopedic surgery. The results show that SABRF performs well on Area Under the receiver operating characteristic Curve (AUC), F1 score (F1_score), and Geometric Mean (G_mean). Moreover, classifying with the new samples chosen by the sample selection strategy, which is based on multi-index evaluation, also gives the best classification results on the imbalanced data, which verifies the effectiveness of the strategy.

Key words: imbalanced data, oversampling, feature selection, sample quality evaluation, eXtreme Gradient Boosting (XGBoost) regression, Pareto frontier
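
The abstract's opening point, that overall accuracy can look good while the minority class is sacrificed, can be illustrated with a small sketch. It assumes the standard binary definitions of F1 score and G-mean (the geometric mean of sensitivity and specificity); the paper may define these metrics differently, and the function names and toy labels below are hypothetical and not taken from the paper.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, treating `positive` as the minority class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def imbalance_metrics(y_true, y_pred, positive=1):
    """Accuracy, F1 and G-mean under the standard binary definitions (assumed)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    recall = safe_div(tp, tp + fn)        # sensitivity on the minority class
    specificity = safe_div(tn, tn + fp)   # recall on the majority class
    precision = safe_div(tp, tp + fp)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "f1": safe_div(2 * precision * recall, precision + recall),
        "g_mean": sqrt(recall * specificity),
    }


# Hypothetical toy data: 95 majority samples (0) and 5 minority samples (1).
y_true = [0] * 95 + [1] * 5

# A degenerate classifier that always predicts the majority class.
always_majority = [0] * 100
majority_metrics = imbalance_metrics(y_true, always_majority)

# A classifier that finds 4 of 5 minority samples at the cost of 5 false alarms.
balanced_pred = [0] * 90 + [1] * 5 + [1] * 4 + [0]
balanced_metrics = imbalance_metrics(y_true, balanced_pred)

print(majority_metrics)
print(balanced_metrics)
```

In this toy case the always-majority classifier reaches high accuracy while its F1 score and G-mean are both zero, whereas the second classifier has slightly lower accuracy but non-zero F1 and G-mean. This is the usual motivation for reporting such metrics on imbalanced data.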

