Journal of Computer Applications, 2020, Vol. 40, Issue (6): 1662-1667. DOI: 10.11772/j.issn.1001-9081.2019101817

• Data Science and Technology •

Over-sampling algorithm for imbalanced datasets

CUI Xin, XU Hua, SU Chen

  1. School of Internet of Things Engineering, Jiangnan University, Wuxi Jiangsu 214122, China
  • Received: 2019-10-27 Revised: 2019-12-17 Online: 2020-06-10 Published: 2020-06-18
  • Contact: CUI Xin
  • About authors: CUI Xin, born in 1997, from Nanyang, Henan, M. S. candidate. His research interests include data mining and machine learning. XU Hua, born in 1978, from Wuxi, Jiangsu, Ph. D., associate professor. Her research interests include computational intelligence, workshop scheduling and big data. SU Chen, born in 1993, from Yantai, Shandong, M. S. candidate, CCF member. His research interests include machine learning and data mining.

Abstract:

In the Synthetic Minority Over-sampling TEchnique (SMOTE), noise samples may take part in synthesizing new samples, so the plausibility of the new samples is hard to guarantee. To address this problem, an improved algorithm that incorporates clustering, called Clustered Synthetic Minority Over-sampling TEchnique (CSMOTE), was proposed. The algorithm abandons SMOTE's linear interpolation between nearest neighbors; instead, it synthesizes new samples by linear interpolation between each cluster center of the minority class and the samples in the corresponding cluster. The samples involved in the synthesis are also screened, which reduces the chance that noise samples take part. On six real-world datasets, CSMOTE was compared in repeated experiments with four improved variants of SMOTE and two under-sampling algorithms, and it obtained the highest AUC value on every dataset. The experimental results show that CSMOTE achieves higher classification performance and can effectively address imbalanced sample distribution in datasets.
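
The center-to-sample interpolation described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which clustering algorithm, how many clusters, or which screening rule CSMOTE uses, so the plain k-means routine, the parameter `k`, the distance-based `keep_ratio` filter, and all function names below are hypothetical stand-ins.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means (an assumed choice of clustering algorithm)."""
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Move each center to the mean of its members; empty clusters stay put.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


def center_oversample(minority, n_new, k=2, keep_ratio=0.8, seed=0):
    """Synthesize n_new points on segments between cluster centers and members.

    keep_ratio is a placeholder screening rule (assumption): in each cluster,
    only the fraction of members closest to the center may take part.
    """
    rng = random.Random(seed)
    centers, labels = kmeans(minority, k, rng)
    pairs = []  # (center, sample) pairs that survive the screening step
    for j, center in enumerate(centers):
        members = sorted(
            (p for p, lab in zip(minority, labels) if lab == j),
            key=lambda p: sq_dist(p, center),
        )
        kept = members[:max(1, int(len(members) * keep_ratio))]
        pairs.extend((center, p) for p in kept)
    synthetic = []
    for _ in range(n_new):
        center, sample = rng.choice(pairs)
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([c + u * (s - c) for c, s in zip(center, sample)])
    return synthetic


if __name__ == "__main__":
    minority = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]]
    for point in center_oversample(minority, n_new=4, k=2, seed=1):
        print(point)
```

Because every synthetic point lies on a segment between a cluster mean and one of that cluster's own members, it stays inside the region spanned by the retained minority samples, which is the intuition behind interpolating toward cluster centers rather than toward arbitrary nearest neighbors.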

Key words: cluster center, imbalanced dataset, Synthetic Minority Over-sampling TEchnique (SMOTE), clustering, over-sampling

CLC Number: