Over-sampling algorithm for imbalanced datasets

doi:10.11772/j.issn.1001-9081.2019101817

Journal of Computer Applications, 2020, Vol. 40, Issue (6): 1662-1667.

CUI Xin, XU Hua, SU Chen

School of Internet of Things Engineering, Jiangnan University, Wuxi Jiangsu 214122, China

Received: 2019-10-27 Revised: 2019-12-17 Online: 2020-06-10 Published: 2020-06-18
Contact: CUI Xin (born 1997)
About the authors: CUI Xin, born in 1997, from Nanyang, Henan, M.S. candidate. His research interests include data mining and machine learning. XU Hua, born in 1978, from Wuxi, Jiangsu, Ph.D., associate professor. Her research interests include computational intelligence, workshop scheduling, and big data. SU Chen, born in 1993, from Yantai, Shandong, M.S. candidate, CCF member. His research interests include machine learning and data mining.

Abstract:

In the Synthetic Minority Over-sampling TEchnique (SMOTE), noise samples may take part in synthesizing new samples, so the plausibility of the new samples is hard to guarantee. To address this problem, an improved algorithm that incorporates clustering, called Clustered Synthetic Minority Over-sampling TEchnique (CSMOTE), is proposed. The algorithm abandons SMOTE's linear interpolation between nearest neighbors. Instead, it synthesizes new samples by linear interpolation between each minority-class cluster center and the samples in the corresponding cluster. The samples involved in the synthesis are also screened, which reduces the chance that noise samples participate. On six real-world datasets, CSMOTE was compared in repeated experiments with four improved SMOTE algorithms and two under-sampling algorithms, and it obtained the highest AUC (area under the ROC curve) value on every dataset. The experimental results show that CSMOTE achieves higher classification performance and can effectively address imbalanced sample distribution in datasets.

Key words: cluster center, imbalanced dataset, Synthetic Minority Over-sampling TEchnique (SMOTE), clustering, over-sampling
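
The idea summarized in the abstract can be illustrated with a short Python sketch. SMOTE generates a point as x_new = x_i + r * (x_nn - x_i), where x_nn is a nearest minority neighbor of x_i and r is drawn uniformly from [0, 1]. The sketch below instead interpolates between a cluster center c and a sample x of the same cluster, x_new = c + r * (x - c). The abstract does not state which clustering algorithm, how many clusters, or which screening rule the authors use, so plain k-means, the parameter `k`, and the distance-based filter `radius_factor` are assumptions made purely for illustration, and all function names are hypothetical. This is not the authors' implementation.

```python
import random


def sqdist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda c: sqdist(p, centers[c]))


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means (an assumed choice of clustering algorithm)."""
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = [nearest(p, centers) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest(p, centers) for p in points]
    return centers, labels


def csmote_sketch(minority, n_new, k=2, radius_factor=1.5, seed=0):
    """Synthesize n_new points between cluster centers and screened samples.

    Screening rule (hypothetical): a sample may take part in synthesis only
    if its distance to its cluster center is at most radius_factor times the
    mean distance within that cluster. Use radius_factor >= 1 so that at
    least one sample always survives the screening.
    """
    rng = random.Random(seed)
    centers, labels = kmeans(minority, k, rng)

    candidates = []  # (center, sample) pairs that passed the screening
    for c in range(k):
        members = [p for p, l in zip(minority, labels) if l == c]
        if not members:
            continue
        dists = [sqdist(p, centers[c]) ** 0.5 for p in members]
        mean_d = sum(dists) / len(dists)
        candidates += [(centers[c], p) for p, d in zip(members, dists)
                       if d <= radius_factor * mean_d]

    synthetic = []
    for _ in range(n_new):
        center, x = rng.choice(candidates)
        r = rng.random()  # interpolation coefficient in [0, 1)
        synthetic.append([ci + r * (xi - ci) for ci, xi in zip(center, x)])
    return synthetic


if __name__ == "__main__":
    # Toy minority class: two tight groups plus one far-away point.
    minority = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0],
                [50.0, 50.0]]
    new_points = csmote_sketch(minority, n_new=5, k=2, seed=1)
    print(len(new_points), "synthetic samples generated")
```

Because every synthetic point lies on a segment between a cluster center and a member of that cluster, it stays inside the region spanned by the minority samples. Whether an isolated point is actually filtered out depends on the assumed screening rule and on how the clusters form.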

CLC number: TP301.6

CUI Xin, XU Hua, SU Chen. Over-sampling algorithm for imbalanced datasets[J]. Journal of Computer Applications, 2020, 40(6): 1662-1667.
