Under-sampling method based on sample density peaks for imbalanced data

doi:10.11772/j.issn.1001-9081.2019060962

Abstract

Abstract: Imbalanced data classification is an important problem in data mining and machine learning, and the way the data is re-sampled strongly affects classification accuracy. Existing under-sampling methods for imbalanced data cannot keep the distribution of the sampled data consistent with that of the original data. To address this problem, an under-sampling method based on sample density peaks is proposed. Firstly, the density peak clustering algorithm is applied to the majority class samples to estimate the central and boundary regions of the resulting clusters, and the weight of each sample is computed from the local density of the cluster region in which the sample lies and from the distribution of the different density peaks. Then, the majority class samples are under-sampled according to these weights, so that the number of extracted majority class samples decreases gradually from the central region of each cluster to its boundary region. In this way, the extracted samples reflect the original data distribution well while noise is suppressed. Finally, the sampled majority class samples and all minority class samples form a balanced data set for training the classifier. Experimental results on multiple datasets show that the proposed method improves F1-measure and G-mean to some extent compared with existing under-sampling methods such as RBBag (Roughly Balanced Bagging), uNBBag (under-sampling NeighBorhood Bagging) and KAcBag (K-means AdaCost bagging), indicating that it is an effective and feasible sampling method.
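
The sampling idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the weight formula, so the sketch assumes that a sample's weight is simply its Gaussian-kernel local density from density peak clustering, and it does not separate clusters or model the density peak distribution. The function names `local_density` and `density_weighted_undersample`, the parameter `dc_percentile` and its default value are all hypothetical.

```python
import numpy as np


def local_density(X, dc_percentile=2.0):
    """Gaussian-kernel local density, as used in density peak clustering."""
    # Pairwise Euclidean distances between all samples.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Cutoff distance d_c: a low percentile of the pairwise distances.
    # The percentile is an assumption, not a value taken from the paper.
    upper = dist[np.triu_indices_from(dist, k=1)]
    dc = max(np.percentile(upper, dc_percentile), np.finfo(float).eps)
    kernel = np.exp(-(dist / dc) ** 2)
    np.fill_diagonal(kernel, 0.0)  # a sample does not count itself
    return kernel.sum(axis=1)


def density_weighted_undersample(X_maj, X_min, rng=None, dc_percentile=2.0):
    """Draw as many majority samples as there are minority samples,
    favouring dense (central) regions over sparse (boundary/noise) ones."""
    rng = np.random.default_rng(rng)
    rho = local_density(X_maj, dc_percentile)
    # Assumed weighting: weight proportional to local density. The small
    # constant keeps every probability strictly positive.
    weights = rho + 1e-12
    prob = weights / weights.sum()
    n_keep = min(len(X_min), len(X_maj))
    idx = rng.choice(len(X_maj), size=n_keep, replace=False, p=prob)
    # Balanced training set: label 0 = majority, label 1 = minority.
    X_bal = np.vstack([X_maj[idx], X_min])
    y_bal = np.concatenate([np.zeros(n_keep, dtype=int),
                            np.ones(len(X_min), dtype=int)])
    return X_bal, y_bal, idx
```

Under this assumed weighting, isolated majority samples have a density close to zero and are therefore rarely selected, which mirrors the noise-suppression effect described above. Since the keywords mention ensemble learning, one plausible use is to call `density_weighted_undersample` several times with different seeds and train one base classifier per balanced subset, but that step is likewise an assumption here.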

Key words: imbalanced data, density peak, sample weight, under-sampling, ensemble learning

摘要： 不平衡数据分类是数据挖掘和机器学习领域的一个重要问题，其中数据重抽样方法是影响分类准确率的一个重要因素。针对现有不平衡数据欠抽样方法不能很好地保持抽样样本与原有样本的分布一致的问题，提出一种基于样本密度峰值的不平衡数据欠抽样方法。首先，应用密度峰值聚类算法估计多数类样本聚成的不同类簇的中心区域和边界区域，进而根据样本所处类簇区域的局部密度和不同密度峰值的分布信息计算样本权重；然后，按照权重大小对多数类样本点进行欠抽样，使所抽取的多数类样本尽可能由类簇中心区域向边界区域逐步减少，在较好地反映原始数据分布的同时又可抑制噪声；最后，将抽取到的多数类样本与所有的少数类样本构成平衡数据集用于分类器的训练。多个数据集上的实验结果表明，与现有的RBBag、uNBBag和KAcBag等欠抽样方法相比，所提方法在F1-measure和G-mean指标上均取得一定的提升，是有效、可行的样本抽样方法。

关键词: 不平衡数据, 密度峰值, 样本权重, 欠抽样, 集成学习

CLC Number: TP301.6

SU Junning, YE Dongyi. Under-sampling method based on sample density peaks for imbalanced data[J]. Journal of Computer Applications, 2020, 40(1): 83-89.

苏俊宁, 叶东毅. 基于样本密度峰值的不平衡数据欠抽样方法[J]. 计算机应用, 2020, 40(1): 83-89.
