Under-sampling method based on sample density peaks for imbalanced data

Journal of Computer Applications, 2020, Vol. 40, Issue (1): 83-89. DOI: 10.11772/j.issn.1001-9081.2019060962

SU Junning, YE Dongyi

College of Mathematics and Computer Science, Fuzhou University, Fuzhou Fujian 350108, China

Received: 2019-06-10 Revised: 2019-07-23 Published: 2020-01-10 Online: 2019-09-27
Corresponding author: YE Dongyi
About the authors: SU Junning, born in 1994 in Fu'an, Fujian, M.S. candidate. His research interests include machine learning and data mining. YE Dongyi, born in 1964 in Quanzhou, Fujian, Ph.D., professor, CCF member. His research interests include machine learning and data mining.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61672158) and the Industry-University Cooperation Fund of Fujian Province (2018H6010).

Abstract

Imbalanced data classification is an important problem in data mining and machine learning, and the data re-sampling method is an important factor affecting classification accuracy. Existing under-sampling methods for imbalanced data do not keep the distribution of the sampled data consistent with that of the original data. To address this problem, an under-sampling method based on sample density peaks is proposed. First, the density peak clustering algorithm is applied to the majority-class samples to estimate the central and boundary regions of the resulting clusters, and the weight of each sample is computed from the local density of the cluster region in which it lies and from the distribution of the different density peaks. Then, the majority-class samples are under-sampled according to their weights, so that the number of extracted samples decreases gradually from the central region of each cluster toward its boundary region. In this way, the extracted samples reflect the original data distribution well while noise is suppressed. Finally, the extracted majority-class samples and all minority-class samples form a balanced data set that is used to train the classifier. Experimental results on multiple datasets show that, compared with existing under-sampling methods such as RBBag (Roughly Balanced Bagging), uNBBag (under-sampling NeighBorhood Bagging) and KAcBag (K-means AdaCost bagging), the proposed method achieves improvements in both F1-measure and G-mean, indicating that it is an effective and feasible sampling method.

Key words: imbalanced data, density peak, sample weight, under-sampling, ensemble learning
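
The sampling idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the weight formula, so the sketch assumes, purely for illustration, that a sample's weight is its Gaussian-kernel local density. The two density peak quantities (local density `rho` and distance `delta` to the nearest denser sample) follow the clustering algorithm of Rodriguez and Laio; the function names, the `dc_percentile` heuristic for the cutoff distance, and the use of `rho` alone as the weight are all assumptions.

```python
import numpy as np


def density_peak_stats(X, dc_percentile=2.0):
    """Return (rho, delta) for the rows of X, as used in density peak clustering."""
    # Full pairwise Euclidean distance matrix (O(n^2) memory, fine for a sketch).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Cutoff distance dc: a small percentile of the pairwise distances (heuristic).
    upper = np.triu_indices(len(X), k=1)
    dc = max(np.percentile(d[upper], dc_percentile), np.finfo(float).eps)
    # Gaussian-kernel local density, with the self-contribution removed.
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0
    # delta: distance to the nearest sample of higher density;
    # the densest sample gets its largest distance instead.
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = rho > rho[i]
        delta[i] = d[i, higher].min() if higher.any() else d[i].max()
    return rho, delta


def undersample_majority(X_maj, n_keep, rng=None, dc_percentile=2.0):
    """Draw n_keep majority-class indices with probability proportional to weight."""
    rng = np.random.default_rng(rng)
    rho, _ = density_peak_stats(X_maj, dc_percentile)
    # ASSUMPTION: weight = local density, so dense cluster cores are sampled
    # more often than sparse boundary regions or isolated (noisy) samples.
    w = rho + 1e-12  # keep every probability strictly positive
    return rng.choice(len(X_maj), size=n_keep, replace=False, p=w / w.sum())


def balanced_training_set(X, y, majority_label, rng=None):
    """Combine the sampled majority samples with all minority samples."""
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    keep = maj[undersample_majority(X[maj], len(mino), rng)]
    sel = np.concatenate([keep, mino])
    return X[sel], y[sel]
```

Samples with both a large `rho` and a large `delta` are the candidate cluster centers (density peaks); the sketch returns `delta` for inspection but does not use it in the weight. Calling `balanced_training_set` repeatedly with different seeds yields different balanced subsets, which is one way such a sampler could feed a bagging-style ensemble.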

CLC Number: TP301.6

SU Junning, YE Dongyi. Under-sampling method based on sample density peaks for imbalanced data[J]. Journal of Computer Applications, 2020, 40(1): 83-89.

