Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (1): 83-89. DOI: 10.11772/j.issn.1001-9081.2019060962

• Data Science and Technology •

Under-sampling method based on sample density peaks for imbalanced data

SU Junning, YE Dongyi

  1. College of Mathematics and Computer Science, Fuzhou University, Fuzhou, Fujian 350108, China
  • Received: 2019-06-10 Revised: 2019-07-23 Online: 2020-01-10 Published: 2019-09-27
  • Corresponding author: YE Dongyi
  • About the authors: SU Junning (born 1994), male, from Fu'an, Fujian, is a master's student whose research interests include machine learning and data mining. YE Dongyi (born 1964), male, from Quanzhou, Fujian, is a professor with a Ph.D. and a CCF member whose research interests include machine learning and data mining.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61672158) and the Industry-University Cooperation Fund of Fujian Province (2018H6010).

Abstract: Imbalanced data classification is an important problem in data mining and machine learning, and the way the data are re-sampled is an important factor in classification accuracy. Existing under-sampling methods for imbalanced data do not keep the distribution of the sampled data well aligned with that of the original data. To address this, an under-sampling method based on sample density peaks is proposed. First, the density peak clustering algorithm is applied to the majority-class samples to estimate the central and boundary regions of the clusters they form, and each sample is assigned a weight computed from the local density of the cluster region it lies in and from the distribution of the different density peaks. Then, the majority-class samples are under-sampled according to these weights, so that the number of samples drawn decreases gradually from the central region of each cluster toward its boundary region; the drawn samples thus reflect the original data distribution well while noise is suppressed. Finally, the sampled majority-class samples and all minority-class samples are combined into a balanced data set for classifier training. Experimental results on multiple data sets show that, compared with existing under-sampling methods such as RBBag (Roughly Balanced Bagging), uNBBag (under-sampling NeighBorhood Bagging), and KAcBag (K-means AdaCost bagging), the proposed method achieves some improvement in both F1-measure and G-mean, indicating that it is an effective and feasible sampling method.

Key words: imbalanced data, density peak, sample weight, under-sampling, ensemble learning
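
The three steps described in the abstract (density peak clustering of the majority class, per-sample weighting, weighted under-sampling) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the weight formula, so the weight used here (a sample's local density divided by the density of its own cluster's peak) is an assumption chosen only to make central samples more likely to be drawn than boundary ones. The function names `density_peak_weights` and `undersample_majority`, the Gaussian density kernel, the 2nd-percentile cutoff `dc`, and the synthetic data are likewise hypothetical.

```python
import numpy as np


def density_peak_weights(X, n_clusters=2, dc=None):
    """Return (weights, labels) for the rows of X from density-peak quantities.

    The weight formula is an illustrative assumption, not the paper's formula.
    """
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    if dc is None:
        # ASSUMPTION: cutoff distance = 2nd percentile of all pairwise distances.
        dc = np.percentile(d[np.triu_indices(n, k=1)], 2.0)

    # Local density with a Gaussian kernel; the self term keeps rho >= 1.
    rho = np.exp(-(d / dc) ** 2).sum(axis=1)

    # delta: distance to the nearest point that ranks higher in density.
    order = np.argsort(-rho, kind="stable")
    delta = np.full(n, d.max())
    nearest_denser = np.full(n, -1)
    for rank in range(1, n):
        i = order[rank]
        denser = order[:rank]
        j = denser[np.argmin(d[i, denser])]
        delta[i], nearest_denser[i] = d[i, j], j

    # Density peaks: largest rho * delta; the densest point is always a peak.
    gamma = rho * delta
    gamma[order[0]] = np.inf
    centers = np.argsort(-gamma)[:n_clusters]

    # Every other point joins the cluster of its nearest denser neighbor.
    labels = np.full(n, -1)
    labels[centers] = np.arange(len(centers))
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]

    # ASSUMPTION: weight = density relative to the peak of the sample's cluster,
    # so samples near a cluster center get weights close to 1.
    weights = rho / rho[centers][labels]
    return weights, labels


def undersample_majority(X_maj, n_keep, rng, **kwargs):
    """Draw n_keep majority indices without replacement, proportional to weight."""
    weights, _ = density_peak_weights(X_maj, **kwargs)
    p = weights / weights.sum()
    return rng.choice(len(X_maj), size=n_keep, replace=False, p=p)


# Synthetic example: 200 majority samples in two clusters, 20 minority samples.
rng = np.random.default_rng(0)
X_maj = np.vstack([rng.normal(0.0, 1.0, (120, 2)), rng.normal(6.0, 1.0, (80, 2))])
X_min = rng.normal(3.0, 0.5, (20, 2))

keep = undersample_majority(X_maj, n_keep=len(X_min), rng=rng, n_clusters=2)
X_bal = np.vstack([X_maj[keep], X_min])  # balanced training set
y_bal = np.concatenate([np.zeros(len(keep), dtype=int), np.ones(len(X_min), dtype=int)])
print(X_bal.shape, np.bincount(y_bal))  # (40, 2) [20 20]
```

Because the draw is random, repeating it with different seeds yields different balanced subsets, which is one way a sampler of this kind could feed a bagging-style ensemble. Dividing by the cluster's peak density, rather than using raw density, is a design choice in this sketch to keep sparse clusters from being under-represented; the paper may weight samples differently.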

CLC Number: