
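An oversampling method for imbalanced data based on sample potential and noise evolution

LENG Qiangkui, SUN Xuezi, MENG Xiangfu

  1. Liaoning Technical University
  • Received: 2023-08-28 Revised: 2023-10-22 Online: 2023-12-18
  • Corresponding author: LENG Qiangkui
  • Funding:
    National Natural Science Foundation of China (two projects); Natural Science Foundation of Liaoning Province; Scientific Research Project of the Education Department of Liaoning Province; Doctoral Research Start-up Fund of Liaoning Technical University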

Abstract: Oversampling is widely regarded as an effective strategy for imbalanced data classification. Most existing methods use the K-nearest neighbor (KNN) technique to select seed samples for oversampling, but their behavior becomes markedly unstable when the KNN parameter value changes. The recently proposed radial-based oversampling (RBO) method addresses this issue, but it tends to introduce a large amount of noise after oversampling. This paper therefore proposes an oversampling method for imbalanced data based on sample potential and noise evolution, which iteratively evolves the oversampled dataset. The method has three core steps. First, RBO synthesizes minority class samples by computing sample potential, reducing the imbalance of the original data. Second, natural neighbors (NaN) are used as an error detection technique to identify suspected noise samples in the oversampled dataset. Finally, an improved differential evolution (DE) method iteratively evolves the detected suspected noise samples. Compared with traditional oversampling methods, the proposed method exploits important boundary information in the dataset more fully, giving classifiers more support and improving their classification performance. Extensive comparative experiments were conducted on 22 benchmark datasets against seven classical sampling methods, each combined with three different classifiers. The results show that the proposed method achieves higher F1 and G-mean values and handles noise better than sampling methods with post-filters, so it deals with imbalanced data classification more effectively. Statistical analysis also shows that the proposed method ranks higher in the Friedman ranking.

Key words: K-nearest neighbor, radial-based oversampling, sample potential, natural neighbor, differential evolution, imbalanced data classification
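
The abstract's first step relies on "sample potential". As a rough illustration only, the sketch below computes a Gaussian radial-basis potential at a point, where majority samples add to the potential and minority samples subtract from it. The function name `mutual_class_potential`, the Euclidean norm, the sign convention, and the spread parameter `gamma` are assumptions made for this example; the abstract does not give the exact formula used by RBO or by the proposed method.

```python
import numpy as np


def mutual_class_potential(x, majority, minority, gamma=1.0):
    """Gaussian-RBF potential at x: majority contribution minus minority contribution.

    Hypothetical helper for illustration; norm, sign, and gamma are assumptions.
    """
    x = np.asarray(x, dtype=float)
    majority = np.asarray(majority, dtype=float)
    minority = np.asarray(minority, dtype=float)
    # Euclidean distance from x to every sample of each class (assumed norm).
    d_maj = np.linalg.norm(majority - x, axis=1)
    d_min = np.linalg.norm(minority - x, axis=1)
    # Each sample contributes exp(-(d / gamma)^2); majority adds, minority subtracts.
    phi_maj = np.sum(np.exp(-(d_maj / gamma) ** 2))
    phi_min = np.sum(np.exp(-(d_min / gamma) ** 2))
    return float(phi_maj - phi_min)


# Toy data: three majority samples near the origin, one minority sample far away.
majority = [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5]]
minority = [[3.0, 3.0]]
near_majority = mutual_class_potential([0.2, 0.2], majority, minority)
near_minority = mutual_class_potential([3.0, 2.9], majority, minority)
```

Under this convention, a positive value marks a region dominated by the majority class and a negative value a region dominated by the minority class, which requires no K-nearest-neighbor parameter.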

CLC number:
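
The abstract reports results in terms of F1 and G-mean. For readers unfamiliar with these metrics, the sketch below computes both from binary confusion-matrix counts using their standard definitions, with the minority class treated as positive. The function name `f1_and_gmean` and the example counts are hypothetical and are not taken from the paper's experiments.

```python
import math


def f1_and_gmean(tp, fn, fp, tn):
    """Return (F1, G-mean) with the minority class treated as the positive class."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # sensitivity (true positive rate)
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    # G-mean is the geometric mean of recall and specificity.
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


# Illustrative counts: 10 minority and 90 majority test samples.
f1, gmean = f1_and_gmean(tp=8, fn=2, fp=4, tn=86)

# A classifier that always predicts the majority class scores zero on both metrics.
f1_trivial, gmean_trivial = f1_and_gmean(tp=0, fn=10, fp=0, tn=90)
```

Both metrics drop to zero when the minority class is never predicted, even though plain accuracy on the same 10-versus-90 split would be 0.9, which is why they are preferred for imbalanced data.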