Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (8): 2466-2475. DOI: 10.11772/j.issn.1001-9081.2023081145

• Data science and technology •

Oversampling method for imbalanced data based on sample potential and noise evolution

Qiangkui LENG, Xuezi SUN, Xiangfu MENG

  1. School of Electronics and Information Engineering, Liaoning Technical University, Huludao, Liaoning 125105, China
  • Received: 2023-08-29 Revised: 2023-10-22 Accepted: 2023-11-03 Online: 2024-08-22 Published: 2024-08-10
  • Contact: Qiangkui LENG
  • About author: LENG Qiangkui, born in 1981, Ph. D., professor. His research interests include artificial intelligence and machine learning.
    SUN Xuezi, born in 1998, M. S. candidate. His research interests include artificial intelligence and machine learning.
    MENG Xiangfu, born in 1981, Ph. D., professor. His research interests include big data analysis and application.
  • Supported by:
    This work is partially supported by National Natural Science Foundation of China (61602056, 61772249); Natural Science Foundation of Liaoning Province (2019-ZD-0493); Research Project of Liaoning Provincial Department of Education (LQ2019012); Doctoral Research Start-up Fund of Liaoning Technical University (21-1043).


冷强奎, 孙薛梓, 孟祥福

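  • Author biography: LENG Qiangkui (born 1981), male, from Jianping, Liaoning; professor, doctoral supervisor, Ph. D., CCF senior member. His research interests include artificial intelligence and machine learning.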


Oversampling is an effective strategy for imbalanced data classification. Most existing methods use the K-Nearest Neighbor (KNN) technique to select seed samples for oversampling, but their results are often unstable because they are sensitive to the KNN parameter value. The Radial-Basis Oversampling (RBO) method addresses this issue, but it tends to introduce a substantial amount of noise. To refine the oversampled dataset iteratively, an oversampling method for imbalanced data based on sample potential and noise evolution was proposed. Firstly, the RBO method was used to synthesize minority class samples by calculating sample potential, which reduced the imbalance of the original data. Secondly, Natural Neighbor (NaN) was employed as an error detection technique to identify suspected noise samples in the oversampled dataset. Finally, an improved Differential Evolution (DE) method was applied to iteratively refine the detected suspected noise samples. Compared with traditional oversampling methods, the proposed method can better explore important boundary information in the dataset and thereby help classifiers improve their classification performance. Extensive comparative experiments were conducted on 22 benchmark datasets against seven classical sampling methods, each combined with three different classifiers. The experimental results show that the proposed method achieves higher F1 and G-mean values, handles noise better than sampling methods with post-filters, and therefore deals with imbalanced data classification more effectively. In addition, statistical analysis indicates that the proposed method achieves a higher Friedman ranking.

Key words: K-Nearest Neighbor (KNN), Radial-Basis Oversampling (RBO), sample potential, natural neighbor, Differential Evolution (DE), imbalanced data classification
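
The abstract names two numerical building blocks, sample potential and Differential Evolution, without giving their formulas. The following is a minimal sketch of both under stated assumptions, not the authors' implementation. The potential is assumed to be a Gaussian radial-basis sum over L1 distances with a spread parameter `gamma`, where majority samples add and minority samples subtract. The DE step is the classic DE/rand/1/bin scheme, substituted here because the abstract does not describe the improved variant. The function names `class_potential` and `de_rand_1` and all parameter defaults are hypothetical.

```python
import math
import random


def class_potential(x, majority, minority, gamma=1.0):
    """Gaussian radial-basis potential at point x (assumed form).

    Majority samples add to the potential and minority samples subtract
    from it, so a low value marks a region dominated by the minority class.
    The L1 distance and the spread `gamma` are assumptions of this sketch.
    """
    def rbf(p):
        dist = sum(abs(a - b) for a, b in zip(p, x))
        return math.exp(-((dist / gamma) ** 2))

    return sum(rbf(p) for p in majority) - sum(rbf(p) for p in minority)


def de_rand_1(target, a, b, c, f=0.5, cr=0.9, rng=random):
    """One classic DE/rand/1/bin step: mutation, then binomial crossover."""
    # Mutation: perturb base vector a by the scaled difference of b and c.
    mutant = [ai + f * (bi - ci) for ai, bi, ci in zip(a, b, c)]
    # Crossover: one randomly chosen gene is always taken from the mutant.
    forced = rng.randrange(len(target))
    return [
        m if (rng.random() < cr or j == forced) else t
        for j, (t, m) in enumerate(zip(target, mutant))
    ]


# Toy data: two majority points near the origin, one minority point far away.
majority = [(0.0, 0.0), (0.2, 0.1)]
minority = [(2.0, 2.0)]
near_minority = class_potential((2.0, 2.0), majority, minority)  # negative
near_majority = class_potential((0.0, 0.0), majority, minority)  # positive
```

In this sketch, a suspected noise sample would play the role of `target`, and the trial vector returned by `de_rand_1` would replace it only if some fitness criterion improves; the abstract does not state which criterion the proposed method uses.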



