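An oversampling method for imbalanced data based on sample potential and noise evolution

LENG Qiangkui, SUN Xuezi, MENG Xiangfu

Liaoning Technical University

Received: 2023-08-28 Revised: 2023-10-22 Online: 2023-12-18
Corresponding author: LENG Qiangkui
Supported by:
National Natural Science Foundation of China (two projects); Natural Science Foundation of Liaoning Province; Scientific Research Project of the Education Department of Liaoning Province; Doctoral Research Start-up Fund of Liaoning Technical University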

Abstract

Abstract: Oversampling is widely regarded as an effective strategy for imbalanced data classification. Most existing methods use the K-nearest neighbor (KNN) technique to select seed samples for oversampling, but changes in the KNN parameter value make most of these methods markedly unstable. The recently proposed radial-based oversampling (RBO) method addresses this issue, but it tends to introduce a substantial amount of noise after oversampling. To overcome this drawback, this paper proposes an oversampling method for imbalanced data based on sample potential and noise evolution, which iteratively refines the oversampled dataset. The core steps are as follows. First, the RBO method synthesizes minority class samples by calculating sample potential, thereby reducing the imbalance of the original data. Second, natural neighbors (NaN) are employed as an error detection technique to identify suspected noise samples in the oversampled dataset. Finally, an improved differential evolution (DE) method iteratively evolves the detected suspected noise samples. Compared with traditional oversampling methods, the proposed method exploits important boundary information in the dataset more fully, giving classifiers more support and improving their classification performance. Extensive comparative experiments were conducted on 22 benchmark datasets against seven classical sampling methods, each combined with three different classifiers. The experimental results show that the proposed method achieves higher F1 and G-mean values and handles noise better than sampling methods with post-filters, and thus addresses imbalanced data classification more effectively. In addition, statistical analysis indicates a higher Friedman ranking for the proposed method.

Key words: K-nearest neighbor, radial-based oversampling, sample potential, natural neighbor, differential evolution, imbalanced data classification
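
As an illustration of the sample potential named in the abstract, the following is a minimal sketch, not the authors' implementation. The abstract does not state the formula; the sketch assumes the Gaussian radial basis form commonly associated with RBO, in which majority samples raise the potential at a point and minority samples lower it. The function name `sample_potential`, the use of Euclidean distance, the spread parameter `gamma`, and the toy data are all assumptions made for illustration.

```python
import math


def sample_potential(x, majority, minority, gamma=1.0):
    """Assumed Gaussian-RBF potential at point x.

    Each majority sample adds exp(-(d / gamma)^2) and each minority sample
    subtracts it, where d is the Euclidean distance from x to that sample.
    The sign convention and gamma are assumptions, not taken from the abstract.
    """
    def rbf_sum(points):
        return sum(math.exp(-(math.dist(x, p) / gamma) ** 2) for p in points)

    return rbf_sum(majority) - rbf_sum(minority)


# Hypothetical toy data: a majority cluster near the origin
# and a small minority cluster near (2, 2).
majority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)]
minority = [(2.0, 2.0), (2.1, 1.9)]

near_majority = sample_potential((0.1, 0.1), majority, minority)
near_minority = sample_potential((2.0, 1.9), majority, minority)

print(f"potential near the majority cluster: {near_majority:+.3f}")
print(f"potential near the minority cluster: {near_minority:+.3f}")
```

Under this sign convention, a positive potential marks a majority-dominated region and a negative potential a minority-dominated one. How RBO uses these values to place synthetic samples is not described in the abstract and is not shown here.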

CLC Number: TP391

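LENG Qiangkui, SUN Xuezi, MENG Xiangfu. An oversampling method for imbalanced data based on sample potential and noise evolution[J]. Journal of Computer Applications.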
