A Novel k-Nearest Neighbor Classification Method for Class-Imbalanced Problem

郭华平¹, 周俊¹, 邬长安¹, 范明²

1. Xinyang Normal University
2. Zhengzhou University

Received: 2017-09-07 Revised: 2017-10-30 Online: 2017-10-30
Corresponding author: 郭华平

Abstract

k-Nearest Neighbor (kNN) is a simple yet effective supervised learning method that captures local characteristics of the data, but it does not handle class-imbalanced problems well. To address this, a novel kNN classification algorithm for class-imbalanced problems is proposed. Unlike traditional kNN, in the learning phase the algorithm first partitions the majority-class set into several clusters using a partition algorithm (such as K-Means), and then merges each cluster with the minority-class set to form a new training set on which one kNN model is trained. The algorithm thus builds a classifier library consisting of multiple kNN models. In the prediction phase, the partition algorithm (such as K-Means) is used to select one model from the library to predict the class label of an instance. In this way, each kNN model can still discover local characteristics of the data, while the effect of class imbalance on classifier performance is taken into account. In addition, the algorithm improves the prediction efficiency of kNN. To further improve performance, the oversampling technique SMOTE is applied to the algorithm. Experimental results on KEEL data sets show that, even when a random partition strategy is used to partition the majority-class set, the proposed algorithm effectively improves the generalization performance of kNN on the evaluation measures recall, g-mean, f-measure and AUC. Moreover, oversampling further improves the algorithm's performance on class-imbalanced problems, and the resulting method clearly outperforms other advanced methods for handling class imbalance.

Key words: class-imbalanced problem, k-nearest neighbor (kNN), partition, oversampling, clustering algorithm
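The learning and prediction phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `nearest`, `kmeans` and `PartitionedKNN` are hypothetical, the toy data are invented for illustration, and the rule used to select a model at prediction time (the cluster center closest to the query instance) is an assumption, since the abstract only states that the partition algorithm is used for the selection. SMOTE oversampling is omitted.

```python
import math
import random
from collections import Counter


def nearest(point, centers):
    """Index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans(points, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's K-Means; returns the list of cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, n_clusters)
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a cluster becomes empty.
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers


class PartitionedKNN:
    """Hypothetical name: one kNN training set per majority-class cluster."""

    def __init__(self, n_clusters=2, k=3):
        self.n_clusters, self.k = n_clusters, k

    def fit(self, majority, minority, maj_label=0, min_label=1):
        # Learning phase: partition the majority set, then merge each
        # cluster with the whole minority set to form one training set.
        self.centers = kmeans(majority, self.n_clusters)
        self.library = [[(p, min_label) for p in minority] for _ in self.centers]
        for p in majority:
            self.library[nearest(p, self.centers)].append((p, maj_label))
        return self

    def predict(self, x):
        # Prediction phase (assumed rule): pick the model whose cluster
        # center is closest to x, then vote among its k nearest neighbors.
        train = self.library[nearest(x, self.centers)]
        neighbors = sorted(train, key=lambda t: math.dist(x, t[0]))[: self.k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]


# Toy data (invented): two majority-class blobs and a small minority class.
majority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),
            (5.0, 5.0), (5.2, 5.1), (5.1, 4.8), (4.9, 5.2)]
minority = [(2.5, 2.5), (2.7, 2.4)]
model = PartitionedKNN(n_clusters=2, k=3).fit(majority, minority)
print(model.predict((2.6, 2.5)), model.predict((0.1, 0.1)))  # prints: 1 0
```

In this sketch each training set holds only one majority cluster plus the minority set, so the class ratio inside each model is less skewed than in the full data, and each query is compared against fewer instances than in a single kNN over all the data.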

CLC number: TP181

郭华平, 周俊, 邬长安, 范明. A Novel k-Nearest Neighbor Classification Method for Class-Imbalanced Problem[J]. Journal of Computer Applications.
