Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (4): 955-959.DOI: 10.11772/j.issn.1001-9081.2017092181


k-nearest neighbor classification method for class-imbalanced problem

GUO Huaping1, ZHOU Jun1, WU Chang'an1, FAN Ming2   

  1. School of Computer and Information Technology, Xinyang Normal University, Xinyang Henan 464000, China;
    2. School of Information Engineering, Zhengzhou University, Zhengzhou Henan 450000, China
  • Received:2017-09-08 Revised:2017-10-30 Online:2018-04-10 Published:2018-04-09

  • Corresponding author: GUO Huaping
  • About the authors: GUO Huaping (1982-), male, born in Xinyang, Henan; associate professor, Ph.D., CCF member; research interests: machine learning, data mining. ZHOU Jun (1984-), male, born in Xinyang, Henan; M.S. candidate; research interest: machine learning. WU Chang'an (1959-), male, born in Xinyang, Henan; professor, M.S.; research interests: pattern recognition, image processing. FAN Ming (1948-), male, born in Zhengzhou, Henan; professor, doctoral supervisor, M.S.; research interests: machine learning, data mining, databases.

Abstract: To improve the performance of the k-Nearest Neighbor (kNN) model on class-imbalanced data, a new kNN classification algorithm was proposed. Unlike traditional kNN, in the learning phase the majority-class set was partitioned into several clusters by a partitioning method (such as K-Means), and each cluster was then merged with the minority-class set to form a new training set for a kNN model; a classifier library consisting of several kNN models was thus constructed. In the prediction phase, the same partitioning method (such as K-Means) was used to select one model from the classifier library to predict the class label of a sample. In this way, the kNN model can efficiently discover local characteristics of the data while fully accounting for the effect of class imbalance on classifier performance; the prediction efficiency of kNN was also improved. To further enhance the performance of the proposed algorithm, the Synthetic Minority Over-sampling TEchnique (SMOTE) was applied to it. Experimental results on KEEL data sets show that, even when the majority-class set is partitioned by a random partitioning strategy, the proposed algorithm effectively improves the generalization performance of the kNN method on recall, g-mean, f-measure and Area Under the ROC Curve (AUC); moreover, oversampling further improves its performance on class-imbalanced problems, and the algorithm clearly outperforms other state-of-the-art class-imbalance methods.
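The training/prediction scheme described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustrative sketch, not the authors' implementation: the toy 2-D data, the simplified deterministic K-Means (farthest-point initialization), and all function names are assumptions introduced for clarity. Each model in the library pairs one majority-class cluster (label 0) with the full minority set (label 1); at prediction time the model whose cluster centroid is nearest to the query is selected.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, n_clusters, n_iter=10):
    """Toy K-Means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < n_clusters:
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    for _ in range(n_iter):
        clusters = [[] for _ in range(n_clusters)]
        for p in points:
            i = min(range(n_clusters), key=lambda c: dist(p, centroids[c]))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # keep old centroid if a cluster goes empty
                centroids[i] = tuple(sum(v) / len(cl) for v in zip(*cl))
    return centroids, clusters

def knn_predict(train, x, k=3):
    """Plain kNN majority vote over a labelled set [(point, label), ...]."""
    neighbours = sorted(train, key=lambda t: dist(t[0], x))[:k]
    labels = [lab for _, lab in neighbours]
    return max(set(labels), key=labels.count)

# --- learning phase: build the classifier library (hypothetical data) ---
majority = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 5.0), (5.1, 4.9)]
minority = [(2.5, 2.5), (2.6, 2.4)]

centroids, clusters = kmeans(majority, n_clusters=2)
# each model: one majority cluster (label 0) merged with the minority set (label 1)
library = [[(p, 0) for p in cl] + [(p, 1) for p in minority] for cl in clusters]

# --- prediction phase: select one model, then run kNN on it ---
def predict(x, k=3):
    i = min(range(len(centroids)), key=lambda c: dist(x, centroids[c]))
    return knn_predict(library[i], x, k)

print(predict((2.6, 2.5)))  # near the minority samples -> 1
print(predict((0.1, 0.1)))  # inside a majority cluster -> 0
```

Because each kNN model searches only one cluster plus the minority set rather than the whole training set, prediction is also cheaper than a single global kNN, which matches the efficiency claim in the abstract.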

Key words: class-imbalanced problem, k-Nearest Neighbor (kNN), partitioning, oversampling

