Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (11): 3112-3118. DOI: 10.11772/j.issn.1001-9081.2018041337

• The 7th China Conference on Data Mining (CCDM 2018) •

Genetic instance selection algorithm for K-nearest neighbor classifier

HUANG Yuyang1, DONG Minggang1,2, JING Chao1,2

  1. College of Information Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541004, China;
    2. Guangxi Key Laboratory of Embedded Technology and Intelligent System (Guilin University of Technology), Guilin, Guangxi 541004, China
  • Received: 2018-04-30  Revised: 2018-06-21  Online: 2018-11-10  Published: 2018-11-10
  • Corresponding author: DONG Minggang
  • About the authors: HUANG Yuyang (1996-), male, born in Baise, Guangxi; research interests: machine learning, intelligent computing. DONG Minggang (1977-), male, born in Anlu, Hubei; professor, Ph. D., CCF member; research interests: intelligent computing, machine learning. JING Chao (1983-), male, born in Changge, Henan; lecturer, Ph. D., CCF member; research interests: energy optimization of cloud data centers, deep reinforcement learning.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61563012, 61203109), the Guangxi Natural Science Foundation (2014GXNSFAA118371, 2015GXNSFBA139260) and the Guangxi Key Laboratory of Embedded Technology and Intelligent System Foundation.


Abstract: Traditional instance selection algorithms tend to remove non-noise samples from the training set by mistake and run inefficiently. To address these problems, a genetic instance selection algorithm for the K-Nearest Neighbor (KNN) classifier was proposed. The algorithm adopts a two-stage selection mechanism based on a decision tree and a genetic algorithm: the decision tree first determines the range within which noise samples exist, and the genetic algorithm then removes the noise samples within that range precisely, which effectively reduces the risk of mistaken removal and improves efficiency. A validation set selection strategy based on the nearest neighbor rule was further proposed to improve the instance selection accuracy of the genetic algorithm. Finally, a classification accuracy penalty function based on Mean Squared Error (MSE) was introduced to compute the fitness of individuals in the genetic algorithm, improving its effectiveness and stability. On 20 datasets, compared with PRe-classification based KNN (PRKNN), Instance and Feature Selection based on Cooperative Coevolution (IFS-CoCo) and KNN, the proposed method improves classification accuracy by 0.07-26.9, 0.03-11.8 and 0.2-12.64 percentage points respectively, and improves Area Under Curve (AUC) and Kappa by 0.25-18.32, 1.27-23.29 and 0.04-12.82 percentage points respectively. The experimental results show that the proposed method outperforms current instance selection algorithms in both classification accuracy and classification efficiency.
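The paper's implementation is not included in this record. Purely as an illustration of the two-stage idea sketched in the abstract, the following self-contained Python sketch substitutes a leave-one-out 1NN disagreement filter for the paper's decision-tree stage (stage 1), then runs a small genetic algorithm over only the flagged candidates, scoring individuals by 1NN accuracy on a validation set (stage 2); the MSE-based penalty term is omitted. All function names and parameters are illustrative assumptions, not taken from the paper.

```python
import random

def dist(a, b):
    # squared Euclidean distance between two feature tuples
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_predict(train, query, k=1):
    # majority label among the k nearest training instances
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    labels = [y for _, y in nearest]
    return max(set(labels), key=labels.count)

def accuracy(train, val, k=1):
    if not train:
        return 0.0
    return sum(knn_predict(train, x, k) == y for x, y in val) / len(val)

def noise_candidates(train):
    # Stage 1 (stand-in): flag instances misclassified by leave-one-out 1NN.
    # The paper uses a decision tree to localize noise; this proxy merely
    # narrows the GA's search space in the same spirit.
    return [i for i, (x, y) in enumerate(train)
            if knn_predict(train[:i] + train[i + 1:], x) != y]

def ga_select(train, val, cand, gens=30, pop_size=20, mut=0.1, seed=0):
    # Stage 2: GA over binary masks of the candidate set only; bit 0 = drop.
    # Fitness is 1NN accuracy on the validation set.
    rng = random.Random(seed)

    def subset(mask):
        drop = {cand[j] for j, bit in enumerate(mask) if bit == 0}
        return [t for i, t in enumerate(train) if i not in drop]

    def fitness(mask):
        return accuracy(subset(mask), val)

    # Seed the population with the keep-everything mask so the result can
    # never score worse than plain KNN on the validation set.
    pop = [[1] * len(cand)]
    pop += [[rng.randint(0, 1) for _ in cand] for _ in range(pop_size - 1)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]            # elitist selection
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, len(cand)) if len(cand) > 1 else 0
            child = a[:cut] + b[cut:]           # one-point crossover
            if cand and rng.random() < mut:     # bit-flip mutation
                j = rng.randrange(len(cand))
                child[j] ^= 1
            children.append(child)
        pop = elite + children
    return subset(max(pop, key=fitness))
```

On a toy two-cluster dataset with one mislabeled point, `noise_candidates` flags the mislabeled instance (and its immediate neighbors), and the GA then decides which of the flagged instances to keep. Because the search is restricted to the candidate set, instances never flagged in stage 1 cannot be removed by mistake, which is the efficiency and safety argument the abstract makes for the two-stage design.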

Key words: K-Nearest Neighbor (KNN), genetic algorithm, decision tree, instance selection, noise sample, machine learning

CLC number: