Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (11): 3112-3118. DOI: 10.11772/j.issn.1001-9081.2018041337

• The 7th China Conference on Data Mining (CCDM 2018) •

Genetic instance selection algorithm for K-nearest neighbor classifier

HUANG Yuyang1, DONG Minggang1,2, JING Chao1,2

  1. College of Information Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541004, China;
    2. Guangxi Key Laboratory of Embedded Technology and Intelligent System (Guilin University of Technology), Guilin, Guangxi 541004, China
  • Received: 2018-04-30  Revised: 2018-06-21  Online: 2018-11-10  Published: 2018-11-10
  • Corresponding author: DONG Minggang
  • About the authors: HUANG Yuyang (1996-), male, born in Baise, Guangxi; research interests: machine learning, intelligent computing. DONG Minggang (1977-), male, born in Anlu, Hubei; professor, Ph. D., CCF member; research interests: intelligent computing, machine learning. JING Chao (1983-), male, born in Changge, Henan; lecturer, Ph. D., CCF member; research interests: energy optimization of cloud data centers, deep reinforcement learning.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61563012, 61203109), the Guangxi Natural Science Foundation (2014GXNSFAA118371, 2015GXNSFBA139260) and the Guangxi Key Laboratory of Embedded Technology and Intelligent System Foundation.


Abstract: Traditional instance selection algorithms tend to remove non-noise samples from the training set by mistake and run inefficiently. To address these problems, a genetic instance selection algorithm for the K-Nearest Neighbor (KNN) classifier was proposed. The algorithm adopts a two-stage selection mechanism based on a decision tree and a genetic algorithm: the decision tree first determines the range within which noise samples exist, and the genetic algorithm then removes the noise samples within that range precisely, which effectively reduces the risk of mistaken removal and improves efficiency. A validation set selection strategy based on the nearest neighbor rule was further proposed to improve the instance selection accuracy of the genetic algorithm. Finally, a classification accuracy penalty function based on Mean Squared Error (MSE) was introduced to compute the fitness of individuals in the genetic algorithm, improving its effectiveness and stability. On 20 datasets, compared with PRe-classification based KNN (PRKNN), Instance and Feature Selection based on Cooperative Coevolution (IFS-CoCo) and KNN, the proposed method improves classification accuracy by 0.07-26.9, 0.03-11.8 and 0.2-12.64 percentage points respectively, and improves Area Under Curve (AUC) and Kappa by 0.25-18.32, 1.27-23.29 and 0.04-12.82 percentage points respectively. The experimental results show that the proposed method outperforms current instance selection algorithms in both classification accuracy and classification efficiency.
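The paper's implementation is not included in this record. Purely as an illustration of the two-stage idea sketched in the abstract, the following self-contained Python sketch substitutes a leave-one-out 1NN disagreement filter for the paper's decision-tree stage (stage 1), then runs a small genetic algorithm over only the flagged candidates, scoring individuals by 1NN accuracy on a validation set (stage 2); the MSE-based penalty term is omitted. All function names and parameters are illustrative assumptions, not taken from the paper.

```python
import random

def dist(a, b):
    # squared Euclidean distance between two feature tuples
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_predict(train, query, k=1):
    # majority label among the k nearest training instances
    nearest = sorted(train, key=lambda t: dist(t[0], query))[:k]
    labels = [y for _, y in nearest]
    return max(set(labels), key=labels.count)

def accuracy(train, val, k=1):
    if not train:
        return 0.0
    return sum(knn_predict(train, x, k) == y for x, y in val) / len(val)

def noise_candidates(train):
    # Stage 1 (stand-in): flag instances misclassified by leave-one-out 1NN.
    # The paper uses a decision tree to localize noise; this proxy merely
    # narrows the GA's search space in the same spirit.
    return [i for i, (x, y) in enumerate(train)
            if knn_predict(train[:i] + train[i + 1:], x) != y]

def ga_select(train, val, cand, gens=30, pop_size=20, mut=0.1, seed=0):
    # Stage 2: GA over binary masks of the candidate set only; bit 0 = drop.
    # Fitness is 1NN accuracy on the validation set.
    rng = random.Random(seed)

    def subset(mask):
        drop = {cand[j] for j, bit in enumerate(mask) if bit == 0}
        return [t for i, t in enumerate(train) if i not in drop]

    def fitness(mask):
        return accuracy(subset(mask), val)

    # Seed the population with the keep-everything mask so the result can
    # never score worse than plain KNN on the validation set.
    pop = [[1] * len(cand)]
    pop += [[rng.randint(0, 1) for _ in cand] for _ in range(pop_size - 1)]
    for _ in range(gens):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 2]            # elitist selection
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, len(cand)) if len(cand) > 1 else 0
            child = a[:cut] + b[cut:]           # one-point crossover
            if cand and rng.random() < mut:     # bit-flip mutation
                j = rng.randrange(len(cand))
                child[j] ^= 1
            children.append(child)
        pop = elite + children
    return subset(max(pop, key=fitness))
```

On a toy two-cluster dataset with one mislabeled point, `noise_candidates` flags the mislabeled instance (and its immediate neighbors), and the GA then decides which of the flagged instances to keep. Because the search is restricted to the candidate set, instances never flagged in stage 1 cannot be removed by mistake, which is the efficiency and safety argument the abstract makes for the two-stage design.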

Key words: K-Nearest Neighbor (KNN), genetic algorithm, decision tree, instance selection, noise sample, machine learning

CLC number: