Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (9): 2658-2667. DOI: 10.11772/j.issn.1001-9081.2020111801

Special Issue: Advanced computing

• Advanced computing •

Improved feature selection and classification algorithm for gene expression programming based on layer distance

ZHAN Hang1, HE Lang1, HUANG Zhangcan1, LI Huafeng1, ZHANG Qiang1, TAN Qing2   

  1. School of Science, Wuhan University of Technology, Wuhan Hubei 430070, China;
    2. School of Mathematics and Statistics, Wuhan University, Wuhan Hubei 430072, China
  • Received: 2020-11-17 Revised: 2021-03-09 Online: 2021-09-10 Published: 2021-09-15
  • Supported by:
    This work is partially supported by the General Program of the National Natural Science Foundation of China (61672391).

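  • Corresponding author: HE Lang
  • About the authors: ZHAN Hang (born 1996), male, from Xiaogan, Hubei, M.S. candidate; main research interest: intelligent computing. HE Lang (born 1974), male, from Ezhou, Hubei, professor, Ph.D.; main research interests: evolutionary computation, intelligent computing. HUANG Zhangcan (born 1960), male, from Shengzhou, Zhejiang, professor, Ph.D.; main research interests: intelligent computing, image processing. LI Huafeng (born 1995), male, from Linfen, Shanxi, M.S. candidate; main research interest: intelligent computing. ZHANG Qiang (born 1996), female, from Xinyang, Henan, M.S. candidate; main research interest: intelligent computing. TAN Qing (born 1991), male, from Changsha, Hunan, Ph.D. candidate; main research interests: intelligent computing, image processing.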

Abstract: General feature selection algorithms do not reveal an interpretable mapping between data features and data categories. To address this problem, an improved Feature Selection classification algorithm based on Layer Distance for GEP (FSLDGEP) was proposed on the basis of Gene Expression Programming (GEP) by introducing new initialization, mutation and fitness evaluation methods. Firstly, a selection probability was defined to initialize the individuals in the population directionally, so as to increase the number of effective individuals in the population. Secondly, the layer neighborhood of an individual was defined so that each individual in the population mutates based on its layer neighborhood, which addresses the blind and unguided nature of the mutation process. Finally, the dimension reduction rate and the classification accuracy were combined as the fitness value of an individual, which replaced the single-objective evolutionary mode of the population and balanced the two objectives. 5-fold and 10-fold cross-validation were performed on 7 datasets; the proposed algorithm gave the functional mapping between data features and their categories, and the obtained mapping function was used for data classification. Compared with Feature Selection based on Forest Optimization Algorithm (FSFOA), feature evaluation and selection based on Neighborhood Soft Margin (NSM), Feature Selection based on Neighborhood Effective Information Ratio (FS-NEIR) and other comparison algorithms, the proposed algorithm achieved the best dimension reduction rate on the Hepatitis, Wisconsin Prognostic Breast Cancer (WPBC), Sonar and Wisconsin Diagnostic Breast Cancer (WDBC) datasets, and the best average classification accuracy on the Hepatitis, Ionosphere, Musk1, WPBC, Heart-Statlog and WDBC datasets. The experimental results verify the feasibility, effectiveness and superiority of the proposed algorithm in feature selection and classification.
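
The abstract states that the dimension reduction rate and the classification accuracy are combined into one fitness value, but it does not give the formula. The following is a minimal sketch of one plausible combination, not the authors' definition. It assumes that the dimension reduction rate is the fraction of features discarded and that the two terms are combined by a weighted sum. The function names and the `weight` parameter are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def dimension_reduction_rate(selected: Sequence[int], n_features: int) -> float:
    """Fraction of the original features that were NOT selected (assumed definition)."""
    if n_features <= 0:
        raise ValueError("n_features must be positive")
    # Duplicate indices count once, since a feature is either used or not.
    return 1.0 - len(set(selected)) / n_features


def classification_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of samples whose predicted label equals the true label."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("label sequences must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def combined_fitness(accuracy: float, reduction: float, weight: float = 0.9) -> float:
    """Weighted sum of the two objectives.

    `weight` is a hypothetical trade-off parameter (not taken from the paper):
    1.0 rewards accuracy only, 0.0 rewards dimension reduction only.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * accuracy + (1.0 - weight) * reduction


if __name__ == "__main__":
    # An individual that uses 2 of 8 features and labels 3 of 4 samples correctly.
    reduction = dimension_reduction_rate([0, 3], n_features=8)
    accuracy = classification_accuracy([1, 0, 1, 1], [1, 0, 0, 1])
    print(combined_fitness(accuracy, reduction, weight=0.5))
```

Under this sketch, an individual that keeps accuracy constant while using fewer features receives a higher fitness, which is one way to reward both objectives during evolution instead of optimizing accuracy alone.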

Key words: feature selection, function discovery, Gene Expression Programming (GEP), population initialization, layer neighborhood
