
Data feature selection classification based on improved LDGEP

湛航1,何朗1,黄樟灿2,李华峰2,张蔷1,谈庆2

  1. School of Science, Wuhan University of Technology
  2. Wuhan University of Technology
  • Received: 2020-11-17 Revised: 2021-03-09 Online: 2021-03-09
  • Corresponding author: 何朗

Abstract: General feature selection algorithms only identify important features and do not reveal an interpretable mapping between features and data categories. To address this shortcoming, an improved layer distance gene expression programming (LDGEP) algorithm for feature selection and classification is proposed. First, during population initialization, a defined selection probability guides the head of each individual toward function symbols, which increases the number of valid individuals in the initial population. Second, a layer neighborhood is defined for each individual so that mutation selects symbols from within that neighborhood, which addresses the blind, undirected nature of layer mutation. Finally, the dimensionality reduction rate and the classification accuracy of an individual are combined into its fitness value, which replaces the single-objective evolution of the population and balances the two criteria. In 5-fold or 10-fold cross-validation on 7 data sets, the proposed algorithm yields concise and clear functional mappings between data features and their categories. When the resulting mapping functions are used for classification, compared with other algorithms the dimensionality reduction rate improves by 9.47%, 1.34%, 0.17%, and 1.34% on the Hepatitis, WPBC, Sonar, and WDBC data sets, respectively, and the average classification accuracy improves by 1.04%, 1.04%, 2.31%, 2.89%, 1.7%, and 1.41% on the Hepatitis, Ionosphere, Musk1, WPBC, Heart-Statlog, and WDBC data sets, respectively. The experimental results verify the feasibility, effectiveness, and superiority of the proposed algorithm for feature selection and classification.

Key words: feature selection, data classification, function discovery, population initialization, layer neighborhood

CLC number:
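
The abstract names two mechanisms that are easy to illustrate: a head initialization biased toward function symbols, and a fitness value that combines classification accuracy with the dimensionality reduction rate. The following Python sketch shows one way these could look. It is not the authors' implementation: the function set, the probability `p_func`, the weight `alpha`, the weighted-sum form of the fitness, and the definition of the reduction rate as the fraction of unused features are all assumptions made for illustration, since the abstract does not give them. The layer-neighborhood mutation is not sketched because its definition is not stated here.

```python
import random

# Hypothetical function set; every symbol is assumed to take two arguments.
FUNCTIONS = ["+", "-", "*", "/"]
MAX_ARITY = 2


def init_gene(n_features, head_len, p_func=0.7, rng=random):
    """Build one GEP gene as a list of symbols.

    Each head position holds a function symbol with probability p_func
    (an assumed value) and a terminal otherwise. The tail holds terminals
    only, with the standard GEP length t = h * (max_arity - 1) + 1.
    """
    terminals = [f"x{i}" for i in range(n_features)]
    tail_len = head_len * (MAX_ARITY - 1) + 1
    head = [
        rng.choice(FUNCTIONS) if rng.random() < p_func else rng.choice(terminals)
        for _ in range(head_len)
    ]
    tail = [rng.choice(terminals) for _ in range(tail_len)]
    return head + tail


def reduction_rate(used_features, n_features):
    """Fraction of the original features that the expression does not use."""
    return 1.0 - len(set(used_features)) / n_features


def fitness(accuracy, used_features, n_features, alpha=0.9):
    """Weighted sum of accuracy and reduction rate; alpha is an assumed weight."""
    return alpha * accuracy + (1.0 - alpha) * reduction_rate(used_features, n_features)
```

With `alpha` close to 1 the search favors accuracy, and lowering it rewards expressions that use fewer features. How the paper actually balances the two terms would need to be taken from its full text.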