Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (12): 3478-3484. DOI: 10.11772/j.issn.1001-9081.2020060871

• 2020 China Conference on Granular Computing and Knowledge Discovery (CGCKD 2020) •

Multi-level feature selection algorithm based on mutual information

YONG Juya1,2, ZHOU Zhongmei1,2   

  1. School of Computer Science, Minnan Normal University, Zhangzhou Fujian 363000, China;
    2. Key Laboratory of Data Science and Intelligence Application, Fujian Province University, Zhangzhou Fujian 363000, China
  • Received: 2020-06-12; Revised: 2020-08-20; Online: 2020-12-10; Published: 2020-10-20
  • Supported by:
    This work is partially supported by the Natural Science Foundation of Fujian Province (2018J01545).

  • Corresponding author: ZHOU Zhongmei, born in 1965 in Zhangzhou, Fujian, Ph. D., professor, CCF member; main research interests: data mining, machine learning. 64523040@qq.com
  • About the first author: YONG Juya, born in 1994 in Yangzhou, Jiangsu, M. S. candidate, CCF member; main research interest: data mining.

Abstract: In feature selection, two problems arise: removing redundancy becomes very complicated when many features have been selected, and some features show strong correlation with the label only when combined with other features. To address these problems, a Multi-Level Feature Selection algorithm based on Mutual Information (MI_MLFS) was proposed. Firstly, the features were divided into strongly correlated features, sub-strongly correlated features and other features according to the degree of correlation between each feature and the label. Secondly, after the strongly correlated features were selected, the features with low redundancy were selected from the sub-strongly correlated features. Finally, the features that enhanced the correlation between the selected feature subset and the label were selected. MI_MLFS was compared with ReliefF, the minimal-Redundancy-Maximal-Relevance criterion (mRMR), Joint Mutual Information (JMI), the Conditional Mutual Information Maximization criterion (CMIM) and Double Input Symmetrical Relevance (DISR) on 15 datasets. The results show that MI_MLFS achieves the highest classification accuracy on 13 datasets with the Support Vector Machine (SVM) classifier and on 11 datasets with the Classification And Regression Tree (CART) classifier. MI_MLFS therefore has better classification performance than many classical feature selection algorithms.
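The three-level procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the threshold values, the function names, and the approximation of the subset-label correlation by the mutual information between the tuple of selected feature values and the label are all assumptions made for the example.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # MI = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) )
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def mi_mlfs(features, label, strong_t=0.5, sub_t=0.2, redundancy_t=0.3):
    """Hypothetical sketch of the multi-level selection in the abstract.
    `features` maps a feature name to its list of discrete values; the
    thresholds strong_t / sub_t / redundancy_t are illustrative, not
    values taken from the paper."""
    relevance = {f: mutual_information(v, label) for f, v in features.items()}

    # Level 1: keep features strongly correlated with the label.
    selected = [f for f, r in relevance.items() if r >= strong_t]

    # Level 2: from the sub-strongly correlated features, keep those with
    # low redundancy against every already-selected feature.
    sub_strong = [f for f, r in relevance.items() if sub_t <= r < strong_t]
    for f in sub_strong:
        if all(mutual_information(features[f], features[s]) < redundancy_t
               for s in selected):
            selected.append(f)

    # Level 3: from the remaining features, keep those that increase the
    # correlation between the selected subset and the label (the subset is
    # treated as one compound discrete variable via per-sample tuples).
    def subset_mi(feats):
        combined = list(zip(*(features[f] for f in feats)))
        return mutual_information(combined, label)

    rest = [f for f, r in relevance.items() if r < sub_t]
    base = subset_mi(selected) if selected else 0.0
    for f in rest:
        if subset_mi(selected + [f]) > base:
            selected.append(f)
            base = subset_mi(selected)
    return selected
```

For example, a feature identical to the label is selected at level 1, while a feature that is a noisy copy of an already-selected feature is rejected at level 2 for redundancy.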

Key words: feature selection, mutual information, multi-level, correlation, redundancy, classification accuracy


CLC Number: