Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (3): 734-741. DOI: 10.11772/j.issn.1001-9081.2018081694

• Data Science and Technology •

Feature selection based on maximum joint conditional mutual information

MAO Yingchi1,2, CAO Hai1, PING Ping1, LI Xiaofang2,3

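  1. College of Computer and Information, Hohai University, Nanjing 211100;
    2. College of Computer and Information Engineering, Changzhou Institute of Technology, Changzhou, Jiangsu 213022;
    3. Jiangsu Collaborative Innovation Center for Cultural Creativity, Nanjing 210000
  • Received: 2018-08-15 Revised: 2018-09-25 Online: 2019-03-10 Published: 2019-03-11
  • About the authors: MAO Yingchi (1976-), female, born in Shanghai, Ph.D., professor, CCF member; research interests: distributed computing, parallel processing, distributed data management. CAO Hai (1991-), male, born in Xinyang, Henan, M.S. candidate; research interests: distributed computing, data mining. PING Ping (1982-), female, born in Wujiang, Jiangsu, Ph.D., associate professor, CCF member; research interests: information coding, digital image processing. LI Xiaofang (1971-), female, born in Chifeng, Inner Mongolia, Ph.D., associate professor, CCF member; research interests: distributed data management.
  • Funding:
    National Key Research and Development Program of China during the 13th Five-Year Plan (2018YFC0407105); Key Research and Development Project of China Huaneng Group (HNKJ17-21); Fundamental Research Funds for the Central Universities (2017B16814, 2017B20914).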

Feature selection based on maximum conditional and joint mutual information

MAO Yingchi1,2, CAO Hai1, PING Ping1, LI Xiaofang2,3   

  1. College of Computer and Information, Hohai University, Nanjing Jiangsu 211100, China;
    2. College of Computer and Information Engineering, Changzhou Institute of Technology, Changzhou Jiangsu 213022, China;
    3. Jiangsu Collaborative Innovation Center for Cultural Creativity, Nanjing Jiangsu 210000, China
  • Received:2018-08-15 Revised:2018-09-25 Online:2019-03-10 Published:2019-03-11
  • Contact: MAO Yingchi
  • Supported by:
    This work is partially supported by the National Key Technology Research and Development Program of the Ministry of Science and Technology of China (2018YFC0407905), the Key Technology Project of China Huaneng Group (HNKJ17-21), and the Fundamental Research Funds for the Central Universities (2017B16814, 2017B20914).

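Abstract: When analyzing high-dimensional data such as image, gene, and text data, redundant features in the samples greatly increase the complexity of the analysis, so removing redundant features before analysis is particularly important. Feature selection methods based on Mutual Information (MI) can effectively reduce data dimensionality and improve the accuracy of analysis results. However, existing methods judge whether a feature is redundant by a single criterion, so they cannot reasonably exclude redundant features, which ultimately degrades the analysis results. To address this, a feature selection method based on Maximum Conditional and Joint Mutual Information (MCJMI) is proposed. When selecting features, MCJMI considers both overall joint mutual information and conditional mutual information, and combining the two factors strengthens the constraint on feature selection. In terms of average prediction accuracy, MCJMI improves by 6 percentage points over Information Gain (IG) and minimum Redundancy Maximum Relevance (mRMR) feature selection, by 2 percentage points over Joint Mutual Information (JMI) and Joint Mutual Information Maximisation (JMIM), and by 1 percentage point over the LW-index-based sequential forward search method (SFS-LW). In terms of stability, MCJMI reaches 0.92, which is better than JMI, JMIM, and SFS-LW. The experimental results show that MCJMI can effectively improve the accuracy and stability of feature selection.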

Keywords: information entropy, mutual information, conditional mutual information, joint mutual information, feature selection

Abstract: In the analysis of high-dimensional data such as image data, genetic data and text data, redundant features in the samples greatly increase the complexity of the problem, so it is important to remove redundant features before data analysis. Feature selection based on Mutual Information (MI) can reduce the data dimension and improve the accuracy of the analysis results, but existing feature selection methods cannot reasonably eliminate redundant features because they judge redundancy by a single criterion. To solve this problem, a feature selection method based on Maximum Conditional and Joint Mutual Information (MCJMI) was proposed. Joint mutual information and conditional mutual information were both considered when selecting features with MCJMI, which strengthens the feature selection constraint. Experimental results show that the average prediction accuracy is improved by 6 percentage points compared with Information Gain (IG) and minimum Redundancy Maximum Relevance (mRMR) feature selection; by 2 percentage points compared with Joint Mutual Information (JMI) and Joint Mutual Information Maximisation (JMIM); and by 1 percentage point compared with LW index with Sequence Forward Search algorithm (SFS-LW). The stability of MCJMI reaches 0.92, which is better than that of JMI, JMIM and SFS-LW. In summary, the proposed method can effectively improve the accuracy and stability of feature selection.

Key words: information entropy, Mutual Information (MI), conditional mutual information, joint mutual information, feature selection
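
The abstract names three information-theoretic quantities (mutual information, joint mutual information and conditional mutual information) but does not state the MCJMI scoring criterion itself. The sketch below therefore only illustrates how those three quantities can be estimated for discrete features from sample counts; it is not the authors' method. The function names and the XOR toy data are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Plug-in Shannon entropy (bits) of the joint distribution of discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def joint_mutual_information(x, z, y):
    """I(X,Z;Y) = H(X,Z) + H(Y) - H(X,Z,Y)."""
    return entropy(x, z) + entropy(y) - entropy(x, z, y)


def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


# Hypothetical toy data: the label is the XOR of two binary features, so
# neither feature is informative alone, but the pair determines the label.
f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]
label = [a ^ b for a, b in zip(f1, f2)]

print(mutual_information(f1, label))                  # relevance of f1 alone
print(joint_mutual_information(f1, f2, label))        # relevance of the pair
print(conditional_mutual_information(f1, label, f2))  # f1 given f2 is known
```

On this toy data a criterion based on plain MI alone would score `f1` as uninformative, while the joint and conditional quantities reveal that it becomes informative once `f2` is selected. This is the kind of interaction a criterion combining both quantities is positioned to capture.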
