Feature selection based on maximum conditional and joint mutual information

Journal of Computer Applications, 2019, Vol. 39, Issue (3): 734-741. DOI: 10.11772/j.issn.1001-9081.2018081694

MAO Yingchi^1,2, CAO Hai^1, PING Ping^1, LI Xiaofang^2,3

1. College of Computer and Information, Hohai University, Nanjing Jiangsu 211100, China;
2. College of Computer and Information Engineering, Changzhou Institute of Technology, Changzhou Jiangsu 213022, China;
3. Jiangsu Collaborative Innovation Center for Cultural Creativity, Nanjing Jiangsu 210000, China

Received: 2018-08-15 Revised: 2018-09-25 Online: 2019-03-10 Published: 2019-03-11
Contact: MAO Yingchi
About the authors: MAO Yingchi (1976-), female, born in Shanghai, Ph.D., professor, CCF member. Her research interests include distributed computing, parallel processing and distributed data management. CAO Hai (1991-), male, born in Xinyang, Henan, M.S. candidate. His research interests include distributed computing and data mining. PING Ping (1982-), female, born in Wujiang, Jiangsu, Ph.D., associate professor, CCF member. Her research interests include information coding and digital image processing. LI Xiaofang (1971-), female, born in Chifeng, Inner Mongolia, Ph.D., associate professor, CCF member. Her research interests include distributed data management.
Supported by:
This work is partially supported by the National Key Research and Development Program of China during the 13th Five-Year Plan period (2018YFC0407105), the Key Research and Development Project of China Huaneng Group (HNKJ17-21), and the Fundamental Research Funds for the Central Universities (2017B16814, 2017B20914).

Abstract

In the analysis of high-dimensional data such as image, gene and text data, redundant features in the samples greatly increase the complexity of the analysis, so removing redundant features before data analysis is particularly important. Feature selection based on Mutual Information (MI) can effectively reduce the data dimension and improve the accuracy of the analysis results. However, existing methods rely on a single criterion to judge whether a feature is redundant, so they cannot reasonably eliminate redundant features, which ultimately degrades the analysis results. To solve this problem, a feature selection method based on Maximum Conditional and Joint Mutual Information (MCJMI) was proposed. MCJMI considers both the overall joint mutual information and the conditional mutual information when selecting features, and combining the two factors strengthens the feature selection constraint. In terms of average prediction accuracy, MCJMI improves by 6 percentage points over Information Gain (IG) and minimum Redundancy Maximum Relevance (mRMR) feature selection, by 2 percentage points over Joint Mutual Information (JMI) and Joint Mutual Information Maximisation (JMIM), and by 1 percentage point over the LW index with Sequence Forward Search algorithm (SFS-LW). In terms of stability, MCJMI reaches 0.92, which is better than JMI, JMIM and SFS-LW. The experimental results show that MCJMI can effectively improve the accuracy and stability of feature selection.

Key words: information entropy, Mutual Information (MI), conditional mutual information, joint mutual information, feature selection
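The three quantities named in the abstract (mutual information, joint mutual information and conditional mutual information) can be estimated from discrete samples with plain entropy counts. The sketch below is a minimal illustration under stated assumptions: all features are already discretized, the function names are hypothetical, and the combined score in `greedy_select` is an illustrative guess at how the two factors might be fused, not the published MCJMI criterion.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Empirical joint Shannon entropy (in bits) of one or more discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())


def mutual_info(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def joint_mutual_info(x, z, y):
    """I(X,Z;Y) = H(X,Z) + H(Y) - H(X,Z,Y)."""
    return entropy(x, z) + entropy(y) - entropy(x, z, y)


def conditional_mutual_info(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


def greedy_select(features, target, k):
    """Forward selection with a hypothetical combined score.

    ASSUMPTION: the score (worst-case joint MI plus worst-case conditional MI
    over the already selected features) is for illustration only and is not
    taken from the paper.
    """
    remaining = dict(features)  # copy so the caller's dict is not mutated
    # Start from the single most relevant feature (ties go to the first one).
    first = max(remaining, key=lambda name: mutual_info(remaining[name], target))
    selected = [first]
    del remaining[first]
    while remaining and len(selected) < k:
        def score(name):
            cand = remaining[name]
            joint = min(joint_mutual_info(cand, features[s], target) for s in selected)
            cond = min(conditional_mutual_info(cand, target, features[s]) for s in selected)
            return joint + cond
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Toy data: the label is the XOR of a and b; a_copy is a redundant copy of a.
a = [0, 0, 1, 1, 0, 0, 1, 1]
b = [0, 1, 0, 1, 0, 1, 0, 1]
label = [ai ^ bi for ai, bi in zip(a, b)]
features = {"a": a, "a_copy": list(a), "b": b}
chosen = greedy_select(features, label, k=2)
```

On this toy data every single feature has zero mutual information with the label, so a relevance-only ranking cannot tell them apart. Once `a` is selected, `a_copy` adds nothing (joint and conditional terms are both 0), while `b` contributes 1 bit on each term, so `chosen` is `["a", "b"]`. Continuous features would first need discretization or a different estimator.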

CLC number: TP393.0

MAO Yingchi, CAO Hai, PING Ping, LI Xiaofang. Feature selection based on maximum conditional and joint mutual information[J]. Journal of Computer Applications, 2019, 39(3): 734-741.
