Journal of Computer Applications, 2014, Vol. 34, Issue (9): 2608-2611.

• Data Technology •

Feature selection method based on integration of mutual information and fuzzy C-means clustering

ZHU Jiewen, XIAO Jun

  1. Department of Computer Engineering, Jiangxi Polytechnic College, Pingxiang Jiangxi 337000, China
  • Received: 2014-03-07  Revised: 2014-05-13  Online: 2014-09-01  Published: 2014-09-30
  • Corresponding author: ZHU Jiewen
  • About the authors:
    ZHU Jiewen (1976-), male, born in Fuzhou, Jiangxi, associate professor, M. S.; research interests: intelligent information processing, databases;
    XIAO Jun (1975-), male, born in Taihe, Jiangxi, associate professor, M. S.; research interests: electronics and communication.

Abstract:

The presence of many redundant features in large datasets may degrade data classification performance. To address this problem, an automatic feature selection method named FCC-MI was proposed, based on the integration of Mutual Information (MI) and Fuzzy C-Means (FCM) clustering. Firstly, MI and its correlation function were analyzed, and the features were sorted by their correlation values. Secondly, the data was grouped according to the feature with the maximum correlation, and the optimal number of features was determined automatically by FCM clustering. Finally, feature selection was performed based on the correlation values. Experiments on seven datasets from the UCI machine learning repository compared FCC-MI with three methods from the literature: WCMFS (Within-class variance and Correlation Measure Feature Selection), B-AMBDMI (feature selection Based on Approximate Markov Blanket and Dynamic Mutual Information), and T-MI-GA (Two-stage feature selection based on MI and Genetic Algorithm). Theoretical analysis and experimental results show that FCC-MI not only improves the efficiency of data classification, but also maintains classification accuracy while automatically determining the optimal feature subset, reducing the number of features in the dataset. It is therefore suitable for feature reduction and analysis of massive data with highly correlated features.
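
As a rough illustration of the first stage described above (ranking features by their mutual-information relevance to the class label), the Python sketch below uses scikit-learn's mutual_info_classif on the Wine data bundled with scikit-learn. The estimator, the dataset, and the variable names are stand-ins chosen for the example; the paper's exact correlation function is not reproduced here.

```python
# Illustrative sketch of the MI-based feature ranking stage; not the authors' code.
import numpy as np
from sklearn.datasets import load_wine                 # a UCI-style dataset shipped with scikit-learn
from sklearn.feature_selection import mutual_info_classif

X, y = load_wine(return_X_y=True)

# Estimate the relevance (mutual information) between each feature and the class label.
relevance = mutual_info_classif(X, y, random_state=0)

# Sort features from most to least relevant, as in the first stage of FCC-MI.
order = np.argsort(relevance)[::-1]
print("feature ranking (most relevant first):", order)
print("sorted relevance values:", np.round(relevance[order], 3))
```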

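The second stage, in which FCM clustering automatically determines how many features to keep, is not fully specified in the abstract. The sketch below shows one plausible reading, assuming that a two-cluster fuzzy partition of the relevance scores can separate "strong" from "weak" features; the minimal fuzzy_c_means routine and the cut-off rule are illustrative assumptions, not the authors' FCC-MI procedure.

```python
# Illustrative sketch only: FCM used to pick a cut-off in the MI relevance scores.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import mutual_info_classif


def fuzzy_c_means(X, c=2, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy C-means over the rows of X; returns (centres, membership matrix)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)                  # memberships of each point sum to 1
    for _ in range(max_iter):
        Um = U ** m
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]             # membership-weighted centres
        d = np.fmax(np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2), 1e-12)
        inv = d ** (-2.0 / (m - 1.0))                  # standard FCM membership update
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            return centres, U_new
        U = U_new
    return centres, U


X, y = load_wine(return_X_y=True)
relevance = mutual_info_classif(X, y, random_state=0)

# Cluster the scores into two fuzzy groups and keep the group with the higher centre,
# so the number of selected features falls out of the clustering rather than a manual threshold.
centres, U = fuzzy_c_means(relevance.reshape(-1, 1))
strong = int(np.argmax(centres[:, 0]))
selected = np.where(U.argmax(axis=1) == strong)[0]
print("automatically selected feature indices:", selected)
```

According to the abstract, FCC-MI actually groups the data by the feature with the maximum correlation before applying FCM; the sketch only demonstrates how a fuzzy partition can yield a cut-off without a user-specified threshold.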
CLC Number: