Journal of Computer Applications, 2014, Vol. 34, Issue (9): 2608-2611.

• Data Technology •

Feature selection method based on integration of mutual information and fuzzy C-means clustering

ZHU Jiewen, XIAO Jun

  1. Department of Computer Engineering, Jiangxi Polytechnic College, Pingxiang Jiangxi 337000, China
  • Received: 2014-03-07  Revised: 2014-05-13  Online: 2014-09-01  Published: 2014-09-30
  • Corresponding author: ZHU Jiewen
  • About the authors:
    ZHU Jiewen (1976-), male, born in Fuzhou, Jiangxi, associate professor, M. S.; research interests: intelligent information processing, databases;
    XIAO Jun (1975-), male, born in Taihe, Jiangxi, associate professor, M. S.; research interests: electronics and communication.

Abstract:

The presence of many redundant features in large datasets may degrade data classification performance. To address this problem, an automatic feature selection method named FCC-MI was proposed, based on the integration of Mutual Information (MI) and Fuzzy C-Means (FCM) clustering. Firstly, MI and its correlation function were analyzed, and the features were sorted by their correlation values. Secondly, the data was grouped according to the feature with the maximum correlation, and the optimal number of features was determined automatically by FCM clustering. Finally, feature selection was performed based on the correlation values. Experiments on seven datasets from the UCI machine learning repository compared FCC-MI with three methods from the literature: WCMFS (Within-class variance and Correlation Measure Feature Selection), B-AMBDMI (feature selection Based on Approximate Markov Blanket and Dynamic Mutual Information), and T-MI-GA (Two-stage feature selection based on MI and Genetic Algorithm). Theoretical analysis and experimental results show that FCC-MI not only improves the efficiency of data classification, but also maintains classification accuracy while automatically determining the optimal feature subset, reducing the number of features in the dataset. It is therefore suitable for feature reduction and analysis of massive data with highly correlated features.
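
As a rough illustration of the first stage described above (ranking features by their mutual-information relevance to the class label), the Python sketch below uses scikit-learn's mutual_info_classif on the Wine data bundled with scikit-learn. The estimator, the dataset, and the variable names are stand-ins chosen for the example; the paper's exact correlation function is not reproduced here.

```python
# Illustrative sketch of the MI-based feature ranking stage; not the authors' code.
import numpy as np
from sklearn.datasets import load_wine                 # a UCI-style dataset shipped with scikit-learn
from sklearn.feature_selection import mutual_info_classif

X, y = load_wine(return_X_y=True)

# Estimate the relevance (mutual information) between each feature and the class label.
relevance = mutual_info_classif(X, y, random_state=0)

# Sort features from most to least relevant, as in the first stage of FCC-MI.
order = np.argsort(relevance)[::-1]
print("feature ranking (most relevant first):", order)
print("sorted relevance values:", np.round(relevance[order], 3))
```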

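The second stage, in which FCM clustering automatically determines how many features to keep, is not fully specified in the abstract. The sketch below shows one plausible reading, assuming that a two-cluster fuzzy partition of the relevance scores can separate "strong" from "weak" features; the minimal fuzzy_c_means routine and the cut-off rule are illustrative assumptions, not the authors' FCC-MI procedure.

```python
# Illustrative sketch only: FCM used to pick a cut-off in the MI relevance scores.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import mutual_info_classif


def fuzzy_c_means(X, c=2, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy C-means over the rows of X; returns (centres, membership matrix)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)                  # memberships of each point sum to 1
    for _ in range(max_iter):
        Um = U ** m
        centres = (Um.T @ X) / Um.sum(axis=0)[:, None]             # membership-weighted centres
        d = np.fmax(np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2), 1e-12)
        inv = d ** (-2.0 / (m - 1.0))                  # standard FCM membership update
        U_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            return centres, U_new
        U = U_new
    return centres, U


X, y = load_wine(return_X_y=True)
relevance = mutual_info_classif(X, y, random_state=0)

# Cluster the scores into two fuzzy groups and keep the group with the higher centre,
# so the number of selected features falls out of the clustering rather than a manual threshold.
centres, U = fuzzy_c_means(relevance.reshape(-1, 1))
strong = int(np.argmax(centres[:, 0]))
selected = np.where(U.argmax(axis=1) == strong)[0]
print("automatically selected feature indices:", selected)
```

According to the abstract, FCC-MI actually groups the data by the feature with the maximum correlation before applying FCM; the sketch only demonstrates how a fuzzy partition can yield a cut-off without a user-specified threshold.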
CLC Number: