Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (3): 767-771. DOI: 10.11772/j.issn.1001-9081.2023030365

• Data Science and Technology •

最大相关和最大差异的高维数据特征选择算法

孟圣洁, 于万钧, 陈颖

  1. 上海应用技术大学 计算机科学与信息工程学院,上海 201418
  • 收稿日期:2023-04-04 修回日期:2023-05-16 接受日期:2023-05-19 发布日期:2023-06-05 出版日期:2024-03-10
  • 通讯作者: 孟圣洁
  • 作者简介:于万钧(1966—),男,吉林德惠人,教授,博士,CCF会员,主要研究方向:人工智能、大数据
    陈颖(1974—),女,重庆人,副教授,博士,主要研究方向:图像处理、生物特征识别。
  • 基金资助:
    国家自然科学基金资助项目(61976140)

Feature selection algorithm for high-dimensional data with maximum correlation and maximum difference

Shengjie MENG, Wanjun YU, Ying CHEN

  1. School of Computer Science & Information Engineering, Shanghai Institute of Technology, Shanghai 201418, China
  • Received: 2023-04-04 Revised: 2023-05-16 Accepted: 2023-05-19 Online: 2023-06-05 Published: 2024-03-10
  • Contact: Shengjie MENG
  • About author: YU Wanjun, born in 1966, Ph. D., professor, CCF member. His research interests include artificial intelligence and big data.
    CHEN Ying, born in 1974, Ph. D., associate professor. Her research interests include image processing and biometrics.
  • Supported by:
    National Natural Science Foundation of China (61976140)

摘要:

针对高维数据存在冗余信息且维度过高的问题,提出基于信息量的最大相关最大差异特征选择算法(MCD)。首先,利用互信息(MI)度量特征和标签之间的相关性,对特征进行排序,选择互信息最大的特征加入特征子集;然后,引入信息距离度量特征之间的信息冗余性及差异性,设计评价准则对每个特征进行评价,使特征子集中特征和标签的相关性、特征之间的差异性最大;最后,用前向搜索策略结合评价准则进行属性约简,最优化特征子集。采用2种不同的分类器,在6个数据集上和mRMR(minimal-Redundancy-Maximal-Relevance criterion)、RReliefF等5个经典算法进行对比实验,利用分类精度验证MCD的有效性。在支持向量机(SVM)分类器下,平均分类精度提高了5.67~23.80个百分点;在K-近邻(KNN)分类器下,平均分类精度提高了2.69~25.18个百分点。可见,MCD在绝大多数情况下,能有效去除冗余特征,分类精度有明显提高。

关键词: 特征选择, 高维数据, 特征冗余, 相关性, 分类准确率, 降维

Abstract:

To address the redundant information and excessive dimensionality of high-dimensional data, a Maximum Correlation maximum Difference feature selection algorithm (MCD) based on information quantity was proposed. Firstly, Mutual Information (MI) was used to measure the correlation between each feature and the labels; the features were ranked accordingly, and the feature with the largest MI was added to the feature subset. Then, information distance was introduced to measure the information redundancy and difference between features, and an evaluation criterion was designed to score each feature so that both the correlation between the features in the subset and the labels and the difference among the features were maximized. Finally, a forward search strategy was combined with the evaluation criterion to perform attribute reduction and optimize the feature subset. Comparative experiments against 5 classical algorithms, including mRMR (minimal-Redundancy-Maximal-Relevance criterion) and RReliefF, were carried out on 6 datasets with 2 different classifiers, and classification accuracy was used to verify the effectiveness of MCD. With the Support Vector Machine (SVM) classifier, the average classification accuracy increased by 5.67 to 23.80 percentage points; with the K-Nearest Neighbor (KNN) classifier, it increased by 2.69 to 25.18 percentage points. These results indicate that, in the vast majority of cases, MCD effectively removes redundant features and clearly improves classification accuracy.

Key words: feature selection, high-dimensional data, feature redundancy, correlation, classification accuracy, dimensionality reduction
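
The three steps described in the abstract (MI ranking, an information-distance term, and forward search) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not give the evaluation criterion, so the additive score `relevance + mean distance` is an assumption, as is the use of the normalized variation of information as the "information distance". The function names (`entropy`, `mutual_info`, `info_distance`, `select_features`) are hypothetical, and features are assumed to be discrete (already discretized).

```python
import numpy as np


def entropy(*cols):
    """Shannon entropy (in bits) of the joint distribution of discrete columns."""
    joint = np.stack(cols, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def mutual_info(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def info_distance(x, y):
    """Normalized variation of information in [0, 1] (assumed distance).

    0 means the two variables carry the same information (fully redundant);
    1 means they share no information.
    """
    h_xy = entropy(x, y)
    if h_xy == 0.0:
        return 0.0
    return (2.0 * h_xy - entropy(x) - entropy(y)) / h_xy


def select_features(X, y, k):
    """Greedy forward selection of up to k feature indices.

    Step 1: seed the subset with the feature of largest MI with the labels.
    Step 2: repeatedly add the feature maximizing the ASSUMED criterion
            relevance(f) + mean distance from f to the selected features.
    """
    n_features = X.shape[1]
    relevance = [mutual_info(X[:, j], y) for j in range(n_features)]
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(k, n_features):
        best_j, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            difference = np.mean(
                [info_distance(X[:, j], X[:, s]) for s in selected]
            )
            score = relevance[j] + difference  # assumed equal weighting
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected
```

With equal weighting, a feature that merely duplicates an already selected one gains nothing from the difference term, so a less relevant but non-redundant feature can be preferred over it. Continuous features would need discretization or a different MI estimator before this sketch applies.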

CLC Number: