Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (7): 1896-1900. DOI: 10.11772/j.issn.1001-9081.2019122075

• Artificial Intelligence •

Unsupervised feature selection method based on regularized mutual representation

WANG Zhiyuan, JIANG Ailian, MUHAMMAD Osman

  1. College of Information and Computer, Taiyuan University of Technology, Jinzhong Shanxi 030600, China
  • Received: 2019-12-09 Revised: 2020-02-24 Online: 2020-07-10 Published: 2020-03-26
  • Corresponding author: JIANG Ailian
  • About the authors: WANG Zhiyuan (1992-), male, born in Suzhou, Anhui, M. S. candidate, research interests: machine learning and feature selection; JIANG Ailian (1969-), female, born in Taiyuan, Shanxi, Ph. D., associate professor, CCF member, research interests: artificial intelligence, big data, feature selection and computer vision; MUHAMMAD Osman (1993-), male, born in Ethiopia, M. S. candidate, research interests: deep learning and image processing.
  • Supported by:
    This work is partially supported by the Research Project of Shanxi Scholarship Council of China (2017-051).


Abstract: Redundant features in high-dimensional data degrade the training efficiency and generalization ability of machine learning models. To improve pattern recognition accuracy and reduce computational complexity, an unsupervised feature selection method based on the Regularized Mutual Representation (RMR) property was proposed. Firstly, the correlations between features were exploited to build a Frobenius-norm-constrained mathematical model for unsupervised feature selection. Then, a divide-and-conquer ridge regression algorithm was designed to optimize the model efficiently. Finally, the importance of each feature was jointly evaluated from the optimal solution of the model, and a representative feature subset was selected from the original data. In clustering accuracy, the RMR method outperforms the Laplacian method by 7 percentage points, the Nonnegative Discriminative Feature Selection (NDFS) method by 7 percentage points, the Regularized Self-Representation (RSR) method by 6 percentage points, and the Self-Representation Feature Selection (SR_FS) method by 3 percentage points. In redundancy rate, the RMR method is 10 percentage points lower than the Laplacian method, 7 percentage points lower than NDFS, 3 percentage points lower than RSR, and 2 percentage points lower than SR_FS. The experimental results show that the RMR method can effectively select important features, reduce the redundancy rate of the data, and improve the clustering accuracy of samples.

Key words: feature selection, unsupervised learning, divide-and-conquer algorithm, ridge regression, regularization
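The pipeline the abstract describes — regress each feature on the remaining features with ridge regression (one independent closed-form subproblem per feature, which is what makes a divide-and-conquer scheme possible), then rank features by how strongly they participate in reconstructing the others — can be sketched as follows. This is an illustrative reconstruction, not the authors' published algorithm: the regularization weight `lam`, the per-feature closed-form solver, and the row-norm scoring rule are assumptions made for the sketch.

```python
import numpy as np

def rmr_feature_selection(X, n_selected, lam=1.0):
    """Illustrative mutual-representation feature selection.

    Each feature (column of X) is regressed on all remaining features
    with ridge regression. A feature that helps reconstruct many other
    features accumulates large coefficients, so its row norm in the
    coefficient matrix W serves as an importance score.
    """
    n_samples, n_features = X.shape
    # W[i, j] = weight of feature i when representing feature j
    W = np.zeros((n_features, n_features))
    for j in range(n_features):
        others = [i for i in range(n_features) if i != j]
        A = X[:, others]        # predictors: every feature except j
        y = X[:, j]             # target: feature j itself
        # Ridge closed form: w = (A^T A + lam * I)^{-1} A^T y
        w = np.linalg.solve(A.T @ A + lam * np.eye(n_features - 1), A.T @ y)
        W[others, j] = w
    scores = np.linalg.norm(W, axis=1)          # row norms as importance
    return np.argsort(scores)[::-1][:n_selected]  # indices of top features
```

Because the subproblems are independent, the per-feature regressions could be solved in parallel or in blocks, which is presumably where the divide-and-conquer speedup in the paper comes from; here they are simply run in a loop for clarity.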

CLC number: