Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (10): 2856-2861. DOI: 10.11772/j.issn.1001-9081.2018020448

• Data Science and Technology •



  • Corresponding author: ZHENG Wei
  • About the authors: GONG Yonghong (born 1971), female, from Guilin, Guangxi; research interests: data mining, library and information science. ZHENG Wei (born 1989), male, from Yanji, Jilin, M. S. candidate; research interests: data mining, machine learning. WU Lin (born 1993), female, from Anqing, Anhui, M. S. candidate; research interests: data mining, machine learning. TAN Malong (born 1993), male, from Xiangyang, Hubei, M. S. candidate; research interests: data mining, machine learning. YU Hao (born 1994), male, from Shangrao, Jiangxi, M. S. candidate; research interests: data mining, machine learning.

Unsupervised feature selection algorithm based on self-paced learning

GONG Yonghong1, ZHENG Wei2, WU Lin2, TAN Malong2, YU Hao2   

  1. Library, Guilin University of Aerospace Technology, Guilin Guangxi 541004, China;
    2. Guangxi Key Laboratory of Multi-source Information Mining and Security, Guangxi Normal University, Guilin Guangxi 541004, China
  • Received:2018-03-17 Revised:2018-04-24 Online:2018-10-10 Published:2018-10-13
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61573270), the Natural Science Foundation of Guangxi Province (2015GXNSFCB139011), the Guangxi Graduate Education Innovation Project (YCSW2018093).



Abstract: To address the problem that conventional feature selection algorithms treat all samples equally and ignore the differences among them, so that the learned model cannot avoid the influence of noisy samples, an Unsupervised Feature Selection algorithm based on Self-Paced Learning (UFS-SPL) was proposed. Firstly, a subset of important samples was selected automatically to train a robust initial feature selection model; then less important samples were gradually introduced to improve the model's generalization ability, until a feature selection model that is both robust and generalizable was obtained or all samples had been selected. Compared with Convex Semi-supervised multi-label Feature Selection (CSFS), Regularized Self-Representation (RSR) and the Coupled Dictionary Learning method for unsupervised Feature Selection (CDLFS) on real data sets, UFS-SPL increased clustering accuracy, normalized mutual information and purity by 12.06%, 10.54% and 10.5% on average, respectively. The experimental results show that UFS-SPL can effectively reduce the influence of irrelevant information in data sets.
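The self-paced training loop described in the abstract (fit a model on the easiest, most important samples first, then gradually admit harder ones until all samples are included) can be sketched as follows. This is a minimal illustration only, using a ridge-regularized self-representation model (X ≈ XW, with feature importance taken from the row norms of W) and a quantile-based pacing schedule; the function name and hyperparameters (`frac0`, `step`, `ridge`) are assumptions for illustration, not the paper's actual objective or notation.

```python
import numpy as np

def spl_feature_selection(X, frac0=0.5, step=0.125, ridge=0.1):
    """Self-paced unsupervised feature selection sketch (assumed setup,
    not the paper's exact UFS-SPL objective)."""
    n, d = X.shape

    def fit(Xs):
        # Ridge-regularized self-representation coefficients:
        # minimize ||Xs - Xs @ W||^2 + ridge * ||W||^2 in closed form.
        return np.linalg.solve(Xs.T @ Xs + ridge * np.eye(d), Xs.T @ Xs)

    W = fit(X)          # rough initial model trained on all samples
    frac = frac0        # fraction of samples currently admitted
    while True:
        # Per-sample reconstruction loss measures sample "hardness".
        loss = np.sum((X - X @ W) ** 2, axis=1)
        k = max(1, int(round(frac * n)))
        easiest = np.argsort(loss)[:k]   # self-paced sample selection
        W = fit(X[easiest])              # refit on the easy subset
        if frac >= 1.0:                  # all samples admitted: done
            break
        frac = min(1.0, frac + step)     # admit harder samples next round

    # Features whose rows of W have large norm contribute strongly to
    # reconstructing the data and are scored as important.
    scores = np.linalg.norm(W, axis=1)
    return W, scores
```

A caller would rank features by `scores` (descending) and keep the top ones; the quantile schedule here stands in for the λ-threshold schedule commonly used in self-paced learning.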

Key words: unsupervised learning, feature selection, self-paced learning, self-representation, sparse learning
