Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (12): 3766-3775. DOI: 10.11772/j.issn.1001-9081.2023121783

• Artificial intelligence •

Unsupervised feature selection model with dictionary learning and sample correlation preservation

Jingxin LIU1, Wenjing HUANG1, Liangsheng XU2, Chong HUANG3, Jiansheng WU1

  1. School of Mathematics and Computer Sciences, Nanchang University, Nanchang Jiangxi 330031, China
    2. Northern Lianchuang Communication Company Limited, Nanchang Jiangxi 330096, China
    3. Information Office, Nanchang University, Nanchang Jiangxi 330031, China
  • Received: 2023-12-27 Revised: 2024-02-06 Accepted: 2024-02-23 Online: 2024-03-11 Published: 2024-12-10
  • Contact: Chong HUANG
  • About author: LIU Jingxin, born in 1999, M. S. candidate. Her research interests include machine learning, data mining.
    HUANG Wenjing, born in 2003. Her research interests include machine learning, data mining.
    XU Liangsheng, born in 1987, M. S., engineer. His research interests include command and communication, machine learning, data analysis.
    WU Jiansheng, born in 1986, Ph. D., lecturer. His research interests include machine learning, data mining.
  • Supported by:
    National Natural Science Foundation of China (62066027); Natural Science Foundation of Jiangxi Province (20212BAB212011); Postgraduate Innovation Foundation of Jiangxi Province (YC2022-s160)

字典学习与样本关联保持结合的无监督特征选择模型

刘晶鑫1, 黄雯静1, 徐亮胜2, 黄冲3, 吴建生1

  1.南昌大学 数学与计算机学院,南昌 330031
    2.北方联创通信有限公司,南昌 330096
    3.南昌大学 信息化办公室,南昌 330031
  • 通讯作者: 黄冲
  • 作者简介:刘晶鑫(1999—),女,四川江油人,硕士研究生,主要研究方向:机器学习、数据挖掘
    黄雯静(2003—),女,福建莆田人,主要研究方向:机器学习、数据挖掘
    徐亮胜(1987—),男,江西上饶人,工程师,硕士,主要研究方向:指挥与通信、机器学习、数据分析
    吴建生(1986—),男,江西上饶人,讲师,博士,主要研究方向:机器学习、数据挖掘。
  • 基金资助:
    国家自然科学基金资助项目(62066027);江西省自然科学基金资助项目(20212BAB212011);江西省研究生创新基金资助项目(YC2022-s160)

Abstract:

Most unsupervised feature selection models based on dictionary learning do not fully exploit the intrinsic correlations among data, which reduces the accuracy of feature importance judgment. To address this issue, an unsupervised feature selection model with Dictionary Learning and Sample Correlation Preservation (DLSCP) was proposed. Firstly, the original data were encoded by learning dictionary atoms, and latent representations that characterize the data distribution were obtained in the dictionary space. Secondly, the intrinsic correlations among data were learned adaptively in the dictionary space to alleviate the influence of redundant and noisy features, thus obtaining an accurate local structure among data. Finally, the intrinsic correlations among data were used to measure the relevance and importance of data features. Experimental results on the TOX dataset show that, when selecting 50 features, DLSCP improves the Normalized Mutual Information (NMI) and clustering Accuracy (Acc) by 13.33 and 7.95 percentage points respectively compared to the nonnegative spectral analysis model NDFS (Nonnegative Discriminative Feature Selection), and by 15.74 and 7.31 percentage points respectively compared to LSEUFS (Latent Space Embedding for Unsupervised Feature Selection via joint dictionary learning), an unsupervised feature selection model with latent space embedding, which verifies the effectiveness of DLSCP.

Key words: unsupervised feature selection, dictionary learning, adaptive graph learning, sample correlation preservation, similarity matrix
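
The two evaluation metrics named in the abstract, NMI and Acc, can be illustrated with a minimal sketch. This is not the authors' evaluation code. It assumes NMI is normalized by the geometric mean of the two label entropies (other normalizations exist) and that Acc is the best match rate over one-to-one mappings from predicted clusters to true classes. The function names `entropy`, `nmi`, and `clustering_accuracy` are hypothetical, and the brute-force search over mappings is only practical for a small number of clusters.

```python
from collections import Counter
from itertools import permutations
from math import log, sqrt

# Sentinel used to pad the class list when there are more clusters than classes.
_UNMATCHED = object()


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(true_labels, pred_labels):
    """Mutual information divided by sqrt(H(true) * H(pred)).

    Assumption: geometric-mean normalization; returns 0.0 if either
    labeling has zero entropy.
    """
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    count_t = Counter(true_labels)
    count_p = Counter(pred_labels)
    mi = 0.0
    for (t, p), c in joint.items():
        mi += (c / n) * log(c * n / (count_t[t] * count_p[p]))
    denom = sqrt(entropy(true_labels) * entropy(pred_labels))
    return mi / denom if denom > 0 else 0.0


def clustering_accuracy(true_labels, pred_labels):
    """Best fraction of correctly mapped samples over one-to-one
    cluster-to-class mappings (brute force over permutations)."""
    classes = list(set(true_labels))
    clusters = list(set(pred_labels))
    # Pad so that every cluster can receive a distinct target.
    classes += [_UNMATCHED] * max(0, len(clusters) - len(classes))
    best = 0
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        best = max(best, hits)
    return best / len(true_labels)
```

Both functions return values in [0, 1]. An improvement reported in percentage points, as in the abstract, corresponds to the difference between two such scores multiplied by 100.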

摘要:

针对大多数基于字典学习的无监督特征选择模型没有充分挖掘数据间的本质关联,进而降低了特征重要性判断的准确性这一问题,提出一种字典学习与样本关联保持结合的无监督特征选择模型(DLSCP)。首先,从数据中学习字典基以完成对原始数据的编码,并在字典空间中获得能够反映数据分布的隐表示;其次,进一步在字典空间中自适应地学习数据间的本质关联,以消除冗余特征和噪声特征的影响,从而获得准确的数据间的局部几何结构;最后,利用数据间的本质关联评估数据特征的关联性和重要性。在TOX数据集上的实验结果表明,当选择50个特征时,DLSCP在归一化互信息(NMI)和聚类准确度(Acc)这2个评价指标上,相较于非负谱分析模型NDFS(Nonnegative Discriminative Feature Selection)分别提升了13.33和7.95个百分点,相较于隐空间嵌入无监督特征选择模型LSEUFS(Latent Space Embedding for Unsupervised Feature Selection via joint dictionary learning)分别提升了15.74和7.31个百分点,验证了DLSCP的有效性。

关键词: 无监督特征选择, 字典学习, 自适应图学习, 样本关联保持, 相似度矩阵

CLC Number: