Official website of Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (12): 3772-3778. DOI: 10.11772/j.issn.1001-9081.2022121838

• Data Science and Technology •

Missing value attention clustering algorithm based on latent factor model in subspace

Xiaofei WANG1,2, Shengli BAO1,2(), Jionghuan CHEN1,2

  1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu Sichuan 610041, China
    2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received:2022-12-12 Revised:2023-02-13 Accepted:2023-02-16 Online:2023-03-09 Published:2023-12-10
  • Contact: Shengli BAO
  • About author: WANG Xiaofei (1997—), male, born in Cili, Hunan, M.S. candidate. His research interests include machine learning and recommendation algorithms.
    BAO Shengli (1973—), male, born in Huangshan, Anhui, Ph.D., research fellow. His research interests include software engineering and big data. Email: baoshengli@casit.com.cn
    CHEN Jionghuan (1998—), male, born in Weifang, Shandong, M.S. candidate. His research interests include machine learning and big data.
  • Supported by:
    Western Young Scholars Project of Chinese Academy of Sciences (RRJZ2021003)

Missing value attention clustering algorithm based on latent factor model in subspace

Xiaofei WANG1,2, Shengli BAO1,2(), Jionghuan CHEN1,2   

  1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu Sichuan 610041, China
    2.University of Chinese Academy of Sciences,Beijing 100049,China
  • Received:2022-12-12 Revised:2023-02-13 Accepted:2023-02-16 Online:2023-03-09 Published:2023-12-10
  • Contact: Shengli BAO
  • About author: WANG Xiaofei, born in 1997, M.S. candidate. His research interests include machine learning and recommendation algorithms.
    BAO Shengli, born in 1973, Ph.D., research fellow. His research interests include software engineering and big data.
    CHEN Jionghuan, born in 1998, M.S. candidate. His research interests include machine learning and big data.
  • Supported by:
    Western Young Scholars Project of Chinese Academy of Sciences(RRJZ2021003)

Abstract:

To address the problems that, when filling missing samples, traditional clustering algorithms find it hard to measure sample similarity and produce low-quality filled data, a missing value attention clustering algorithm based on Latent Factor Model (LFM) in subspace was proposed. First, the original data space was mapped to a low-dimensional subspace by LFM to reduce the sparsity of the samples. Second, an attention weight graph between different features was constructed from the feature matrix obtained by decomposing the original space, optimizing the similarity calculation between subspace samples so that sample similarity was computed more accurately and generalized better. Finally, to reduce the excessive time complexity of the sample similarity calculation, a multi-pointer attention weight graph was designed for optimization. Experiments were conducted on four datasets with values missing at random in given proportions. On the Hand-digits dataset, compared with the K-nearest neighbors Interpolation Subspace Clustering (KISC) algorithm for high-dimensional feature-missing data, the proposed algorithm improved the clustering Accuracy (ACC) by 2.33 percentage points and the Normalized Mutual Information (NMI) by 2.77 percentage points at a missing proportion of 10%, and improved ACC by 0.39 percentage points and NMI by 1.33 percentage points at a missing proportion of 20%, verifying the effectiveness of the proposed algorithm.

Key words: latent factor model, missing value, attention mechanism, clustering algorithm, subspace

Abstract:

To solve the problems that it is difficult for traditional clustering algorithms to measure sample similarity and that the quality of filled data is poor when filling missing samples, a missing value attention clustering algorithm based on Latent Factor Model (LFM) in subspace was proposed. First, LFM was used to map the original data space to a low-dimensional subspace to reduce the sparsity of the samples. Then, an attention weight graph between different features was constructed from the feature matrix obtained by decomposing the original space, and the similarity calculation between subspace samples was optimized, making sample similarity calculation more accurate and better generalized. Finally, to reduce the high time complexity of the sample similarity calculation, a multi-pointer attention weight graph was designed for optimization. The algorithm was tested on four datasets with values missing at random in given proportions. On the Hand-digits dataset, compared with the K-nearest neighbors Interpolation Subspace Clustering (KISC) algorithm for high-dimensional feature-missing data, the proposed algorithm improved the Accuracy (ACC) by 2.33 percentage points and the Normalized Mutual Information (NMI) by 2.77 percentage points at a missing proportion of 10%, and improved ACC by 0.39 percentage points and NMI by 1.33 percentage points at a missing proportion of 20%, which verified the effectiveness of the proposed algorithm.
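The two core steps in the abstract — a latent factor decomposition fit only on observed entries, followed by a feature-level attention weight graph built from the decomposed feature matrix — can be illustrated with a minimal NumPy sketch. Everything below (the function name `lfm_embed`, the gradient-descent update, and the scaled-softmax construction of the weight graph) is an illustrative assumption under the abstract's description, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def lfm_embed(X, mask, k=8, lr=0.01, reg=0.1, epochs=200):
    """Factor X ≈ P @ Q.T using only observed entries (mask == 1),
    so missing values never influence the learned subspace."""
    n, d = X.shape
    P = rng.normal(scale=0.1, size=(n, k))  # sample embeddings (the subspace)
    Q = rng.normal(scale=0.1, size=(d, k))  # feature embeddings
    for _ in range(epochs):
        E = (X - P @ Q.T) * mask            # reconstruction error, observed entries only
        P += lr * (E @ Q - reg * P)         # regularized gradient steps
        Q += lr * (E.T @ P - reg * Q)
    return P, Q

# Toy data with roughly 20% of entries missing at random.
X = rng.random((50, 12))
mask = (rng.random(X.shape) > 0.2).astype(float)
P, Q = lfm_embed(X * mask, mask)

# Attention weights between features, derived from the feature matrix Q:
# a softmax over scaled dot products (the exact construction used by the
# paper's weight graph is not given in the abstract and is assumed here).
S = Q @ Q.T / np.sqrt(Q.shape[1])
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

print(P.shape, A.shape)  # (50, 8) (12, 12)
```

`P` gives the low-dimensional sample representations on which clustering would run, and each row of `A` sums to 1, weighting how much each feature attends to the others when comparing two samples.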

Key words: Latent Factor Model (LFM), missing value, attention mechanism, clustering algorithm, subspace

CLC number: