Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (11): 3322-3329.DOI: 10.11772/j.issn.1001-9081.2021111964

• CCF Bigdata 2021 • Previous Articles     Next Articles

K‑nearest neighbor imputation subspace clustering algorithm for high‑dimensional data with feature missing

Yongjian QIAO1, Xiaolin LIU1,2, Liang BAI1,2()   

  1. 1.School of Computer and Information Technology,Shanxi University,Taiyuan Shanxi 030006,China
    2.Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education (Shanxi University),Taiyuan Shanxi 030006,China
  • Received:2021-11-19 Revised:2021-11-29 Accepted:2021-12-06 Online:2021-12-31 Published:2022-11-10
  • Contact: Liang BAI
  • About author:QIAO Yongjian, born in 1995, M. S. candidate. His research interests include missing data clustering.
    LIU Xiaolin, born in 1990, Ph. D. candidate. Her research interests include machine learning, clustering analysis.
    BAI Liang, born in 1982, Ph. D., professor. His research interests include clustering analysis.
  • Supported by:
    National Natural Science Foundation of China(62022052);Shanxi Basic Research Program(201901D211192);“1331 Project” Quality and Efficiency Improvement Construction Program of Shanxi Province

面向高维特征缺失数据的K最近邻插补子空间聚类算法

乔永坚1, 刘晓琳1,2, 白亮1,2()   

  1. 1.山西大学 计算机与信息技术学院,太原 030006
    2.计算智能与中文信息处理教育部重点实验室(山西大学),太原 030006
  • 通讯作者: 白亮
  • 作者简介:乔永坚(1995—),男,山西临汾人,硕士研究生,主要研究方向:缺失数据聚类
    刘晓琳(1990—),女,山西太原人,博士研究生,CCF会员,主要研究方向:机器学习、聚类分析
    白亮(1982—),男,山西太原人,教授,博士,CCF会员,主要研究方向:聚类分析。bailiang@sxu.edu.cn
  • 基金资助:
    国家自然科学基金资助项目(62022052);山西省基础研究计划项目(201901D211192);山西省“1331工程”提质增效建设计划项目

Abstract:

During the clustering process of high?dimensional data with feature missing, there are problems of the curse of dimensionality caused by data high dimension and the invalidity of effective distance calculation between samples caused by data feature missing. To resolve above issues, a K?Nearest Neighbor (KNN) imputation subspace clustering algorithm for high?dimensional data with feature missing was proposed, namely KISC. Firstly, the nearest neighbor relationship in the subspace of the high?dimensional data with feature missing was used to perform KNN imputation on the feature missing data in the original space. Then, multiple iterations of matrix decomposition and KNN imputation were used to obtain the final reliable subspace structure of the data, and the clustering analysis was performed in that obtained subspace structure. The clustering results in the original space of six image datasets show that the KISC algorithm has better performance than the comparison algorithm which clusters directly after interpolation, indicating that the subspace structure can identify the potential clustering structure of the data more easily and effectively; the clustering results in the subspace of six high?dimensional datasets shows that the KISC algorithm outperforms the comparison algorithm in all datasets, and has the optimal clustering Accuracy and Normalized Mutual Information (NMI) on most of the datasets. The KISC algorithm can deal with high?dimensional data with feature missing more effectively and improve the clustering performance of these data.

Key words: high?dimensional data, feature missing, imputation algorithm, subspace structure, clustering

摘要:

针对高维特征缺失数据在聚类过程中面临的因数据高维引发的维度灾难问题和数据特征缺失导致的样本间有效距离计算失效问题,提出一种面向高维特征缺失数据的K最近邻(KNN)插补子空间聚类算法KISC。首先,利用高维特征缺失数据的子空间下的近邻关系对原始空间下的特征缺失数据进行KNN插补;然后,利用多次迭代矩阵分解和KNN插补获得数据最终可靠的子空间结构,并在该子空间结构进行聚类分析。在6个图像数据集原始空间的聚类结果表明,相较于经过插补后直接进行聚类的对比算法,KISC算法聚类效果更好,说明子空间结构能够更加容易且有效地识别数据的潜在聚类结构;在6个高维数据集子空间下的聚类结果显示,KISC算法在各个数据集的聚类性能均优于对比算法,且在大多数据集上取得了最优的聚类精确度(ACC)和标准互信息(NMI)。KISC算法能够更加有效地处理高维特征缺失数据,提高算法的聚类性能。

关键词: 高维数据, 特征缺失, 插补算法, 子空间结构, 聚类

CLC Number: