Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (11): 3280-3287. DOI: 10.11772/j.issn.1001-9081.2019050928

• Data Science and Technology •

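Subspace clustering algorithm for high dimensional uncertain data

WAN Jing, ZHENG Longjun, HE Yunbin, LI Song

  1. School of Computer Science and Technology, Harbin University of Science and Technology, Harbin Heilongjiang 150080, China
  • Received: 2019-06-03 Revised: 2019-08-29 Online: 2019-09-11 Published: 2019-11-10
  • Corresponding author: WAN Jing
  • About the authors: WAN Jing (1972-), female, born in Taixing, Jiangsu, professor, Ph.D.; main research interests: database theory and applications, embedded systems. ZHENG Longjun (1993-), male, born in Jiamusi, Heilongjiang, master's student; main research interests: data mining, spatial data clustering. HE Yunbin (1972-), male, born in Pingtan, Fujian, professor, Ph.D.; main research interests: database theory and applications. LI Song (1977-), male, born in Peixian, Jiangsu, associate professor, Ph.D.; main research interests: database theory and applications, data mining, data query.
  • Supported by:
    National Natural Science Foundation of China (61872105); Science and Technology Research Project of Heilongjiang Education Department (1253lz004); Science Foundation for Returned Scholars of Heilongjiang Province (LC2018030).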

Abstract: Reducing the impact of uncertain data on the clustering of high-dimensional data is a current research difficulty. To address the low clustering accuracy caused by uncertain data and the curse of dimensionality, a two-stage method was adopted: the uncertain data are first converted into certain data, and the certain data are then clustered. During the conversion, uncertain data were divided into value-uncertain data and dimension-uncertain data, and the two kinds were processed separately to improve the efficiency of the algorithm. A K-Nearest Neighbor (KNN) query combined with expected distance was used to obtain, for each uncertain datum, the approximate value with the least impact on the clustering result, thereby improving clustering accuracy. Once the certain data were obtained, subspace clustering was used to avoid the curse of dimensionality. Experimental results show that the Clique-based clustering algorithm for high-dimensional uncertain data (UClique) performs well on UCI datasets, with good noise resistance and scalability. It obtains good clustering results on high-dimensional data and achieves high accuracy across different uncertain datasets, which indicates that the algorithm is reasonably robust and can effectively cluster high-dimensional uncertain datasets.
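
The abstract does not spell out how the expected-distance KNN query picks an approximate value, so the following is only a minimal sketch of one plausible reading, not the authors' UClique procedure. It assumes (hypothetically) that an uncertain object is given as a list of candidate values with probabilities, that "expected distance" means the probability-weighted Euclidean distance to a certain point, and that the chosen approximation is the candidate closest in total distance to its k nearest certain neighbours. The names `expected_distance` and `determinize` are illustrative only.

```python
import math

def expected_distance(candidates, probs, point):
    """Probability-weighted Euclidean distance from an uncertain object to a certain point.

    Assumption: the uncertain object takes value candidates[i] with probability probs[i].
    """
    return sum(p * math.dist(c, point) for c, p in zip(candidates, probs))

def determinize(candidates, probs, certain_points, k=3):
    """Pick one candidate value to stand in for the uncertain object (illustrative rule).

    1. Rank the certain points by expected distance to the uncertain object.
    2. Keep the k nearest ones.
    3. Return the candidate with the smallest total distance to those neighbours,
       together with the indices of the neighbours.
    """
    ranked = sorted(
        range(len(certain_points)),
        key=lambda i: expected_distance(candidates, probs, certain_points[i]),
    )
    knn = ranked[:k]
    neighbours = [certain_points[i] for i in knn]
    best = min(candidates, key=lambda c: sum(math.dist(c, q) for q in neighbours))
    return best, knn

# Hypothetical toy data: one uncertain object with two candidate values.
candidates = [(0.5, 0.5), (9.0, 9.0)]
probs = [0.8, 0.2]
certain_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]

value, knn = determinize(candidates, probs, certain_points, k=3)
print(value, sorted(knn))
```

In this toy case the three points near the origin are the nearest neighbours by expected distance, so the candidate `(0.5, 0.5)` is selected. The resulting certain values could then be passed to a subspace clustering step such as Clique, which this sketch does not cover.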

Key words: high-dimensional, uncertain, Clique algorithm, K-Nearest Neighbor (KNN)

CLC Number: