Subspace clustering algorithm for high dimensional uncertain data

Journal of Computer Applications, 2019, Vol. 39, Issue 11: 3280-3287. DOI: 10.11772/j.issn.1001-9081.2019050928

WAN Jing, ZHENG Longjun, HE Yunbin, LI Song

School of Computer Science and Technology, Harbin University of Science and Technology, Harbin, Heilongjiang 150080, China

Received: 2019-06-03 Revised: 2019-08-29 Online: 2019-09-11 Published: 2019-11-10
Corresponding author: WAN Jing
About the authors: WAN Jing (born 1972), female, from Taixing, Jiangsu, professor, Ph.D.; research interests: database theory and application, embedded systems. ZHENG Longjun (born 1993), male, from Jiamusi, Heilongjiang, M.S. candidate; research interests: data mining, spatial data clustering. HE Yunbin (born 1972), male, from Pingtan, Fujian, professor, Ph.D.; research interests: database theory and application. LI Song (born 1977), male, from Peixian, Jiangsu, associate professor, Ph.D.; research interests: database theory and application, data mining, data query.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61872105), the Science and Technology Research Project of Heilongjiang Education Department (1253lz004), and the Science Foundation for Returned Scholars of Heilongjiang Province (LC2018030).

Abstract

Reducing the impact of uncertain data on the clustering of high-dimensional data is a current research challenge. To address the low clustering accuracy caused by uncertain data and the curse of dimensionality, a two-stage method is adopted: uncertain data are first converted into certain data, and the certain data are then clustered. In the conversion stage, uncertain data are divided into value-uncertain data and dimension-uncertain data, which are processed separately to improve the efficiency of the algorithm. A K-Nearest Neighbor (KNN) query combined with expected distance is used to obtain the approximate values of uncertain data that have the least impact on the clustering results, thereby improving clustering accuracy. Once the certain data are obtained, subspace clustering is applied to avoid the impact of the curse of dimensionality. The experimental results show that the Clique-based clustering algorithm for high-dimensional uncertain data (UClique) performs well on UCI datasets, has good noise resistance and scalability, obtains good clustering results on high-dimensional data, and achieves high accuracy on different uncertain datasets, indicating that the algorithm is robust and can effectively cluster high-dimensional uncertain datasets.

Key words: high-dimension, uncertain, Clique algorithm, K-Nearest Neighbor (KNN)
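
The abstract does not spell out the procedure, so the following is only a minimal Python sketch of the two stages under stated assumptions, not the authors' UClique implementation. It assumes that an uncertain object is a discrete set of candidate values with probabilities, that the expected distance is the probability-weighted Euclidean distance, and that the "approximate value" is the candidate closest to the centroid of the k nearest certain points. The second stage shows only the first step of a Clique-style grid search (finding dense one-dimensional units). All function and parameter names (`expected_distance`, `determinize`, `dense_units_1d`, `xi`, `tau`) are hypothetical.

```python
import numpy as np

def expected_distance(samples, probs, point):
    """Probability-weighted Euclidean distance from an uncertain object to a point."""
    dists = np.linalg.norm(samples - point, axis=1)
    return float(np.dot(probs, dists))

def knn_by_expected_distance(samples, probs, certain_points, k):
    """Indices of the k certain points with the smallest expected distance."""
    ed = np.array([expected_distance(samples, probs, p) for p in certain_points])
    return np.argsort(ed)[:k]

def determinize(samples, probs, certain_points, k):
    """Assumed rule: keep the candidate value closest to the centroid of the k neighbors."""
    idx = knn_by_expected_distance(samples, probs, certain_points, k)
    centroid = certain_points[idx].mean(axis=0)
    best = np.argmin(np.linalg.norm(samples - centroid, axis=1))
    return samples[best]

def dense_units_1d(data, xi, tau):
    """Split each dimension into xi equal intervals; keep those holding a share >= tau."""
    n, d = data.shape
    lo, hi = data.min(axis=0), data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero on constant columns
    cells = np.minimum(((data - lo) / span * xi).astype(int), xi - 1)
    dense = set()
    for dim in range(d):
        counts = np.bincount(cells[:, dim], minlength=xi)
        for interval in np.nonzero(counts / n >= tau)[0]:
            dense.add((dim, int(interval)))
    return dense

if __name__ == "__main__":
    certain = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]])
    samples = np.array([[0.2, 0.2], [4.0, 4.0]])  # candidate values of one uncertain object
    probs = np.array([0.8, 0.2])
    value = determinize(samples, probs, certain, k=3)
    print(value)
    print(dense_units_1d(np.vstack([certain, value]), xi=2, tau=0.5))
```

In a full Clique-style search, dense units found in single dimensions would then be combined into candidate units in higher-dimensional subspaces; that step is omitted here to keep the sketch short.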

CLC Number:

TP311.13

WAN Jing, ZHENG Longjun, HE Yunbin, LI Song. Subspace clustering algorithm for high dimensional uncertain data[J]. Journal of Computer Applications, 2019, 39(11): 3280-3287.


