《计算机应用》唯一官方网站 ›› 2022, Vol. 42 ›› Issue (11): 3322-3329.DOI: 10.11772/j.issn.1001-9081.2021111964
所属专题: 第九届CCF大数据学术会议(CCF Bigdata 2021)
收稿日期:
2021-11-19
修回日期:
2021-11-29
接受日期:
2021-12-06
发布日期:
2021-12-31
出版日期:
2022-11-10
通讯作者:
白亮
作者简介:
乔永坚(1995—),男,山西临汾人,硕士研究生,主要研究方向:缺失数据聚类基金资助:
Yongjian QIAO1, Xiaolin LIU1,2, Liang BAI1,2()
Received:
2021-11-19
Revised:
2021-11-29
Accepted:
2021-12-06
Online:
2021-12-31
Published:
2022-11-10
Contact:
Liang BAI
About author:
QIAO Yongjian, born in 1995, M. S. candidate. His research interests include missing data clustering.Supported by:
摘要:
针对高维特征缺失数据在聚类过程中面临的因数据高维引发的维度灾难问题和数据特征缺失导致的样本间有效距离计算失效问题,提出一种面向高维特征缺失数据的K最近邻(KNN)插补子空间聚类算法KISC。首先,利用高维特征缺失数据的子空间下的近邻关系对原始空间下的特征缺失数据进行KNN插补;然后,利用多次迭代矩阵分解和KNN插补获得数据最终可靠的子空间结构,并在该子空间结构进行聚类分析。在6个图像数据集原始空间的聚类结果表明,相较于经过插补后直接进行聚类的对比算法,KISC算法聚类效果更好,说明子空间结构能够更加容易且有效地识别数据的潜在聚类结构;在6个高维数据集子空间下的聚类结果显示,KISC算法在各个数据集的聚类性能均优于对比算法,且在大多数据集上取得了最优的聚类精确度(ACC)和标准互信息(NMI)。KISC算法能够更加有效地处理高维特征缺失数据,提高算法的聚类性能。
中图分类号:
乔永坚, 刘晓琳, 白亮. 面向高维特征缺失数据的K最近邻插补子空间聚类算法[J]. 计算机应用, 2022, 42(11): 3322-3329.
Yongjian QIAO, Xiaolin LIU, Liang BAI. K‑nearest neighbor imputation subspace clustering algorithm for high‑dimensional data with feature missing[J]. Journal of Computer Applications, 2022, 42(11): 3322-3329.
符号 | 定义 | 符号 | 定义 |
---|---|---|---|
高维特征缺失数据集 | 系数矩阵的第 | ||
完整数据集 | 基矩阵的第 | ||
特征缺失数据 | 微调后的系数矩阵的第 | ||
完整数据 | 数据 | ||
数据 | 类的簇数 |
表1 符号与定义
Tab. 1 Symbols and definitions
符号 | 定义 | 符号 | 定义 |
---|---|---|---|
高维特征缺失数据集 | 系数矩阵的第 | ||
完整数据集 | 基矩阵的第 | ||
特征缺失数据 | 微调后的系数矩阵的第 | ||
完整数据 | 数据 | ||
数据 | 类的簇数 |
数据集 | 样本个数 | 样本维度 | 类别数 |
---|---|---|---|
Handwritten | 2 000 | 76 | 10 |
Scene | 2 688 | 215 | 8 |
ORL | 400 | 1 024 | 40 |
Yale | 165 | 1 024 | 15 |
COIL20 | 1 440 | 1 024 | 20 |
3Sources | 416 | 3 560 | 6 |
表2 数据集描述
Tab. 2 Description of datasets
数据集 | 样本个数 | 样本维度 | 类别数 |
---|---|---|---|
Handwritten | 2 000 | 76 | 10 |
Scene | 2 688 | 215 | 8 |
ORL | 400 | 1 024 | 40 |
Yale | 165 | 1 024 | 15 |
COIL20 | 1 440 | 1 024 | 20 |
3Sources | 416 | 3 560 | 6 |
数据集 | 数据缺失 比例/% | ACC/% | |||||||
---|---|---|---|---|---|---|---|---|---|
0值插补 | 最大值插补 | 最小值插补 | 均值插补 | KNN插补 | EM插补 | MF一次插补 | 本文插补算法 | ||
Hand‑written | 10 | 52.00 | 51.22 | 52.70 | 59.85 | 58.40 | 59.75 | 58.69 | 69.93 |
20 | 48.39 | 37.27 | 48.62 | 60.65 | 58.79 | 62.24 | 60.40 | 68.25 | |
30 | 40.28 | 27.47 | 40.54 | 54.56 | nan | 56.22 | 60.88 | 65.76 | |
40 | 30.47 | 23.79 | 33.71 | 58.19 | nan | 60.13 | 57.32 | 65.76 | |
Scene | 10 | 54.54 | 43.02 | 54.51 | 55.03 | 54.81 | 55.36 | 52.56 | 56.77 |
20 | 54.35 | 35.69 | 54.68 | 54.70 | 54.39 | 55.21 | 52.86 | 56.92 | |
30 | 52.47 | 25.29 | 52.53 | 54.83 | nan | 54.95 | 52.13 | 58.18 | |
40 | 51.50 | 19.60 | 51.18 | 53.34 | nan | 55.56 | 47.70 | 67.31 | |
ORL | 10 | 34.99 | 36.84 | 42.41 | 49.64 | 48.87 | 48.59 | 48.44 | 58.11 |
20 | 25.93 | 22.50 | 31.76 | 46.90 | 46.23 | 48.37 | 48.45 | 60.41 | |
30 | 22.04 | 17.66 | 20.34 | 42.87 | nan | 48.63 | 49.20 | 58.33 | |
40 | 18.55 | 14.51 | 14.88 | 37.59 | nan | 49.03 | 49.23 | 57.98 | |
Yale | 10 | 35.90 | 28.46 | 36.10 | 37.53 | 37.91 | 35.46 | 35.07 | 42.15 |
20 | 33.71 | 23.63 | 32.38 | 35.63 | 36.40 | 34.58 | 34.90 | 42.36 | |
30 | 30.32 | 21.03 | 30.92 | 33.25 | nan | 36.66 | 36.42 | 45.55 | |
40 | 26.60 | 16.42 | 28.02 | 29.63 | nan | 35.15 | 35.43 | 45.33 | |
COIL20 | 10 | 54.57 | 47.80 | 53.33 | 52.30 | 52.11 | 51.12 | 52.56 | 61.83 |
20 | 53.20 | 40.71 | 52.95 | 50.32 | 53.15 | 53.48 | 51.08 | 63.83 | |
30 | 53.21 | 29.98 | 53.14 | 51.16 | nan | 53.92 | 49.44 | 63.51 | |
40 | 52.55 | 20.42 | 52.91 | 51.57 | nan | 53.89 | 50.63 | 60.99 | |
3Sources | 10 | 44.02 | 23.18 | 44.02 | 53.95 | 53.34 | 55.88 | 44.42 | 63.66 |
20 | 39.88 | 22.80 | 39.88 | 48.45 | 49.89 | 51.89 | 52.49 | 63.76 | |
30 | 34.65 | 22.38 | 34.65 | 42.19 | nan | 49.88 | 49.17 | 60.17 | |
40 | 31.72 | 21.97 | 31.72 | 39.64 | nan | 48.24 | 45.15 | 57.95 |
表3 在原始空间聚类结果的ACC值比较
Tab. 3 Comparison of ACC value among spatial clustering results in original space
数据集 | 数据缺失 比例/% | ACC/% | |||||||
---|---|---|---|---|---|---|---|---|---|
0值插补 | 最大值插补 | 最小值插补 | 均值插补 | KNN插补 | EM插补 | MF一次插补 | 本文插补算法 | ||
Hand‑written | 10 | 52.00 | 51.22 | 52.70 | 59.85 | 58.40 | 59.75 | 58.69 | 69.93 |
20 | 48.39 | 37.27 | 48.62 | 60.65 | 58.79 | 62.24 | 60.40 | 68.25 | |
30 | 40.28 | 27.47 | 40.54 | 54.56 | nan | 56.22 | 60.88 | 65.76 | |
40 | 30.47 | 23.79 | 33.71 | 58.19 | nan | 60.13 | 57.32 | 65.76 | |
Scene | 10 | 54.54 | 43.02 | 54.51 | 55.03 | 54.81 | 55.36 | 52.56 | 56.77 |
20 | 54.35 | 35.69 | 54.68 | 54.70 | 54.39 | 55.21 | 52.86 | 56.92 | |
30 | 52.47 | 25.29 | 52.53 | 54.83 | nan | 54.95 | 52.13 | 58.18 | |
40 | 51.50 | 19.60 | 51.18 | 53.34 | nan | 55.56 | 47.70 | 67.31 | |
ORL | 10 | 34.99 | 36.84 | 42.41 | 49.64 | 48.87 | 48.59 | 48.44 | 58.11 |
20 | 25.93 | 22.50 | 31.76 | 46.90 | 46.23 | 48.37 | 48.45 | 60.41 | |
30 | 22.04 | 17.66 | 20.34 | 42.87 | nan | 48.63 | 49.20 | 58.33 | |
40 | 18.55 | 14.51 | 14.88 | 37.59 | nan | 49.03 | 49.23 | 57.98 | |
Yale | 10 | 35.90 | 28.46 | 36.10 | 37.53 | 37.91 | 35.46 | 35.07 | 42.15 |
20 | 33.71 | 23.63 | 32.38 | 35.63 | 36.40 | 34.58 | 34.90 | 42.36 | |
30 | 30.32 | 21.03 | 30.92 | 33.25 | nan | 36.66 | 36.42 | 45.55 | |
40 | 26.60 | 16.42 | 28.02 | 29.63 | nan | 35.15 | 35.43 | 45.33 | |
COIL20 | 10 | 54.57 | 47.80 | 53.33 | 52.30 | 52.11 | 51.12 | 52.56 | 61.83 |
20 | 53.20 | 40.71 | 52.95 | 50.32 | 53.15 | 53.48 | 51.08 | 63.83 | |
30 | 53.21 | 29.98 | 53.14 | 51.16 | nan | 53.92 | 49.44 | 63.51 | |
40 | 52.55 | 20.42 | 52.91 | 51.57 | nan | 53.89 | 50.63 | 60.99 | |
3Sources | 10 | 44.02 | 23.18 | 44.02 | 53.95 | 53.34 | 55.88 | 44.42 | 63.66 |
20 | 39.88 | 22.80 | 39.88 | 48.45 | 49.89 | 51.89 | 52.49 | 63.76 | |
30 | 34.65 | 22.38 | 34.65 | 42.19 | nan | 49.88 | 49.17 | 60.17 | |
40 | 31.72 | 21.97 | 31.72 | 39.64 | nan | 48.24 | 45.15 | 57.95 |
数据集 | 数据缺失 比例/% | NMI/% | |||||||
---|---|---|---|---|---|---|---|---|---|
0值插补 | 最大值插补 | 最小值插补 | 均值插补 | KNN插补 | EM插补 | MF一次插补 | 本文插补算法 | ||
Hand‑written | 10 | 48.89 | 41.16 | 51.43 | 56.45 | 57.70 | 57.13 | 56.93 | 62.41 |
20 | 41.34 | 26.98 | 41.80 | 58.24 | 57.32 | 57.02 | 56.98 | 59.73 | |
30 | 33.49 | 17.10 | 33.89 | 49.77 | nan | 51.22 | 50.96 | 55.80 | |
40 | 23.96 | 14.06 | 25.66 | 52.18 | nan | 51.76 | 49.71 | 56.39 | |
Scene | 10 | 42.39 | 29.39 | 42.51 | 42.63 | 42.38 | 43.22 | 39.88 | 45.14 |
20 | 41.38 | 21.40 | 41.24 | 42.07 | 41.66 | 43.00 | 40.21 | 45.32 | |
30 | 39.76 | 11.23 | 40.04 | 41.45 | nan | 42.65 | 39.20 | 46.22 | |
40 | 38.64 | 4.710 | 38.77 | 40.72 | nan | 43.49 | 34.86 | 48.69 | |
ORL | 10 | 51.58 | 55.06 | 60.96 | 68.45 | 68.43 | 69.06 | 69.03 | 75.64 |
20 | 41.16 | 36.54 | 46.89 | 66.05 | 65.41 | 68.73 | 68.79 | 76.97 | |
30 | 38.49 | 30.05 | 30.17 | 62.24 | nan | 68.94 | 69.36 | 75.61 | |
40 | 63.14 | 23.45 | 22.08 | 56.34 | nan | 68.99 | 69.19 | 75.09 | |
Yale | 10 | 39.52 | 30.82 | 40.57 | 42.24 | 43.05 | 41.12 | 40.50 | 46.52 |
20 | 36.84 | 24.15 | 36.07 | 39.64 | 41.51 | 40.73 | 40.36 | 46.53 | |
30 | 33.51 | 19.97 | 33.97 | 37.20 | nan | 41.84 | 41.38 | 47.79 | |
40 | 29.04 | 13.82 | 30.67 | 33.15 | nan | 40.10 | 40.86 | 46.40 | |
COIL20 | 10 | 69.76 | 64.25 | 69.69 | 69.29 | 68.70 | 68.62 | 69.14 | 72.11 |
20 | 68.10 | 54.58 | 68.87 | 68.04 | 69.74 | 69.98 | 68.14 | 73.73 | |
30 | 68.48 | 44.10 | 67.85 | 68.12 | nan | 70.04 | 66.57 | 73.43 | |
40 | 66.94 | 33.06 | 67.42 | 68.15 | nan | 70.15 | 66.33 | 72.50 | |
3Sources | 10 | 27.44 | 2.140 | 27.44 | 35.81 | 37.59 | 39.99 | 27.44 | 52.42 |
20 | 19.91 | 1.790 | 19.91 | 30.56 | 33.93 | 40.10 | 38.57 | 50.25 | |
30 | 14.36 | 1.670 | 14.36 | 24.63 | nan | 35.81 | 33.58 | 47.11 | |
40 | 10.64 | 1.620 | 10.64 | 18.93 | nan | 32.90 | 30.04 | 43.82 |
表4 在原始空间聚类结果的NMI值比较
Tab. 4 Comparison of NMI value among clustering results in original space
数据集 | 数据缺失 比例/% | NMI/% | |||||||
---|---|---|---|---|---|---|---|---|---|
0值插补 | 最大值插补 | 最小值插补 | 均值插补 | KNN插补 | EM插补 | MF一次插补 | 本文插补算法 | ||
Hand‑written | 10 | 48.89 | 41.16 | 51.43 | 56.45 | 57.70 | 57.13 | 56.93 | 62.41 |
20 | 41.34 | 26.98 | 41.80 | 58.24 | 57.32 | 57.02 | 56.98 | 59.73 | |
30 | 33.49 | 17.10 | 33.89 | 49.77 | nan | 51.22 | 50.96 | 55.80 | |
40 | 23.96 | 14.06 | 25.66 | 52.18 | nan | 51.76 | 49.71 | 56.39 | |
Scene | 10 | 42.39 | 29.39 | 42.51 | 42.63 | 42.38 | 43.22 | 39.88 | 45.14 |
20 | 41.38 | 21.40 | 41.24 | 42.07 | 41.66 | 43.00 | 40.21 | 45.32 | |
30 | 39.76 | 11.23 | 40.04 | 41.45 | nan | 42.65 | 39.20 | 46.22 | |
40 | 38.64 | 4.710 | 38.77 | 40.72 | nan | 43.49 | 34.86 | 48.69 | |
ORL | 10 | 51.58 | 55.06 | 60.96 | 68.45 | 68.43 | 69.06 | 69.03 | 75.64 |
20 | 41.16 | 36.54 | 46.89 | 66.05 | 65.41 | 68.73 | 68.79 | 76.97 | |
30 | 38.49 | 30.05 | 30.17 | 62.24 | nan | 68.94 | 69.36 | 75.61 | |
40 | 63.14 | 23.45 | 22.08 | 56.34 | nan | 68.99 | 69.19 | 75.09 | |
Yale | 10 | 39.52 | 30.82 | 40.57 | 42.24 | 43.05 | 41.12 | 40.50 | 46.52 |
20 | 36.84 | 24.15 | 36.07 | 39.64 | 41.51 | 40.73 | 40.36 | 46.53 | |
30 | 33.51 | 19.97 | 33.97 | 37.20 | nan | 41.84 | 41.38 | 47.79 | |
40 | 29.04 | 13.82 | 30.67 | 33.15 | nan | 40.10 | 40.86 | 46.40 | |
COIL20 | 10 | 69.76 | 64.25 | 69.69 | 69.29 | 68.70 | 68.62 | 69.14 | 72.11 |
20 | 68.10 | 54.58 | 68.87 | 68.04 | 69.74 | 69.98 | 68.14 | 73.73 | |
30 | 68.48 | 44.10 | 67.85 | 68.12 | nan | 70.04 | 66.57 | 73.43 | |
40 | 66.94 | 33.06 | 67.42 | 68.15 | nan | 70.15 | 66.33 | 72.50 | |
3Sources | 10 | 27.44 | 2.140 | 27.44 | 35.81 | 37.59 | 39.99 | 27.44 | 52.42 |
20 | 19.91 | 1.790 | 19.91 | 30.56 | 33.93 | 40.10 | 38.57 | 50.25 | |
30 | 14.36 | 1.670 | 14.36 | 24.63 | nan | 35.81 | 33.58 | 47.11 | |
40 | 10.64 | 1.620 | 10.64 | 18.93 | nan | 32.90 | 30.04 | 43.82 |
数据集 | 数据缺失 比例/% | ACC/% | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
MF | 0值插补 | 最大值插补 | 最小值插补 | 均值插补 | KNN插补 | EM插补 | MF一次插补 | 本文插补算法 | ||
Hand‑written | 10 | 67.53 | 58.54 | 47.16 | 58.09 | 66.06 | 66.77 | 61.07 | 66.48 | 69.93 |
20 | 66.35 | 46.69 | 28.43 | 47.83 | 67.81 | 65.77 | 63.22 | 67.43 | 68.25 | |
30 | 12.87 | 37.23 | 26.10 | 38.09 | 57.64 | nan | 59.51 | 60.88 | 65.76 | |
40 | 13.23 | 29.71 | 22.55 | 31.40 | 60.48 | nan | 61.14 | 59.29 | 65.76 | |
Scene | 10 | 55.64 | 55.11 | 43.37 | 55.49 | 55.38 | 55.31 | 55.65 | 55.82 | 56.77 |
20 | 55.01 | 54.04 | 21.72 | 54.37 | 55.25 | 55.09 | 55.84 | 54.35 | 56.92 | |
30 | 14.76 | 50.22 | 14.94 | 50.79 | 56.61 | nan | 56.28 | 55.42 | 58.18 | |
40 | 14.57 | 48.88 | 14.47 | 49.09 | 55.47 | nan | 55.90 | 55.89 | 67.31 | |
ORL | 10 | 56.64 | 49.54 | 50.91 | 55.21 | 56.91 | 57.18 | 56.20 | 57.68 | 58.11 |
20 | 58.14 | 34.63 | 43.95 | 45.41 | 56.88 | 56.74 | 56.98 | 59.45 | 60.41 | |
30 | 16.27 | 26.78 | 37.84 | 38.08 | 55.48 | nan | 56.21 | 58.15 | 58.33 | |
40 | 15.39 | 22.44 | 27.95 | 30.01 | 54.34 | nan | 58.58 | 55.86 | 57.98 | |
Yale | 10 | 41.11 | 38.63 | 38.06 | 40.09 | 41.21 | 41.31 | 41.76 | 41.64 | 42.15 |
20 | 39.93 | 37.85 | 38.46 | 39.36 | 39.36 | 38.06 | 41.47 | 41.55 | 42.36 | |
30 | 12.73 | 36.94 | 32.46 | 37.88 | 40.24 | nan | 40.86 | 38.46 | 45.55 | |
40 | 12.35 | 32.97 | 30.69 | 30.18 | 37.55 | nan | 41.11 | 44.76 | 45.33 | |
COIL20 | 10 | 57.63 | 55.67 | 57.42 | 56.33 | 55.58 | 56.27 | 54.26 | 61.29 | 61.83 |
20 | 57.32 | 56.27 | 60.11 | 56.12 | 56.14 | 56.84 | 53.85 | 60.09 | 63.83 | |
30 | 19.68 | 55.11 | 57.06 | 55.50 | 59.44 | nan | 55.47 | 61.98 | 63.51 | |
40 | 16.05 | 56.85 | 55.57 | 55.51 | 57.14 | nan | 54.01 | 58.31 | 60.99 | |
3Sources | 10 | 59.68 | 60.09 | 23.58 | 60.09 | 48.99 | 62.44 | 61.51 | 60.21 | 63.66 |
20 | 52.43 | 46.41 | 23.93 | 46.41 | 45.83 | 49.35 | 62.58 | 58.95 | 63.76 | |
30 | 22.76 | 44.27 | 22.62 | 44.27 | 44.73 | nan | 53.83 | 55.00 | 60.17 | |
40 | 20.45 | 44.90 | 21.41 | 44.90 | 38.81 | nan | 55.87 | 55.11 | 57.95 |
表5 不同插补算法+子空间聚类算法在子空间聚类结果的ACC值比较
Tab. 5 Comparison of ACC value among clustering results in subspace by different imputation algorithms + clustering algorithms
数据集 | 数据缺失 比例/% | ACC/% | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
MF | 0值插补 | 最大值插补 | 最小值插补 | 均值插补 | KNN插补 | EM插补 | MF一次插补 | 本文插补算法 | ||
Hand‑written | 10 | 67.53 | 58.54 | 47.16 | 58.09 | 66.06 | 66.77 | 61.07 | 66.48 | 69.93 |
20 | 66.35 | 46.69 | 28.43 | 47.83 | 67.81 | 65.77 | 63.22 | 67.43 | 68.25 | |
30 | 12.87 | 37.23 | 26.10 | 38.09 | 57.64 | nan | 59.51 | 60.88 | 65.76 | |
40 | 13.23 | 29.71 | 22.55 | 31.40 | 60.48 | nan | 61.14 | 59.29 | 65.76 | |
Scene | 10 | 55.64 | 55.11 | 43.37 | 55.49 | 55.38 | 55.31 | 55.65 | 55.82 | 56.77 |
20 | 55.01 | 54.04 | 21.72 | 54.37 | 55.25 | 55.09 | 55.84 | 54.35 | 56.92 | |
30 | 14.76 | 50.22 | 14.94 | 50.79 | 56.61 | nan | 56.28 | 55.42 | 58.18 | |
40 | 14.57 | 48.88 | 14.47 | 49.09 | 55.47 | nan | 55.90 | 55.89 | 67.31 | |
ORL | 10 | 56.64 | 49.54 | 50.91 | 55.21 | 56.91 | 57.18 | 56.20 | 57.68 | 58.11 |
20 | 58.14 | 34.63 | 43.95 | 45.41 | 56.88 | 56.74 | 56.98 | 59.45 | 60.41 | |
30 | 16.27 | 26.78 | 37.84 | 38.08 | 55.48 | nan | 56.21 | 58.15 | 58.33 | |
40 | 15.39 | 22.44 | 27.95 | 30.01 | 54.34 | nan | 58.58 | 55.86 | 57.98 | |
Yale | 10 | 41.11 | 38.63 | 38.06 | 40.09 | 41.21 | 41.31 | 41.76 | 41.64 | 42.15 |
20 | 39.93 | 37.85 | 38.46 | 39.36 | 39.36 | 38.06 | 41.47 | 41.55 | 42.36 | |
30 | 12.73 | 36.94 | 32.46 | 37.88 | 40.24 | nan | 40.86 | 38.46 | 45.55 | |
40 | 12.35 | 32.97 | 30.69 | 30.18 | 37.55 | nan | 41.11 | 44.76 | 45.33 | |
COIL20 | 10 | 57.63 | 55.67 | 57.42 | 56.33 | 55.58 | 56.27 | 54.26 | 61.29 | 61.83 |
20 | 57.32 | 56.27 | 60.11 | 56.12 | 56.14 | 56.84 | 53.85 | 60.09 | 63.83 | |
30 | 19.68 | 55.11 | 57.06 | 55.50 | 59.44 | nan | 55.47 | 61.98 | 63.51 | |
40 | 16.05 | 56.85 | 55.57 | 55.51 | 57.14 | nan | 54.01 | 58.31 | 60.99 | |
3Sources | 10 | 59.68 | 60.09 | 23.58 | 60.09 | 48.99 | 62.44 | 61.51 | 60.21 | 63.66 |
20 | 52.43 | 46.41 | 23.93 | 46.41 | 45.83 | 49.35 | 62.58 | 58.95 | 63.76 | |
30 | 22.76 | 44.27 | 22.62 | 44.27 | 44.73 | nan | 53.83 | 55.00 | 60.17 | |
40 | 20.45 | 44.90 | 21.41 | 44.90 | 38.81 | nan | 55.87 | 55.11 | 57.95 |
数据集 | 数据缺失 比例/% | NMI/% | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
MF | 0值插补 | 最大值插补 | 最小值插补 | 均值插补 | KNN插补 | EM插补 | MF一次插补 | 本文插补算法 | ||
Hand‑ritten | 10 | 58.87 | 47.15 | 33.81 | 47.10 | 58.30 | 59.10 | 57.47 | 57.69 | 62.41 |
20 | 59.05 | 36.66 | 17.49 | 35.88 | 59.18 | 58.27 | 57.53 | 57.27 | 59.73 | |
30 | 0.82 | 28.96 | 13.70 | 28.75 | 52.13 | nan | 54.09 | 51.14 | 55.80 | |
40 | 0.95 | 22.77 | 11.62 | 23.00 | 49.04 | nan | 53.96 | 45.70 | 56.39 | |
Scene | 10 | 43.66 | 43.01 | 26.79 | 43.16 | 43.47 | 43.48 | 43.50 | 43.85 | 45.14 |
20 | 42.77 | 40.75 | 6.420 | 41.01 | 42.35 | 42.38 | 43.61 | 41.60 | 45.32 | |
30 | 0.43 | 37.62 | 0.410 | 38.05 | 42.36 | nan | 43.87 | 40.29 | 46.22 | |
40 | 0.39 | 36.86 | 0.450 | 37.48 | 41.29 | nan | 43.56 | 42.36 | 48.69 | |
ORL | 10 | 74.38 | 68.09 | 69.27 | 72.67 | 75.25 | 74.89 | 74.31 | 75.43 | 75.64 |
20 | 75.39 | 55.04 | 62.73 | 64.46 | 74.72 | 74.67 | 74.89 | 76.53 | 76.97 | |
30 | 1.17 | 47.89 | 57.07 | 57.35 | 73.49 | nan | 74.67 | 75.16 | 75.61 | |
40 | 0.94 | 44.17 | 47.26 | 50.13 | 72.71 | nan | 73.76 | 74.22 | 75.09 | |
Yale | 10 | 45.05 | 44.38 | 41.97 | 45.98 | 45.16 | 45.40 | 45.56 | 45.14 | 46.52 |
20 | 45.24 | 43.78 | 42.27 | 44.35 | 44.97 | 43.91 | 45.81 | 46.10 | 46.53 | |
30 | 0.63 | 43.31 | 36.08 | 44.13 | 45.07 | nan | 46.10 | 43.35 | 47.79 | |
40 | 0.47 | 38.53 | 33.96 | 37.78 | 42.97 | nan | 44.74 | 48.03 | 46.40 | |
COIL20 | 10 | 71.04 | 70.61 | 70.88 | 70.87 | 70.13 | 70.20 | 68.52 | 71.45 | 72.11 |
20 | 70.89 | 68.39 | 72.00 | 68.28 | 70.70 | 71.18 | 68.49 | 71.04 | 73.73 | |
30 | 23.08 | 68.75 | 69.60 | 68.94 | 72.75 | nan | 69.97 | 72.39 | 73.43 | |
40 | 16.90 | 67.74 | 66.04 | 67.68 | 71.44 | nan | 69.31 | 71.28 | 72.50 | |
3Sources | 10 | 47.29 | 48.88 | 2.52 | 48.88 | 35.29 | 42.49 | 44.75 | 49.17 | 52.42 |
20 | 42.65 | 41.35 | 1.90 | 41.35 | 36.46 | 33.35 | 50.87 | 49.98 | 50.25 | |
30 | 24.63 | 36.04 | 2.23 | 36.04 | 31.48 | nan | 37.19 | 41.08 | 47.11 | |
40 | 18.94 | 33.21 | 1.45 | 33.21 | 27.56 | nan | 37.60 | 40.38 | 43.82 |
表6 不同插补算法+子空间聚类算法在子空间聚类结果的NMI值比较
Tab. 6 Comparison of NMI value among clustering results in subspace by different imputation algorithms + clustering algorithms
数据集 | 数据缺失 比例/% | NMI/% | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
MF | 0值插补 | 最大值插补 | 最小值插补 | 均值插补 | KNN插补 | EM插补 | MF一次插补 | 本文插补算法 | ||
Hand‑ritten | 10 | 58.87 | 47.15 | 33.81 | 47.10 | 58.30 | 59.10 | 57.47 | 57.69 | 62.41 |
20 | 59.05 | 36.66 | 17.49 | 35.88 | 59.18 | 58.27 | 57.53 | 57.27 | 59.73 | |
30 | 0.82 | 28.96 | 13.70 | 28.75 | 52.13 | nan | 54.09 | 51.14 | 55.80 | |
40 | 0.95 | 22.77 | 11.62 | 23.00 | 49.04 | nan | 53.96 | 45.70 | 56.39 | |
Scene | 10 | 43.66 | 43.01 | 26.79 | 43.16 | 43.47 | 43.48 | 43.50 | 43.85 | 45.14 |
20 | 42.77 | 40.75 | 6.420 | 41.01 | 42.35 | 42.38 | 43.61 | 41.60 | 45.32 | |
30 | 0.43 | 37.62 | 0.410 | 38.05 | 42.36 | nan | 43.87 | 40.29 | 46.22 | |
40 | 0.39 | 36.86 | 0.450 | 37.48 | 41.29 | nan | 43.56 | 42.36 | 48.69 | |
ORL | 10 | 74.38 | 68.09 | 69.27 | 72.67 | 75.25 | 74.89 | 74.31 | 75.43 | 75.64 |
20 | 75.39 | 55.04 | 62.73 | 64.46 | 74.72 | 74.67 | 74.89 | 76.53 | 76.97 | |
30 | 1.17 | 47.89 | 57.07 | 57.35 | 73.49 | nan | 74.67 | 75.16 | 75.61 | |
40 | 0.94 | 44.17 | 47.26 | 50.13 | 72.71 | nan | 73.76 | 74.22 | 75.09 | |
Yale | 10 | 45.05 | 44.38 | 41.97 | 45.98 | 45.16 | 45.40 | 45.56 | 45.14 | 46.52 |
20 | 45.24 | 43.78 | 42.27 | 44.35 | 44.97 | 43.91 | 45.81 | 46.10 | 46.53 | |
30 | 0.63 | 43.31 | 36.08 | 44.13 | 45.07 | nan | 46.10 | 43.35 | 47.79 | |
40 | 0.47 | 38.53 | 33.96 | 37.78 | 42.97 | nan | 44.74 | 48.03 | 46.40 | |
COIL20 | 10 | 71.04 | 70.61 | 70.88 | 70.87 | 70.13 | 70.20 | 68.52 | 71.45 | 72.11 |
20 | 70.89 | 68.39 | 72.00 | 68.28 | 70.70 | 71.18 | 68.49 | 71.04 | 73.73 | |
30 | 23.08 | 68.75 | 69.60 | 68.94 | 72.75 | nan | 69.97 | 72.39 | 73.43 | |
40 | 16.90 | 67.74 | 66.04 | 67.68 | 71.44 | nan | 69.31 | 71.28 | 72.50 | |
3Sources | 10 | 47.29 | 48.88 | 2.52 | 48.88 | 35.29 | 42.49 | 44.75 | 49.17 | 52.42 |
20 | 42.65 | 41.35 | 1.90 | 41.35 | 36.46 | 33.35 | 50.87 | 49.98 | 50.25 | |
30 | 24.63 | 36.04 | 2.23 | 36.04 | 31.48 | nan | 37.19 | 41.08 | 47.11 | |
40 | 18.94 | 33.21 | 1.45 | 33.21 | 27.56 | nan | 37.60 | 40.38 | 43.82 |
1 | CHEN L F, JIANG Q S. An extended EM algorithm for subspace clustering[J]. Frontiers of Computer Science in China, 2008, 2(1): 81-86. 10.1007/s11704-008-0007-x |
2 | ERTÖZ L, STEINBACH M, KUMAR V. Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data[C]// Proceedings of the 3rd SIAM International Conference on Data Mining. Philadelphia, PA: SIAM, 2003: 47-58. 10.1137/1.9781611972733.5 |
3 | WU X D, KUMAR V, QUINLAN J R, et al. Top 10 algorithms in data mining[J]. Knowledge and Information Systems, 2008, 14(1): 1-37. 10.1007/s10115-007-0114-2 |
4 | MULLER K R, MIKA S, RATSCH G, et al. An introduction to kernel‑based learning algorithms[J]. IEEE Transactions on Neural Networks, 2001, 12(2): 181-201. 10.1109/72.914517 |
5 | von LUXBURG U. A tutorial on spectral clustering[J]. Statistics and Computing, 2007, 17(4): 395-416. 10.1007/s11222-007-9033-z |
6 | VIDAL R. Subspace clustering[J]. IEEE Signal Processing Magazine, 2011, 28(2): 52-68. 10.1109/msp.2010.939739 |
7 | LUSCOMBE N M, GREENBAUM D, GERSTEIN M. What is bioinformatics? a proposed definition and overview of the field[J]. Methods of Information in Medicine, 2001, 40(4): 346-358. 10.1055/s-0038-1634431 |
8 | RAJENDRAN P, MADHESWARAN M. Hybrid medical image classification using association rule mining with decision tree algorithm[J]. Journal of Computing, 2010, 2(1):127-136. |
9 | BEEFERMAN D, BERGER A. Agglomerative clustering of a search engine query log[C]// Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2000: 407-416. 10.1145/347090.347176 |
10 | SHI J B, MALIK J. Normalized cuts and image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22(8): 888-905. 10.1109/34.868688 |
11 | SHEPITSEN A, GEMMELL J, MOBASHER B, et al. Personalized recommendation in social tagging systems using hierarchical clustering[C]// Proceedings of the 2008 ACM Conference on Recommender Systems. New York: ACM, 2008: 259-266. 10.1145/1454008.1454048 |
12 | LEE D D, SEUNG H S. Learning the parts of objects by non‑ negative matrix factorization[J]. Nature, 1999, 401(6755): 788-791. 10.1038/44565 |
13 | KALTON G, KISH L. Some efficient random imputation methods[J]. Communications in Statistics ― Theory and Methods, 1984, 13(16): 1919-1939. 10.1080/03610928408828805 |
14 | DEMPSTER A P, LAIRD N M, RUBIN D B. Maximum likelihood from incomplete data via the EM algorithm[J]. Journal of the Royal Statistical Society: Series B (Methodological), 1977, 39(1): 1-22. 10.1111/j.2517-6161.1977.tb01600.x |
15 | LITTLE R J A, RUBIN D B. Statistical Analysis with Missing Data[M]. 2nd ed. Hoboken, NJ: John Wiley & Sons, Inc., 2002:200-220. 10.1002/9781119013563 |
16 | 金勇进. 缺失数据的插补调整[J]. 数理统计与管理, 2001, 20(6): 47-53. 10.3969/j.issn.1002-1566.2001.06.012 |
JIN Y J. Imputation adjustment for missing data[J]. Journal of Applied Statistics and Management, 2001, 20(6): 47-53. 10.3969/j.issn.1002-1566.2001.06.012 | |
17 | 熊巍,潘晗,刘立新. 稳健高效的高维成分数据近似零值插补方法及应用[J]. 统计研究, 2020, 37(5): 104-116. |
XIONG W, PAN H, LIU L X. Robust efficient imputation of rounded zeros in high‑dimensional compositional data and its applications[J]. Statistical Research, 2020, 37(5): 104-116. | |
18 | LUX T C H, WATSON L T, CHANG T H, et al. Interpolation of sparse high‑dimensional data[J]. Numerical Algorithms, 2021, 88(1): 281-313. 10.1007/s11075-020-01040-2 |
19 | 武森,冯小东,单志广. 基于不完备数据聚类的缺失数据填补方法[J]. 计算机学报, 2012, 35(8): 1726-1738. 10.3724/sp.j.1016.2012.01726 |
WU S, FENG X D, SHAN Z G. Missing data imputation approach based on incomplete data clustering[J]. Chinese Journal of Computers, 2012, 35(8): 1726-1738. 10.3724/sp.j.1016.2012.01726 | |
20 | 陈静杰,车洁. 基于标准欧氏距离的燃油流量缺失数据填补算法[J]. 计算机科学, 2017, 44(6A): 109-111, 125. 10.11896/j.issn.1002-137X.2017.6A.023 |
CHEN J J, CHE J. Fuel flow missing‑value imputation method based on standardized Euclidean distance[J]. Computer Science, 2017, 44(6A): 109-111, 125. 10.11896/j.issn.1002-137X.2017.6A.023 | |
21 | DABERDAKU S, TAVAZZI E, DI CAMILLO B. A combined interpolation and weighted K‑nearest neighbours approach for the imputation of longitudinal ICU laboratory data[J]. Journal of Healthcare Informatics Research, 2020, 4(2): 174-188. 10.1007/s41666-020-00069-1 |
22 | 陈帅,赵明,郭栋,等. 基于SVD‑KDR算法的工业监测数据插补技术[J]. 机械工程学报, 2021, 57(2): 30-38. 10.3901/jme.2021.02.030 |
CHEN S, ZHAO M, GUO D, et al. Missing data imputation using SVD‑ KDR algorithm in industrial monitoring data[J]. Journal of Mechanical Engineering, 2021, 57(2): 30-38. 10.3901/jme.2021.02.030 | |
23 | GARCÍA‑LAENCINA P J, SANCHO‑GÓMEZ J L, FIGUEIRAS‑ VIDAL A R, et al. K nearest neighbours with mutual information for simultaneous classification and missing data imputation[J]. Neurocomputing, 2009, 72(7/8/9): 1483-1493. 10.1016/j.neucom.2008.11.026 |
24 | 项亮. 推荐系统实践[M]. 北京:人民邮电出版社, 2012: 64-72. |
XIANG L. Recommender System Practice[M]. Beijing: Posts and Telecommunications Press, 2012: 64-72. | |
25 | DEZA M M, DEZA E. Distances and similarities in data analysis[M]// Encyclopedia of Distances. 2nd ed. Berlin: Springer, 2013: 291-305. 10.1007/978-3-642-30958-8_17 |
26 | XU W, LIU X, GONG Y H. Document clustering based on non‑ negative matrix factorization[C]// Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2003: 267-273. 10.1145/860435.860485 |
[1] | 李顺勇, 李师毅, 胥瑞, 赵兴旺. 基于自注意力融合的不完整多视图聚类算法[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2696-2703. |
[2] | 王清, 赵杰煜, 叶绪伦, 王弄潇. 统一框架的增强深度子空间聚类方法[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 1995-2003. |
[3] | 董瑶, 付怡雪, 董永峰, 史进, 陈晨. 不完整多视图聚类综述[J]. 《计算机应用》唯一官方网站, 2024, 44(6): 1673-1682. |
[4] | 蒋小霞, 黄瑞章, 白瑞娜, 任丽娜, 陈艳平. 基于事件表示和对比学习的深度事件聚类方法[J]. 《计算机应用》唯一官方网站, 2024, 44(6): 1734-1742. |
[5] | 黄天宇, 李远兴, 陈昊, 郭紫佳, 魏明军. 地空协同场景下加权模糊聚类用户簇划分方法[J]. 《计算机应用》唯一官方网站, 2024, 44(5): 1555-1561. |
[6] | 徐童童, 解滨, 张春昊, 张喜梅. 融合转移概率矩阵的多阶最近邻图聚类算法[J]. 《计算机应用》唯一官方网站, 2024, 44(5): 1527-1538. |
[7] | 高麟, 周宇, 邝得互. 进化双层自适应局部特征选择[J]. 《计算机应用》唯一官方网站, 2024, 44(5): 1408-1414. |
[8] | 丁雨, 张瀚霖, 罗荣, 孟华. 基于信念子簇切割的模糊聚类算法[J]. 《计算机应用》唯一官方网站, 2024, 44(4): 1128-1138. |
[9] | 孟圣洁, 于万钧, 陈颖. 最大相关和最大差异的高维数据特征选择算法[J]. 《计算机应用》唯一官方网站, 2024, 44(3): 767-771. |
[10] | 孙林, 刘梦含. 基于自适应布谷鸟优化特征选择的K-means聚类[J]. 《计算机应用》唯一官方网站, 2024, 44(3): 831-841. |
[11] | 张卓, 陈花竹. 基于一致性和多样性的多尺度自表示学习的深度子空间聚类[J]. 《计算机应用》唯一官方网站, 2024, 44(2): 353-359. |
[12] | 杨成昊, 胡节, 王红军, 彭博. 基于注意力机制的不完备多视图聚类算法[J]. 《计算机应用》唯一官方网站, 2024, 44(12): 3784-3789. |
[13] | 朱云华, 孔兵, 周丽华, 陈红梅, 包崇明. 图对比学习引导的多视图聚类网络[J]. 《计算机应用》唯一官方网站, 2024, 44(10): 3267-3274. |
[14] | 尹春勇, 周永成. 双端聚类的自动调整聚类联邦学习[J]. 《计算机应用》唯一官方网站, 2024, 44(10): 3011-3020. |
[15] | 徐雪冉, 杨庚, 黄喻先. 横向联邦学习中差分隐私聚类算法[J]. 《计算机应用》唯一官方网站, 2024, 44(1): 217-222. |
阅读次数 | ||||||
全文 |
|
|||||
摘要 |
|
|||||