Official website of Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (12): 3772-3778. DOI: 10.11772/j.issn.1001-9081.2022121838

• Data Science and Technology •

Missing value attention clustering algorithm based on latent factor model in subspace

Xiaofei WANG1,2, Shengli BAO1,2(), Jionghuan CHEN1,2

  1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu Sichuan 610041, China
    2. University of Chinese Academy of Sciences, Beijing 100049, China
  • Received:2022-12-12 Revised:2023-02-13 Accepted:2023-02-16 Online:2023-03-09 Published:2023-12-10
  • Contact: Shengli BAO
  • About author: WANG Xiaofei (1997—), male, born in Cili, Hunan, M.S. candidate. His research interests include machine learning and recommendation algorithms.
    BAO Shengli (1973—), male, born in Huangshan, Anhui, Ph.D., research fellow. His research interests include software engineering and big data. Email: baoshengli@casit.com.cn
    CHEN Jionghuan (1998—), male, born in Weifang, Shandong, M.S. candidate. His research interests include machine learning and big data.
  • Supported by:
    Western Young Scholars Project of Chinese Academy of Sciences (RRJZ2021003)

Missing value attention clustering algorithm based on latent factor model in subspace

Xiaofei WANG1,2, Shengli BAO1,2(), Jionghuan CHEN1,2   

  1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu Sichuan 610041, China
    2.University of Chinese Academy of Sciences,Beijing 100049,China
  • Received:2022-12-12 Revised:2023-02-13 Accepted:2023-02-16 Online:2023-03-09 Published:2023-12-10
  • Contact: Shengli BAO
  • About author: WANG Xiaofei, born in 1997, M.S. candidate. His research interests include machine learning and recommendation algorithms.
    BAO Shengli, born in 1973, Ph.D., research fellow. His research interests include software engineering and big data.
    CHEN Jionghuan, born in 1998, M.S. candidate. His research interests include machine learning and big data.
  • Supported by:
    Western Young Scholars Project of Chinese Academy of Sciences(RRJZ2021003)

Abstract:

To address the problems that, when filling missing samples, traditional clustering algorithms find it hard to measure sample similarity and produce low-quality filled data, a missing value attention clustering algorithm based on Latent Factor Model (LFM) in subspace was proposed. First, the original data space was mapped to a low-dimensional subspace by LFM to reduce the sparsity of the samples. Second, an attention weight graph between different features was constructed from the feature matrix obtained by decomposing the original space, optimizing the similarity calculation between subspace samples so that sample similarity was computed more accurately and generalized better. Finally, to reduce the excessive time complexity of the sample similarity calculation, a multi-pointer attention weight graph was designed for optimization. Experiments were conducted on four datasets with values missing at random in given proportions. On the Hand-digits dataset, compared with the K-nearest neighbors Interpolation Subspace Clustering (KISC) algorithm for high-dimensional feature-missing data, the proposed algorithm improved the clustering Accuracy (ACC) by 2.33 percentage points and the Normalized Mutual Information (NMI) by 2.77 percentage points at a missing proportion of 10%, and improved ACC by 0.39 percentage points and NMI by 1.33 percentage points at a missing proportion of 20%, verifying the effectiveness of the proposed algorithm.

Key words: latent factor model, missing value, attention mechanism, clustering algorithm, subspace

Abstract:

To solve the problems that it is difficult for traditional clustering algorithms to measure sample similarity and that the quality of filled data is poor when filling missing samples, a missing value attention clustering algorithm based on Latent Factor Model (LFM) in subspace was proposed. First, LFM was used to map the original data space to a low-dimensional subspace to reduce the sparsity of the samples. Then, an attention weight graph between different features was constructed from the feature matrix obtained by decomposing the original space, and the similarity calculation between subspace samples was optimized, making sample similarity calculation more accurate and better generalized. Finally, to reduce the high time complexity of the sample similarity calculation, a multi-pointer attention weight graph was designed for optimization. The algorithm was tested on four datasets with values missing at random in given proportions. On the Hand-digits dataset, compared with the K-nearest neighbors Interpolation Subspace Clustering (KISC) algorithm for high-dimensional feature-missing data, the proposed algorithm improved the Accuracy (ACC) by 2.33 percentage points and the Normalized Mutual Information (NMI) by 2.77 percentage points at a missing proportion of 10%, and improved ACC by 0.39 percentage points and NMI by 1.33 percentage points at a missing proportion of 20%, which verified the effectiveness of the proposed algorithm.
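The two core steps in the abstract — a latent factor decomposition fit only on observed entries, followed by a feature-level attention weight graph built from the decomposed feature matrix — can be illustrated with a minimal NumPy sketch. Everything below (the function name `lfm_embed`, the gradient-descent update, and the scaled-softmax construction of the weight graph) is an illustrative assumption under the abstract's description, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def lfm_embed(X, mask, k=8, lr=0.01, reg=0.1, epochs=200):
    """Factor X ≈ P @ Q.T using only observed entries (mask == 1),
    so missing values never influence the learned subspace."""
    n, d = X.shape
    P = rng.normal(scale=0.1, size=(n, k))  # sample embeddings (the subspace)
    Q = rng.normal(scale=0.1, size=(d, k))  # feature embeddings
    for _ in range(epochs):
        E = (X - P @ Q.T) * mask            # reconstruction error, observed entries only
        P += lr * (E @ Q - reg * P)         # regularized gradient steps
        Q += lr * (E.T @ P - reg * Q)
    return P, Q

# Toy data with roughly 20% of entries missing at random.
X = rng.random((50, 12))
mask = (rng.random(X.shape) > 0.2).astype(float)
P, Q = lfm_embed(X * mask, mask)

# Attention weights between features, derived from the feature matrix Q:
# a softmax over scaled dot products (the exact construction used by the
# paper's weight graph is not given in the abstract and is assumed here).
S = Q @ Q.T / np.sqrt(Q.shape[1])
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

print(P.shape, A.shape)  # (50, 8) (12, 12)
```

`P` gives the low-dimensional sample representations on which clustering would run, and each row of `A` sums to 1, weighting how much each feature attends to the others when comparing two samples.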

Key words: Latent Factor Model (LFM), missing value, attention mechanism, clustering algorithm, subspace

CLC number: