《计算机应用》唯一官方网站 ›› 2022, Vol. 42 ›› Issue (11): 3330-3336.DOI: 10.11772/j.issn.1001-9081.2021111961
所属专题: 第九届CCF大数据学术会议(CCF Bigdata 2021)
收稿日期:2021-11-17
修回日期:2021-12-13
接受日期:2021-12-23
发布日期:2022-01-04
出版日期:2022-11-10
通讯作者:
刘勇
作者简介:王梅(1976—),女,河北保定人,教授,博士,CCF会员,主要研究方向:机器学习、核方法、模型选择基金资助:
Mei WANG1,2, Xiaohui SONG1, Yong LIU3,4(
), Chuanhai XU1
Received:2021-11-17
Revised:2021-12-13
Accepted:2021-12-23
Online:2022-01-04
Published:2022-11-10
Contact:
Yong LIU
About author:WANG Mei, born in 1976, Ph. D., professor. Her research interests include machine learning, kernel methods, model selection.Supported by:摘要:
针对K-Means聚类算法利用均值更新聚类中心,导致聚类结果受样本分布影响的问题,提出了神经正切核K-Means聚类算法(NTKKM)。首先通过神经正切核(NTK)将输入空间的数据映射到高维特征空间,然后在高维特征空间中进行K-Means聚类,并采用兼顾簇间与簇内距离的方法更新聚类中心,最后得到聚类结果。在car和breast-tissue数据集上,对NTKKM聚类算法的准确率、调整兰德系数(ARI)及FM指数这3个评价指标进行统计。实验结果表明,NTKKM聚类算法的聚类效果以及稳定性均优于K?Means聚类算法和高斯核K-Means聚类算法。NTKKM聚类算法与传统的K-Means聚类算法相比,准确率分别提升了14.9%和9.4%,ARI分别提升了9.7%和18.0%,FM指数分别提升了12.0%和12.0%,验证了NTKKM聚类算法良好的聚类性能。
中图分类号:
王梅, 宋晓晖, 刘勇, 许传海. 神经正切核K‑Means聚类[J]. 计算机应用, 2022, 42(11): 3330-3336.
Mei WANG, Xiaohui SONG, Yong LIU, Chuanhai XU. Neural tangent kernel K‑Means clustering[J]. Journal of Computer Applications, 2022, 42(11): 3330-3336.
| 数据集名称 | 维度 | 类别个数 | 样本数量 |
|---|---|---|---|
| car | 6 | 4 | 1 728 |
| breast‑tissue | 9 | 6 | 106 |
| winequality‑red | 12 | 6 | 1 599 |
| iris | 4 | 3 | 150 |
表1 实验数据集信息
Tab. 1 Experimental dataset information
| 数据集名称 | 维度 | 类别个数 | 样本数量 |
|---|---|---|---|
| car | 6 | 4 | 1 728 |
| breast‑tissue | 9 | 6 | 106 |
| winequality‑red | 12 | 6 | 1 599 |
| iris | 4 | 3 | 150 |
| 数据集 | K‑Means | GKKM | NTKKM |
|---|---|---|---|
| car | 0.706 | 0.731 | 0.811 |
| breast tissue | 0.854 | 0.884 | 0.934 |
| winequality‑red | 0.731 | 0.637 | 0.800 |
| iris | 0.889 | 0.904 | 0.924 |
表2 三种算法的准确率
Tab. 2 Accuracies of three algorithms
| 数据集 | K‑Means | GKKM | NTKKM |
|---|---|---|---|
| car | 0.706 | 0.731 | 0.811 |
| breast tissue | 0.854 | 0.884 | 0.934 |
| winequality‑red | 0.731 | 0.637 | 0.800 |
| iris | 0.889 | 0.904 | 0.924 |
| 数据集 | K‑Means | GKKM | NTKKM |
|---|---|---|---|
| car | 0.729 | 0.745 | 0.800 |
| breast tissue | 0.712 | 0.761 | 0.840 |
| winequality‑red | 0.758 | 0.501 | 0.801 |
| iris | 0.867 | 0.710 | 0.822 |
表3 三种算法的ARI
Tab. 3 Adjusted Rand indexes of three algorithms
| 数据集 | K‑Means | GKKM | NTKKM |
|---|---|---|---|
| car | 0.729 | 0.745 | 0.800 |
| breast tissue | 0.712 | 0.761 | 0.840 |
| winequality‑red | 0.758 | 0.501 | 0.801 |
| iris | 0.867 | 0.710 | 0.822 |
| 数据集 | K‑Means | GKKM | NTKKM |
|---|---|---|---|
| car | 0.581 | 0.561 | 0.651 |
| breast tissue | 0.750 | 0.753 | 0.840 |
| winequality‑red | 0.509 | 0.453 | 0.611 |
| iris | 0.751 | 0.753 | 0.768 |
表4 三种算法精确率和召回率的几何平均数
Tab. 4 Fowlkes and Mallows Indexes of three algorithms
| 数据集 | K‑Means | GKKM | NTKKM |
|---|---|---|---|
| car | 0.581 | 0.561 | 0.651 |
| breast tissue | 0.750 | 0.753 | 0.840 |
| winequality‑red | 0.509 | 0.453 | 0.611 |
| iris | 0.751 | 0.753 | 0.768 |
| 1 | FAHIM A M, SALEM A M, TORKEY F A, et al. An efficient enhanced k‑means clustering algorithm[J]. Journal of Zhejiang University — SCIENCE A (Applied Physics and Engineering), 2006, 7(10): 1626-1633. 10.1631/jzus.2006.a1626 |
| 2 | 汪敏,武禹伯,闵帆. 基于多种聚类算法和多元线性回归的多分类主动学习算法[J]. 计算机应用, 2020, 40(12):3437-3444. 10.11772/j.issn.1001-9081.2020060921 |
| WANG M, WU Y B, MIN F. Multi‑category active learning algorithm based on multiple clustering algorithms and multiple linear regression[J]. Journal of Computer applications, 2020, 40(12):3437-3444. 10.11772/j.issn.1001-9081.2020060921 | |
| 3 | LAWRENCE L O. A Primer on Cluster Analysis by James C. Bezdck[J]. IEEE Systems, Man, and Cybernetics Magazine, 2018, 4(1):48-50. 10.1109/msmc.2017.2769202 |
| 4 | 于佐军,秦欢. 基于改进蜂群算法的K‑means算法[J]. 控制与决策, 2018, 33(1):181-185. |
| YU Z J, QIN H. K‑means algorithm based on improved artificial bee swarm algorithm[J]. Control and Decision, 2018, 33(1):181-185. | |
| 5 | 覃华,詹娟娟,苏一丹. 基于概率无向图模型的近邻传播聚类算法[J]. 控制与决策, 2017, 32(10):1796-1802. 10.13195/j.kzyjc.2016.0861 |
| QIN H, ZHAN J J, SU Y D. Affinity propagation clustering algorithm based on probabilistic undirected graphical model[J]. Control and Decision, 2017, 32(10): 1796-1802. 10.13195/j.kzyjc.2016.0861 | |
| 6 | 周涛,陆惠玲. 数据挖掘中聚类算法研究进展[J]. 计算机工程与应用, 2012, 48(12):100-111. 10.3778/j.issn.1002-8331.2012.12.021 |
| ZHOU T, LU H L. Clustering algorithm research advances on data mining[J]. Computer Engineering and Applications, 2012, 48(12):100-111. 10.3778/j.issn.1002-8331.2012.12.021 | |
| 7 | GIROLAMI M. Mercer kernel‑based clustering in feature space[J]. IEEE Transactions on Neural Networks, 2002, 13(3):780-784. 10.1109/tnn.2002.1000150 |
| 8 | 张莉,周伟达,焦李成. 核聚类算法[J]. 计算机学报, 2002, 25(6): 587-590. 10.3321/j.issn:0254-4164.2002.06.005 |
| ZHANG L, ZHOU W D, JIAO L C. Kernel clustering algorithm[J]. Chinese Journal of Computers, 2002, 25(6): 587-590. 10.3321/j.issn:0254-4164.2002.06.005 | |
| 9 | BEN‑HUR A, HORN D, SIEGELMANN H T, et al. Support vector clustering[J]. Journal of Machine Learning Research, 2001, 2:125-137. |
| 10 | 徐小来,房晓丽. 基于改进的直觉模糊核聚类的图像分割方法[J]. 计算机工程与应用, 2019, 55(17):227-231. 10.3778/j.issn.1002-8331.1904-0307 |
| XU X L, FANG X L. Image segmentation method based on improved intuitive fuzzy kernel c‑means clustering algorithms[J]. Computer Engineering and Applications, 2019, 55(17):227-231. 10.3778/j.issn.1002-8331.1904-0307 | |
| 11 | 杨飞,朱志祥. 基于特征和空间信息的核模糊C-均值聚类算法[J]. 电子科技, 2016, 29(2):16-19. |
| YANG F, ZHU Z X. Kernelized fuzzy C‑means clustering algorithm based on feature and spatial information[J]. Electronic Science and Technology, 2016, 29(2):16-19. | |
| 12 | XIANG L Y, ZHAO G H, LI Q, et al. A fast and effective multiple kernel clustering method on incomplete data[J]. Computers, Materials & Continua, 2021, 67(1):267-284. 10.32604/cmc.2021.013488 |
| 13 | LIU X W, ZHU E, LIU J Y, et al. SimpleMKKM: simple multiple kernel k‑means[EB/OL]. (2020-05-12) [2021-07-20]. . 10.1109/tpami.2022.3198638 |
| 14 | LIU Y, DING L. Nearly optimal risk bounds for kernel K‑means[EB/OL]. (2020-03-09) [2021-07-01].. |
| 15 | 孔锐,张国宣,施泽生,等. 基于核的K-均值聚类[J]. 计算机工程, 2004, 30(11):12-13, 80. 10.3969/j.issn.1000-3428.2004.11.005 |
| KONG R, ZHANG G X, SHI Z S, et al. Kernel‑based K‑means clustering[J]. Computer Engineering, 2004, 30(11):12-13, 80. 10.3969/j.issn.1000-3428.2004.11.005 | |
| 16 | NEAL R M. Bayesian Learning for Neural Networks[M]. New York: Springer, 1996: 29-53. 10.1007/978-1-4612-0745-0 |
| 17 | LEE J, BAHRI Y, NOVAK R, et al. Deep neural networks as Gaussian processes[EB/OL]. (2018-03-03) [2020-05-19].. |
| 18 | MATTHEWS A G D G, ROWLAND M, HRON J, et al. Gaussian process behaviour in wide deep neural networks[EB/OL]. (2018-08-16) [2020-05-19].. |
| 19 | JACOT A, GABRIEL F, HONGLER C. Neural tangent kernel: convergence and generalization in neural networks[C]// Proceedings of the 32nd International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2018:8580-8589. |
| 20 | ARORA S, DU S S, HU W, et al. On exact computation with an infinitely wide neural net[C/OL]// Proceedings of the 33rd Conference on Neural Information Processing Systems. [2020-12-24].. |
| 21 | 王梅,许传海,刘勇. 基于神经正切核的多核学习方法[J]. 计算机应用, 2021, 41(12):3462-3467. 10.11772/j.issn.1001-9081.2021060998 |
| WANG M, XU C H, LIU Y. Multi‑kernel learning methods based on neural positive tangent kernel[J]. Journal of Computer Applications, 2021, 41(12):3462-3467. 10.11772/j.issn.1001-9081.2021060998 | |
| 22 | ARORA S, DU S S, LI Z Y, et al. Harnessing the power of infinitely wide deep nets on small‑data tasks[EB/OL]. (2019-10-27) [2021-01-08].. |
| 23 | SCHÖLKOPF B, SEBASTIAN MIKA S, BURGES C J C, et al. Input space versus feature space in kernel‑based methods[J]. IEEE Transactions on Neural Networks, 1999, 10(5): 1000-1017. 10.1109/72.788641 |
| 24 | 王守志,何东健,李文,等. 基于核K-均值聚类算法的植物叶部病害识别[J]. 农业机械学报, 2009, 40(3):152-155. |
| WANG S Z, HE D J, LI W, et al. Plant leaf disease identification based on the nuclear K‑means clustering algorithm[J]. Transactions of the Chinese Society for Agricultural Machinery, 2009, 40(3): 152-155. | |
| 25 | 王宇,李晓利. 核k-凝聚聚类算法[J]. 大连理工大学学报, 2007, 47(5):763-766. |
| WANG Y, LI X L. Kernel k‑aggregate clustering algorithm[J]. Journal of Dalian University of Technology, 2007, 47(5): 763-766. | |
| 26 | HUO J, BI Y H, LÜ D R, et al. Cloud classification and distribution of cloud types in Beijing using Ka‑band radar data[J]. Advances in Atmospheric Sciences, 2019, 36(8):793-803. 10.1007/s00376-019-8272-1 |
| 27 | TZORTZIS G, LIKAS A. The MinMax k‑Means clustering algorithm[J]. Pattern Recognition, 2014, 47(7):2505-2516. 10.1016/j.patcog.2014.01.015 |
| 28 | 翟东海,鱼江,高飞,等. 最大距离法选取初始簇中心的K‑means文本聚类算法的研究[J]. 计算机应用研究, 2014, 31(3):713-715, 719. |
| ZHAI D H, YU J, GAO F, et al. K‑means text clustering algorithm based on initial cluster centers selection according to maximum distance[J]. Application Research of Computers, 2014, 31(3):713-715, 719. | |
| 29 | DENG Z H, CHOI K S, CHUNG F L, et al. Enhanced soft subspace clustering integrating within‑cluster and between‑cluster information[J]. Pattern Recognition, 2010, 43(3):767-781. 10.1016/j.patcog.2009.09.010 |
| 30 | HUANG X H, YE Y M, ZHANG H J. Extensions of kmeans‑type algorithms: a new clustering framework by integrating intra‑cluster compactness and inter‑cluster separation[J]. IEEE Transactions on Neural Networks and Learning Systems, 2014, 25(8): 1433-1446. 10.1109/tnnls.2013.2293795 |
| 31 | PANG N, ZHAO X, WANG W, et al. Few‑shot text classification by leveraging bi‑directional attention and cross‑class knowledge[J]. Science China Information Sciences, 2021, 64(3): No.130103. 10.1007/s11432-020-3055-1 |
| 32 | 黄学雨,程世超. KNN优化的密度峰值聚类算法[J]. 通信技术, 2021, 54(7):1608-1618. 10.3969/j.issn.1002-0802.2021.07.010 |
| HUANG X Y, CHENG S C. KNN optimized density peak clustering algorithm[J]. Communication Technology, 2021, 54(7): 1608-1618. 10.3969/j.issn.1002-0802.2021.07.010 | |
| 33 | 王芙银,张德生,张晓. 结合鲸鱼优化算法的自适应密度峰值聚类算法[J]. 计算机工程与应用, 2021, 57(3):94-102. 10.3778/j.issn.1002-8331.2007-0205 |
| WANG F Y, ZHANG D S, ZHANG X. Adaptive density peak clustering algorithm combining with whale optimization algorithm[J]. Computer Engineering and Applications, 2021, 57(3): 94-102. 10.3778/j.issn.1002-8331.2007-0205 |
| [1] | 邱云志, 汪廷华, 戴小路. 双重特征加权模糊支持向量机[J]. 《计算机应用》唯一官方网站, 2022, 42(3): 683-687. |
| [2] | 祁祥洲, 邢红杰. 基于中心核对齐的多核单类支持向量机[J]. 《计算机应用》唯一官方网站, 2022, 42(2): 349-356. |
| [3] | 王梅, 许传海, 刘勇. 基于神经正切核的多核学习方法[J]. 《计算机应用》唯一官方网站, 2021, 41(12): 3462-3467. |
| [4] | 効琦, 尹增山, 高爽. 基于检测与跟踪相互迭代的极暗弱目标搜索算法[J]. 计算机应用, 2021, 41(10): 3017-3024. |
| [5] | 陈浩, 秦志光, 丁熠. 基于同一特征空间的多模态脑肿瘤分割方法[J]. 计算机应用, 2020, 40(7): 2104-2109. |
| [6] | 李飞, 杜亮, 任超宏. 基于全局融合的多核概念分解算法[J]. 计算机应用, 2019, 39(4): 1021-1026. |
| [7] | 孙石磊, 王超, 赵元棣. 基于轮廓系数的参数无关空中交通轨迹聚类方法[J]. 计算机应用, 2019, 39(11): 3293-3297. |
| [8] | 范君, 王新, 徐慧. 粒子群优化混合核极限学习机的构造煤厚度预测方法[J]. 计算机应用, 2018, 38(6): 1820-1825. |
| [9] | 忽丽莎, 王素贞, 陈益强, 胡春雨, 蒋鑫龙, 陈振宇, 高兴宇. 基于目标均衡度量的核增量学习跌倒检测方法[J]. 计算机应用, 2018, 38(4): 928-934. |
| [10] | 张乐园, 李佳烨, 李鹏清. 低秩约束的非线性属性选择算法[J]. 计算机应用, 2018, 38(12): 3444-3449. |
| [11] | 南敬昌, 崔洪艳. 基于新型二维核函数动态X参数的功放建模[J]. 计算机应用, 2017, 37(8): 2421-2426. |
| [12] | 李斌, 狄岚, 王少华, 于晓瞳. 基于改进核模糊C均值类间极大化聚类算法[J]. 计算机应用, 2016, 36(7): 1981-1987. |
| [13] | 郭倩, 杨红菊, 梁新彦. 基于新的空间关系特征的图像检索方法[J]. 计算机应用, 2016, 36(7): 1918-1922. |
| [14] | 李华, 李德玉, 王素格, 张晶. 多标记数据特征提取方法的核改进[J]. 计算机应用, 2015, 35(7): 1939-1944. |
| [15] | 王伟东, 刘兵, 管红杰, 周勇, 夏士雄. 基于核函数的谱嵌入聚类算法[J]. 计算机应用, 2015, 35(3): 761-765. |
| 阅读次数 | ||||||
|
全文 |
|
|||||
|
摘要 |
|
|||||