Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (11): 3330-3336. DOI: 10.11772/j.issn.1001-9081.2021111961

• CCF Bigdata 2021 •

Neural tangent kernel K‑Means clustering

Mei WANG1,2, Xiaohui SONG1, Yong LIU3,4, Chuanhai XU1

  1. School of Computer and Information Technology, Northeast Petroleum University, Daqing, Heilongjiang 163318, China
  2. Heilongjiang Key Laboratory of Petroleum Big Data and Intelligent Analysis (Northeast Petroleum University), Daqing, Heilongjiang 163318, China
  3. Gaoling School of Artificial Intelligence, Renmin University of China, Beijing 100872, China
  4. Beijing Key Laboratory of Big Data Management and Analysis Method (Renmin University of China), Beijing 100872, China
  • Received: 2021-11-17 Revised: 2021-12-13 Accepted: 2021-12-23 Online: 2022-01-04 Published: 2022-11-10
  • Contact: Yong LIU
  • About the authors: WANG Mei, born in 1976, Ph. D., professor. Her research interests include machine learning, kernel methods, and model selection.
    SONG Xiaohui, born in 1998, M. S. candidate. Her research interests include deep kernel learning.
    XU Chuanhai, born in 1998, M. S. candidate. His research interests include deep kernel learning.
    Corresponding author: LIU Yong, born in 1986, Ph. D., research associate. His research interests include large‑scale machine learning, automated machine learning, and statistical machine learning theory.
  • Supported by:
    National Natural Science Foundation of China (51774090); Postdoctoral Research Startup Fund of Heilongjiang Province (LBH-Q20080); Natural Science Foundation of Heilongjiang Province (LH2020F003); Higher Education Teaching Reform Key Entrusted Project of Heilongjiang Province (SJGZ20190011)

Neural tangent kernel K‑Means clustering (神经正切核K‑Means聚类)

WANG Mei (王梅)1,2, SONG Xiaohui (宋晓晖)1, LIU Yong (刘勇)3,4, XU Chuanhai (许传海)1

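  1. School of Computer and Information Technology, Northeast Petroleum University, Daqing, Heilongjiang 163318
  2. Heilongjiang Key Laboratory of Petroleum Big Data and Intelligent Analysis (Northeast Petroleum University), Daqing, Heilongjiang 163318
  3. Gaoling School of Artificial Intelligence, Renmin University of China, Beijing 100872
  4. Beijing Key Laboratory of Big Data Management and Analysis Method (Renmin University of China), Beijing 100872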
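  • Corresponding author: LIU Yong (刘勇)
  • About the authors: WANG Mei (王梅), born in 1976, female, from Baoding, Hebei; professor, Ph. D., CCF member. Her research interests include machine learning, kernel methods, and model selection.
    SONG Xiaohui (宋晓晖), born in 1998, female, from Jinan, Shandong; M. S. candidate, CCF member. Her research interests include deep kernel learning.
    LIU Yong (刘勇), born in 1986, male, from Yiyang, Hunan; research associate, Ph. D., CCF member. His research interests include large-scale machine learning, automated machine learning, and statistical machine learning theory. liuyonggsai@ruc.edu.cn
    XU Chuanhai (许传海), born in 1998, male, from Jixi, Heilongjiang; M. S. candidate, CCF member. His research interests include deep kernel learning.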
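  • Supported by:
    National Natural Science Foundation of China (51774090); Postdoctoral Research Startup Fund of Heilongjiang Province (LBH-Q20080); Natural Science Foundation of Heilongjiang Province (LH2020F003); Higher Education Teaching Reform Key Entrusted Project of Heilongjiang Province (SJGZ20190011)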

Abstract:

To address the problem that K-Means clustering results are sensitive to the sample distribution because the cluster centers are updated with the mean, a Neural Tangent Kernel K-Means (NTKKM) clustering algorithm was proposed. First, the data in the input space were mapped to a high-dimensional feature space through the Neural Tangent Kernel (NTK). Then, K-Means clustering was performed in that feature space, with the cluster centers updated by taking both the between-cluster and the within-cluster distances into account. Finally, the clustering results were obtained. On the car and breast-tissue datasets, three evaluation indexes (accuracy, Adjusted Rand Index (ARI), and FM index) were measured for the NTKKM clustering algorithm and the comparison algorithms. Experimental results show that the NTKKM clustering algorithm outperforms both the K-Means clustering algorithm and the Gaussian kernel K-Means clustering algorithm in clustering quality and stability. Compared with the traditional K-Means clustering algorithm, the NTKKM clustering algorithm increases the accuracy by 14.9% and 9.4%, the ARI by 9.7% and 18.0%, and the FM index by 12.0% and 12.0% on the two datasets respectively, indicating the excellent clustering performance of the NTKKM clustering algorithm.
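
The pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The kernel is the closed-form NTK of an infinitely wide two-layer ReLU network, since the abstract does not specify the network architecture. The clustering loop is standard kernel K-Means using the plain feature-space mean, not the paper's center update that balances between-cluster and within-cluster distances. The FM index is taken to be the Fowlkes-Mallows index. All function names and the toy data are hypothetical.

```python
import math

def ntk_relu_2layer(x, y):
    """NTK of an infinitely wide two-layer ReLU network (assumed architecture)."""
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0.0 or ny == 0.0:
        return 0.0
    u = sum(a * b for a, b in zip(x, y)) / (nx * ny)
    u = max(-1.0, min(1.0, u))  # guard against rounding outside [-1, 1]
    theta = math.acos(u)
    k0 = (math.pi - theta) / math.pi  # arc-cosine kernel of degree 0
    k1 = (u * (math.pi - theta) + math.sqrt(1.0 - u * u)) / math.pi  # degree 1
    return nx * ny * (u * k0 + k1)

def kernel_kmeans(K, k, init_labels, max_iter=100):
    """Standard kernel K-Means on a precomputed kernel matrix K (list of lists)."""
    n = len(K)
    labels = list(init_labels)
    for _ in range(max_iter):
        members = [[i for i in range(n) if labels[i] == c] for c in range(k)]
        # Squared norm of each cluster mean: (1/|C|^2) * sum_{j,l in C} K[j][l]
        mean_sq = [sum(K[j][l] for j in m for l in m) / len(m) ** 2 if m else 0.0
                   for m in members]
        new_labels = []
        for i in range(n):
            best_c, best_d = labels[i], float("inf")
            for c, m in enumerate(members):
                if not m:
                    continue  # skip empty clusters
                # Squared feature-space distance from sample i to the mean of cluster c
                d = K[i][i] - 2.0 * sum(K[i][j] for j in m) / len(m) + mean_sq[c]
                if d < best_d:
                    best_c, best_d = c, d
            new_labels.append(best_c)
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels

def fowlkes_mallows(true, pred):
    """Fowlkes-Mallows index computed from pair counts."""
    tp = fp = fn = 0
    for i in range(len(true)):
        for j in range(i + 1, len(true)):
            same_t, same_p = true[i] == true[j], pred[i] == pred[j]
            if same_t and same_p:
                tp += 1
            elif same_p:
                fp += 1
            elif same_t:
                fn += 1
    denom = math.sqrt((tp + fp) * (tp + fn))
    return tp / denom if denom else 0.0

# Toy data (hypothetical): two groups pointing along different axes.
X = [(1.0, 0.1), (0.9, 0.0), (1.1, -0.1), (0.1, 1.0), (0.0, 0.9), (-0.1, 1.1)]
truth = [0, 0, 0, 1, 1, 1]
K = [[ntk_relu_2layer(a, b) for b in X] for a in X]
labels = kernel_kmeans(K, k=2, init_labels=[0, 1, 0, 1, 0, 1])
fm = fowlkes_mallows(truth, labels)
```

Because only the kernel matrix enters the loop, swapping in a Gaussian kernel for comparison only changes how `K` is built. The deliberately mixed initial labels show that the assignment step separates the two groups without starting from the correct partition.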

Key words: Neural Tangent Kernel (NTK), K-Means, kernel clustering, feature space, kernel function

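Abstract (Chinese edition):

To address the problem that the K-Means clustering algorithm updates cluster centers with the mean, which makes the clustering results depend on the sample distribution, the Neural Tangent Kernel K-Means clustering algorithm (NTKKM) was proposed. First, the data in the input space were mapped to a high-dimensional feature space through the Neural Tangent Kernel (NTK). Then, K-Means clustering was performed in the high-dimensional feature space, and the cluster centers were updated with a method that considers both between-cluster and within-cluster distances. Finally, the clustering results were obtained. On the car and breast-tissue datasets, three evaluation indexes of the NTKKM clustering algorithm were measured: accuracy, Adjusted Rand Index (ARI), and FM index. Experimental results show that the NTKKM clustering algorithm is better than the K-Means clustering algorithm and the Gaussian kernel K-Means clustering algorithm in both clustering quality and stability. Compared with the traditional K-Means clustering algorithm, the NTKKM clustering algorithm improves the accuracy by 14.9% and 9.4% respectively, the ARI by 9.7% and 18.0% respectively, and the FM index by 12.0% and 12.0% respectively, which verifies the good clustering performance of the NTKKM clustering algorithm.

Key words: neural tangent kernel, K-Means, kernel clustering, feature space, kernel function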

CLC Number: