Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (2): 398-406.DOI: 10.11772/j.issn.1001-9081.2022010082

• Data science and technology • Previous Articles    

Parameter calculation algorithm of structural graph clustering driven by instance clusters

Chuanyu ZONG, Chao XIAN(), Xiufeng XIA   

  1. School of Computer Science,Shenyang Aerospace University,Shenyang Liaoning 110136,China
  • Received:2022-01-21 Revised:2022-04-06 Accepted:2022-04-11 Online:2022-04-25 Published:2023-02-10
  • Contact: Chao XIAN
  • About author:ZONG Chuanyu, born in 1985, Ph. D., associate professor. His research interests include data cleaning, data provenance, query processing and optimization.
    XIA Xiufeng, born in 1964, Ph. D., professor. His research interests include database, data warehouse.
  • Supported by:
    National Natural Science Foundation of China(61802268)

实例簇驱动的图结构聚类参数计算算法

宗传玉, 宪超(), 夏秀峰   

  1. 沈阳航空航天大学 计算机学院,沈阳 110136
  • 通讯作者: 宪超
  • 作者简介:宗传玉(1985—),男,山东潍坊人,副教授,博士,CCF会员,主要研究方向:数据清洗、数据溯源、查询处理与优化
    夏秀峰(1964—),男,山东胶南人,教授,博士,CCF会员,主要研究方向:数据库、数据仓库。
  • 基金资助:
    国家自然科学基金资助项目(61802268)

Abstract:

Clustering results of the pSCAN (pruned Structural Clustering Algorithm for Network) algorithm are influenced by the density constraint parameter and the similarity threshold parameter. If the requirements cannot be satisfied by the clustering results obtained by the clustering parameters provided by the user, then the user’s own clustering requirements can be expressed through instance clusters. Aiming at the problem of instance clusters expressing clustering query requirements, an instance cluster-driven structural graph clustering parameter calculation algorithm PART and its improved algorithm ImPART were proposed. Firstly, the influences of two clustering parameters on the clustering results were analyzed, and correlation subgraph of instance cluster was extracted. Secondly, the feasible interval of the density constraint parameter was obtained by analyzing the correlation subgraph, and the nodes in the instance cluster were divided into core nodes and non-core nodes according to the current density constraint parameter and the structural similarity between nodes. Finally, according to the node division result, the optimal similarity threshold parameter corresponding to the current density constraint parameter was calculated, and the obtained parameters were verified and optimized on the relevant subgraph until the clustering parameters that satisfy the requirements of the instance cluster were obtained. Experimental results on real datasets show that a set of effective parameters can be returned for user instance clusters by using the proposed algorithm, and the proposed improved algorithm ImPART is more than 20% faster than the basic algorithm PART, and can return the optimal clustering parameters that satisfy the requirements of instance clusters quickly and effectively for the user.

Key words: structural graph clustering, instance cluster, parameter calculation, node division, optimal parameter

摘要:

pSCAN算法的聚类结果受密度约束参数和相似度阈值参数的影响,如果用户提供的聚类参数得到的聚类结果无法满足需求,那么用户可以通过实例簇表达自己的聚类需求。针对实例簇表达聚类查询需求的问题,提出一种实例簇驱动的图结构聚类参数计算算法PART及其改进算法ImPART。首先,分析两个聚类参数对聚类结果的影响,并提取实例簇的相关子图;其次,对相关子图进行分析得到密度约束参数的可行区间,并根据当前密度约束参数和节点之间的结构相似度将实例簇内节点划分为核心节点和非核心节点;最后,依据节点划分结果计算出当前密度约束参数对应的最优相似度阈值参数,并在相关子图上对得到的参数进行验证和优化,直到得到满足实例簇需求的聚类参数。在真实数据集上的实验结果表明,所提算法能够为用户实例簇返回一组有效参数,且所提改进算法ImPART的运行时间比PART缩短了20%以上,能够快速有效地为用户返回满足实例簇要求的最优聚类参数。

关键词: 图结构聚类, 实例簇, 参数计算, 节点划分, 最优参数

CLC Number: