Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (8): 2248-2254.DOI: 10.11772/j.issn.1001-9081.2020010057

• Data science and technology • Previous Articles     Next Articles

Clustering tendency analysis algorithm based on data stream

FAN Zhongxin   

  1. National Experimental Teaching Demonstration Center for Atmospheric Science and Environmental Meteorology(Nanjing University of Information Science and Technology), Nanjing Jiangsu 210044, China
  • Received:2020-01-19 Revised:2020-03-17 Online:2020-08-10 Published:2020-03-24
  • Supported by:
    This work is partially supported by the National Key Research and Development Program of China (2018YFC1505804).

基于数据流的聚类趋势分析算法

樊仲欣   

  1. 大气科学与环境气象国家级实验教学示范中心(南京信息工程大学), 江苏 南京 210044
  • 通讯作者: 樊仲欣(1981-),男,江苏南京人,工程师,硕士,主要研究方向:数据挖掘、神经网络。fan_zhong_xin@qq.com
  • 基金资助:
    国家重点研发计划项目(2018YFC1505804)。

Abstract: Focusing on the issues that clustering tendency analysis algorithms based on sampling have instability and one-sidedness in clustering tendecy index, and clustering tendency parameters need to be computed repeatedly because the algorithms do not suit the batch incremental property of data stream, an improved Clustering Tendency Index analysis algorithm based on Minimum Distance Connected Graph (MDCG) was proposed, namely MDCG-CTI, which performs overall analysis on all data. First, MDCG was built with complexity optimization by using stack depth-first traversal to update the nearest path of incremental data; then clustering tendency index was computed to determine the judgment threshold of clustering; finally, the proposed algorithm was integrated with batch incremental Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Experimental results on self-built datasets show that the proposed algorithm has higher accuracy of clusterable determination than existing algorithms for single cluster and data with a large number of noises. And on large datasets pendigits and avila, the proposed algorithm has the time consumption reduced by 38% and 42% compared to Spectral Visual Assessment of cluster Tendency (SpecVAT); meanwhile, the proposed algorithm combined with batch incremental DBSCAN has average accuracy of clustering increased by 6% and 11% and time consumption of clustering reduced by 7% and 8% compared to SpecVAT combined with batch incremental DBSCAN. It can be seen that the proposed algorithm not only determines clustering tendency nonparametrically and accurately, but also improves effectiveness and operational efficiency of incremental clustering.

Key words: clustering tendency, Minimum Distance Connected Graph (MDCG), data stream clustering, batch incremental clustering, Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

摘要: 聚类趋势分析算法基于抽样原理导致聚类趋势指标不稳定和片面,而且不适应数据流的批量增量特性,因而需要重复进行聚类趋势指数计算。为此,基于全体数据进行整体分析,提出一种基于最小距离连通图(MDCG)的聚类趋势分析算法MDCG-CTI。首先,利用栈的深度优先遍历法更新增量数据的最邻近路径从而降低MDCG的建立复杂度;然后,计算聚类趋势指数并确定可聚类性的判定阈值;最后,将所提算法和批量增量的具有噪声的基于密度的聚类方法(DBSCAN)相结合。在自定义数据集上的实验表明,该算法比现有算法对单簇和含大量噪点的数据的可聚类性判断更为精确;而在大数据集pendigits和avila上,所提算法比基于谱方法的聚类趋势可视化分析(SpecVAT)累计耗时降低了38%和42%,且相较SpecVAT结合批量增量DBSCAN,该算法结合批量增量DBSCAN的聚类平均准确率分别提高了6%和11%,聚类累计耗时则分别降低了7%和8%。实验结果表明该算法可以准确无参地判断聚类趋势,并明显提高增量聚类的有效性和运行效率。

关键词: 聚类趋势, 最小距离连通图, 数据流聚类, 批量增量聚类, 具有噪声的基于密度的聚类方法

CLC Number: