Journal of Computer Applications ›› 2013, Vol. 33 ›› Issue (01): 202-206. DOI: 10.3724/SP.J.1087.2013.00202

• Artificial intelligence •

Data stream clustering algorithm based on dependent function

PAN Lina, WANG Zhihe, DANG Hui

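  1. School of Computer Science and Engineering, Northwest Normal University, Lanzhou, Gansu 730070, China
  • Received: 2012-07-24 Revised: 2012-08-27 Online: 2013-01-01 Published: 2013-01-09
  • Corresponding author: PAN Lina
  • About the authors: PAN Lina (born 1984), female, from Wuhan, Hubei, master's candidate; main research interest: data mining. WANG Zhihe (born 1965), male, from Wuwei, Gansu, professor; main research interests: database technology and data mining. DANG Hui (born 1988), female, from Liujiaxia, Gansu, master's candidate; main research interest: data mining.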
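  • Supported by:

    Gansu Province Science and Technology Support Program (090GKCA075); 2012 Humanities and Social Sciences Research Project of the Ministry of Education (12YJCZH282)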

Abstract: Traditional data stream clustering algorithms are mostly based on distance or density, and both their clustering quality and their processing efficiency are low. To address these problems, a data stream clustering algorithm based on the dependent function is proposed. First, data points are modeled as matter-elements, and the dependent function needed to solve the problem is established. Second, the value of the dependent function is calculated, and its magnitude is used to judge the degree to which a data point belongs to a given cluster. Then, the proposed method is applied to the online-offline framework of data stream clustering. Finally, the algorithm is tested on the real data set KDD-CUP99 and on randomly generated artificial data sets. The experimental results show that the clustering purity of the proposed method is above 92% and that it can process about 6300 records per second. Compared with traditional algorithms, its processing efficiency is greatly improved, it scales well with the dimensionality and the number of clusters, and it is suitable for processing large-scale dynamic data sets.

Key words: data stream, clustering, matter-element, dependent function, classical domain, joint domain
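
The abstract does not reproduce the dependent function itself. As a minimal sketch, assuming the elementary dependent function commonly used in extension theory (each attribute of a cluster has a classical domain strictly nested inside a joint domain), the degree to which a data point belongs to a cluster could be scored as below. The function names, the averaging over dimensions, and the example intervals are illustrative assumptions, not the authors' definitions.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]


def extension_distance(x: float, interval: Interval) -> float:
    """rho(x, <a, b>) = |x - (a + b) / 2| - (b - a) / 2.

    Negative inside the interval, zero at its endpoints, positive outside.
    """
    a, b = interval
    return abs(x - (a + b) / 2.0) - (b - a) / 2.0


def dependent_value(x: float, classical: Interval, joint: Interval) -> float:
    """Elementary dependent function K(x) for one attribute (assumed form).

    K(x) >= 0 inside the classical domain, -1 <= K(x) < 0 between the
    classical and joint domains, and K(x) < -1 outside the joint domain.
    """
    (a0, b0), (a, b) = classical, joint
    if not (a < a0 < b0 < b):
        raise ValueError("classical domain must be strictly inside joint domain")
    rho_classical = extension_distance(x, classical)
    if rho_classical <= 0.0:
        # x lies in the classical domain: scale by the interval width.
        return -rho_classical / (b0 - a0)
    rho_joint = extension_distance(x, joint)
    # Outside the classical domain the denominator is always negative.
    return rho_classical / (rho_joint - rho_classical)


def membership(point: Sequence[float],
               classical: Sequence[Interval],
               joint: Sequence[Interval]) -> float:
    """Average the per-attribute dependent values (an assumed aggregation)."""
    values = [dependent_value(x, c, j) for x, c, j in zip(point, classical, joint)]
    return sum(values) / len(values)


def assign(point: Sequence[float],
           clusters: Sequence[Tuple[Sequence[Interval], Sequence[Interval]]]) -> int:
    """Return the index of the cluster with the largest membership score."""
    scores: List[float] = [membership(point, c, j) for c, j in clusters]
    return max(range(len(scores)), key=scores.__getitem__)


# Two hypothetical 2-D clusters, each given as (classical domains, joint domains).
clusters = [
    ([(0.0, 2.0), (0.0, 2.0)], [(-2.0, 4.0), (-2.0, 4.0)]),
    ([(5.0, 7.0), (5.0, 7.0)], [(3.0, 9.0), (3.0, 9.0)]),
]
print(assign((1.0, 1.0), clusters))
print(assign((6.5, 5.5), clusters))
```

In an online-offline setting, a score of this kind would be evaluated for each arriving point against the current cluster summaries; how the paper maintains those summaries is not stated in the abstract.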

CLC number: