Journal of Computer Applications ›› 2012, Vol. 32 ›› Issue (08): 2193-2197. DOI: 10.3724/SP.J.1087.2012.02193

• Database technology •

Outlier mining algorithm based on data-partitioning and grid

TANG Cheng-long, XING Chang-zheng

  1. College of Electronics and Information Engineering, Liaoning Technical University, Huludao Liaoning 125105, China
  • Received: 2012-02-06 Revised: 2012-03-18 Online: 2012-08-28 Published: 2012-08-01
  • Contact: TANG Cheng-long

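Original title (Chinese): 基于数据分区和网格的离群点挖掘算法

  • About the authors: TANG Cheng-long (唐成龙, born 1985), male, from Linyi, Shandong; master's student; research interests: data mining and data stream clustering.
    XING Chang-zheng (邢长征, born 1967), male, from Fuxin, Liaoning; professor and doctoral supervisor; research interests: databases, data mining and data stream clustering.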

Abstract: Existing grid-based outlier mining algorithms are inefficient and adapt poorly to large data sets. To address these problems, this paper proposes an outlier mining algorithm based on data partitioning and grid. First, the data are partitioned, non-outliers are filtered out cell by cell, and the intermediate results are stored temporarily. Next, an improved Cell Dimension Tree (CD-Tree) is built to maintain the spatial information of the retained data points; non-outliers are then filtered out micro-cell by micro-cell, with two optimization strategies making this step efficient. Finally, the remaining data are mined point by point to obtain the outlier set. Theoretical analysis and experimental results show that the method is feasible and effective, and that it has better scalability for massive and high-dimensional data.

Key words: data mining, outlier data, grid, data partitioning, cell, micro-cell, Cell Dimension Tree (CD-Tree)
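
The abstract does not give the algorithm's details, so the following is only a minimal Python sketch of the general idea of filtering non-outliers by grid cell before checking individual points. It uses the classic cell-based pruning rule for distance-based outliers, not the authors' method, and it omits the data partitioning, the CD-Tree and the micro-cell stage. The function name `grid_outliers`, the distance threshold `D`, the neighbor limit `M` and the outlier definition in the docstring are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from itertools import product


def grid_outliers(points, D, M):
    """Return points with at most M points (itself included) within distance D.

    Assumed definition for this sketch; the paper's own criterion is not
    stated in the abstract.
    """
    k = len(points[0])
    # Cell side chosen so that any two points in the same cell or in
    # directly adjacent cells are less than D apart.
    side = D / (2 * math.sqrt(k))

    # Stage 1: assign every point to its grid cell.
    cells = defaultdict(list)
    for p in points:
        key = tuple(math.floor(x / side) for x in p)
        cells[key].append(p)

    # Stage 2: filter by cell. If a cell plus its adjacent cells hold more
    # than M points, none of the cell's points can be an outlier.
    offsets = list(product((-1, 0, 1), repeat=k))
    candidates = []
    for key, members in cells.items():
        nearby = sum(
            len(cells.get(tuple(c + o for c, o in zip(key, off)), ()))
            for off in offsets
        )
        if nearby > M:
            continue  # whole cell filtered out as non-outliers
        candidates.extend(members)

    # Stage 3: check the surviving points one by one.
    return [
        p for p in candidates
        if sum(1 for q in points if math.dist(p, q) <= D) <= M
    ]


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.0)]
    print(grid_outliers(data, D=1.0, M=1))
```

In this sketch the cell-level filter discards dense regions without any distance computation, so the point-by-point check runs only on points from sparse cells. Enumerating all adjacent cells costs 3^k lookups per cell, which becomes expensive as the dimension k grows.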

