Abstract: To address the inefficiency and poor adaptability of existing grid-based outlier mining algorithms, this paper proposes an outlier mining algorithm based on data partitioning and grids. First, the data set is partitioned. Second, non-outliers are filtered out cell by cell, and the intermediate results are stored temporarily. Third, an improved Cell Dimension Tree (CD-Tree) is built to maintain the spatial information of the retained data. Next, non-outliers are filtered out at the micro-cell level, and this step is made efficient by two optimization strategies. Finally, the remaining data are mined point by point to obtain the outlier set. Theoretical analysis and experimental results show that the method is feasible and effective, and that it scales better to massive, high-dimensional data.
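The cell-level filtering followed by point-level mining can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not state the outlier definition, so the sketch assumes a distance-based one (a point is an outlier if fewer than `k` other points lie within distance `r`), and it omits data partitioning, the CD-Tree, and the micro-cell stage entirely. The function name `grid_filter_outliers` and the parameters `r` and `k` are hypothetical.

```python
import math
from collections import defaultdict


def grid_filter_outliers(points, r, k):
    """Return points that have fewer than k other points within distance r.

    Assumed (not from the paper): a distance-based outlier definition.
    """
    if not points:
        return []
    d = len(points[0])
    # Choose the cell side so that the cell diagonal equals r; any two
    # points in the same cell are then closer than r to each other.
    side = r / math.sqrt(d)

    # Assign every point index to its grid cell.
    cells = defaultdict(list)
    for i, p in enumerate(points):
        key = tuple(math.floor(x / side) for x in p)
        cells[key].append(i)

    # Stage 1: filter by cell. A cell holding more than k points gives each
    # of its points at least k neighbours, so none of them can be outliers.
    candidates = []
    for members in cells.values():
        if len(members) > k:
            continue
        candidates.extend(members)

    # Stage 2: mine by data point. Brute-force neighbour count over the
    # surviving candidates only (a simplification for illustration).
    outliers = []
    for i in sorted(candidates):
        p = points[i]
        neighbours = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= r
        )
        if neighbours < k:
            outliers.append(p)
    return outliers


cluster = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.2, 0.2), (0.15, 0.15)]
result = grid_filter_outliers(cluster + [(10.0, 10.0)], r=1.0, k=3)
```

In this toy input the five clustered points share one cell and are discarded in stage 1 without any distance computation, so only the isolated point reaches the point-level check. That early discard of dense cells is the source of the efficiency gain in grid-based approaches.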