计算机应用 ›› 2013, Vol. 33 ›› Issue (11): 3057-3061.

• 数据库技术 • 上一篇    下一篇

加速大规模数据集的离群点检测

薛安荣,闻丹丹,刘彬   

  1. 江苏大学 计算机科学与通信工程学院,江苏 镇江 212013
  • 收稿日期:2013-05-29 修回日期:2013-07-19 出版日期:2013-11-01 发布日期:2013-12-04
  • 通讯作者: 闻丹丹
  • 作者简介:薛安荣(1964-),男,江苏镇江人,教授,博士,CCF高级会员,主要研究方向:数据挖掘、机器学习、数据库;闻丹丹(1986-),女,河南商丘人,硕士研究生,主要研究方向:数据挖掘、离群点检测;刘彬(1987-),女,河南洛阳人,硕士研究生,主要研究方向:数据挖掘、隐私保护。

Speeding up outlier detection in large-scale datasets

XUE Anrong,WEN Dandan,LIU Bin   

  1. School of Computer Science and Communications Engineering, Jiangsu University, Zhenjiang Jiangsu 212013, China
  • Received:2013-05-29 Revised:2013-07-19 Online:2013-12-04 Published:2013-11-01
  • Contact: WEN Dandan

摘要: 针对现有基于距离的离群点检测算法在处理大规模数据时效率低的问题,提出一种基于聚类和索引的分布式离群点检测(DODCI) 算法。首先利用聚类方法将大数据集划分成簇;然后在分布式环境中的各节点处并行创建各个簇的索引;最后使用两个优化策略和两条剪枝规则以循环的方式在各节点处进行离群点检测。在合成数据集和整理后的KDD CUP数据集上的实验结果显示,在数据量较大时该算法比Orca和iDOoR算法快近一个数量级。理论和实验分析表明,该算法可以有效提高大规模数据中离群点的检测效率。

关键词: 离群点, 聚类, 索引, 分布式, 优化策略, 剪枝规则

Abstract: The existing distance-based outlier detection algorithms suffer from low efficiency when dealing with large-scale datasets. To relieve this problem, a distributed outlier detection algorithm based on clustering and indexing (DODCI) was presented. The algorithm partitioned the original dataset into clusters by employing a certain clustering method. Then the index of each cluster was built in parallel on each distributed node. Afterwards, detection of outliers was implemented on each node looply using two optimization strategies and two pruning rules. The experimental results on synthetic dataset and preprocessed KDD CUP datasets show that the proposed algorithm is almost up to an order-of-magnitude faster than the two existing algorithms (Orca and iDOoR) when the dataset is large enough. The theoretical and experimental analyses show that the proposed algorithm can effectively raise the speed of outlier detection in large-scale datasets.

Key words: outlier, clustering, index, distributed, optimization strategy, pruning rule

中图分类号: