Speeding up outlier detection in large-scale datasets

Journal of Computer Applications, 2013, Vol. 33, Issue (11): 3057-3061.

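About the authors: XUE Anrong (1964-), male, born in Zhenjiang, Jiangsu, professor, Ph.D., CCF senior member; research interests: data mining, machine learning, databases. WEN Dandan (1986-), female, born in Shangqiu, Henan, M.S. candidate; research interests: data mining, outlier detection. LIU Bin (1987-), female, born in Luoyang, Henan, M.S. candidate; research interests: data mining, privacy protection.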

Speeding up outlier detection in large-scale datasets

XUE Anrong, WEN Dandan, LIU Bin

School of Computer Science and Communications Engineering, Jiangsu University, Zhenjiang, Jiangsu 212013, China

Received: 2013-05-29 Revised: 2013-07-19 Online: 2013-12-04 Published: 2013-11-01
Contact: WEN Dandan

Abstract

Abstract: Existing distance-based outlier detection algorithms are inefficient on large-scale datasets. To address this problem, a Distributed Outlier Detection algorithm based on Clustering and Indexing (DODCI) is proposed. The algorithm first partitions the original dataset into clusters using a clustering method, then builds an index for each cluster in parallel on the nodes of a distributed environment, and finally detects outliers on each node iteratively using two optimization strategies and two pruning rules. Experimental results on synthetic and preprocessed KDD CUP datasets show that, when the dataset is large, the proposed algorithm is nearly an order of magnitude faster than the Orca and iDOoR algorithms. Theoretical and experimental analyses show that the algorithm effectively improves the efficiency of outlier detection in large-scale datasets.

Key words: outlier, clustering, index, distributed, optimization strategy, pruning rule
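
The cluster-then-prune idea summarized in the abstract can be illustrated with a small single-machine sketch. It is not the DODCI algorithm itself: the abstract does not spell out the two optimization strategies, the two pruning rules, the index structure, or the distribution scheme. The sketch assumes a top-n definition of outliers (largest distance to the k-th nearest neighbour) and uses the well-known Orca-style cutoff rule as a stand-in for pruning. The function names `assign_clusters` and `top_n_outliers`, the fixed cluster centers, and the test data are all hypothetical.

```python
import heapq
import math
import random


def assign_clusters(points, centers):
    """Assign each point index to its nearest center (stand-in for the clustering step)."""
    clusters = [[] for _ in centers]
    for i, p in enumerate(points):
        nearest = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        clusters[nearest].append(i)
    return clusters


def top_n_outliers(points, clusters, k, n):
    """Return the n points with the largest k-th nearest-neighbour distance.

    Pruning (Orca-style cutoff, an assumption of this sketch): the running
    k-th NN distance of a candidate is an upper bound on its final score, so
    once it drops below the score of the weakest outlier kept so far, the
    candidate cannot enter the top n and its scan stops early.
    """
    top = []      # min-heap of (score, index); top[0] is the weakest outlier kept
    cutoff = 0.0
    for members in clusters:
        # Scan the candidate's own cluster first: close neighbours are found
        # early, so the cutoff test tends to fire sooner.
        others = [i for c in clusters if c is not members for i in c]
        order = members + others
        for i in members:
            knn = []  # negated distances: a max-heap of the k smallest so far
            pruned = False
            for j in order:
                if j == i:
                    continue
                d = math.dist(points[i], points[j])
                if len(knn) < k:
                    heapq.heappush(knn, -d)
                elif d < -knn[0]:
                    heapq.heapreplace(knn, -d)
                if len(knn) == k and -knn[0] < cutoff:
                    pruned = True
                    break
            if pruned or len(knn) < k:
                continue
            score = -knn[0]
            if len(top) < n:
                heapq.heappush(top, (score, i))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, i))
            if len(top) == n:
                cutoff = top[0][0]
    return sorted(top, reverse=True)


# Two dense Gaussian blobs plus two planted far-away points (indices 400, 401).
random.seed(0)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
points += [(random.gauss(10, 1), random.gauss(10, 1)) for _ in range(200)]
points += [(50.0, 50.0), (-40.0, 30.0)]
clusters = assign_clusters(points, centers=[(0.0, 0.0), (10.0, 10.0)])
result = top_n_outliers(points, clusters, k=5, n=2)
print([i for _, i in result])
```

On this toy data the two planted points are the ones reported. In a distributed setting of the kind the abstract describes, each cluster would presumably be indexed and scanned on its own node, with the cutoff shared between nodes; that part is left out here.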

XUE Anrong, WEN Dandan, LIU Bin. Speeding up outlier detection in large-scale datasets[J]. Journal of Computer Applications, 2013, 33(11): 3057-3061.
