Outlier mining algorithm based on data partitioning and grid

doi:10.3724/SP.J.1087.2012.02193

Journal of Computer Applications, 2012, Vol. 32, Issue (08): 2193-2197. DOI: 10.3724/SP.J.1087.2012.02193

TANG Cheng-long, XING Chang-zheng

College of Electronics and Information Engineering, Liaoning Technical University, Huludao Liaoning 125105, China

Received: 2012-02-06 Revised: 2012-03-18 Online: 2012-08-28 Published: 2012-08-01
Corresponding author: TANG Cheng-long
About the authors: TANG Cheng-long (born 1985), male, from Linyi, Shandong, master's student; main research interests: data mining and data stream clustering.
XING Chang-zheng (born 1967), male, from Fuxin, Liaoning, professor and doctoral supervisor; main research interests: databases, data mining, and data stream clustering.

Abstract

Abstract: Existing grid-based outlier mining algorithms suffer from low mining efficiency and adapt poorly to large datasets. To address these problems, this paper proposes an outlier mining algorithm based on data partitioning and grid. The algorithm first partitions the data, filters out non-outliers cell by cell, and temporarily stores the intermediate results. It then uses an improved Cell Dimension Tree (CD-Tree) structure to maintain the spatial information of the data points, filters out non-outliers micro-cell by micro-cell, and applies two optimization strategies to keep these operations efficient. Finally, it mines outliers point by point to obtain the outlier set. Theoretical analysis and experimental results show that the method is feasible and effective, and that it scales better to large and high-dimensional datasets.

Keywords: data mining, outlier data, grid, data partitioning, cell, micro-cell, Cell Dimension Tree (CD-Tree)
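
The abstract does not state the outlier definition, the cell size, or the thresholds the paper uses, so the following is only a minimal Python sketch of the general idea of filtering non-outliers by cell and then checking the surviving points one by one. It assumes a distance-based definition (a point is an outlier if fewer than `min_pts` other points lie within `radius`) and a cell side of `radius / (2 * sqrt(d))`. Both are chosen for illustration and are not taken from the paper. Data partitioning, micro-cells, and the CD-Tree are omitted, and the name `grid_outliers` and its parameters are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def grid_outliers(points, radius, min_pts):
    """Return sorted indices of points that have fewer than `min_pts`
    other points within `radius` (assumed distance-based definition)."""
    if not points:
        return []
    dim = len(points[0])
    # Assumed cell side: the cell diagonal is radius / 2, so any point in a
    # cell is within `radius` of every point in that cell and in the cells
    # directly adjacent to it.
    side = radius / (2 * math.sqrt(dim))

    # Map each point index to its grid cell (integer coordinates).
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(math.floor(x / side) for x in p)].append(idx)

    offsets = list(product((-1, 0, 1), repeat=dim))

    # Stage 1: cell-level filtering. If a cell and its adjacent cells hold
    # enough points, every point in the cell is a non-outlier.
    candidates = []
    for key, members in cells.items():
        nearby = sum(
            len(cells.get(tuple(k + o for k, o in zip(key, off)), ()))
            for off in offsets
        )
        if nearby - 1 >= min_pts:
            continue
        candidates.extend(members)

    # Stage 2: point-level check on the remaining candidates only.
    outliers = []
    for idx in candidates:
        count = 0
        for j, q in enumerate(points):
            if j != idx and math.dist(points[idx], q) <= radius:
                count += 1
                if count >= min_pts:
                    break
        if count < min_pts:
            outliers.append(idx)
    return sorted(outliers)


# Usage: five tightly clustered points and one isolated point.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (10.0, 10.0)]
print(grid_outliers(pts, radius=1.0, min_pts=3))  # -> [5]
```

In this sketch the cell-level stage discards the whole cluster without computing any pairwise distance, so only the isolated point reaches the point-level stage. The number of adjacent cells grows as 3^d, which is one reason a plain grid like this becomes costly in high dimensions.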

CLC number: TP311.13

TANG Cheng-long, XING Chang-zheng. Outlier mining algorithm based on data-partitioning and grid[J]. Journal of Computer Applications, 2012, 32(08): 2193-2197.


