Outlier detection algorithm based on neighborhood value difference metric

doi:10.11772/j.issn.1001-9081.2017123028

Journal of Computer Applications, 2018, Vol. 38, Issue (7): 1905-1909. DOI: 10.11772/j.issn.1001-9081.2017123028


YUAN Zhong (袁钟), FENG Shan (冯山)

College of Mathematics and Software Science, Sichuan Normal University, Chengdu, Sichuan 610068, China

Received: 2017-12-25 Revised: 2018-02-07 Online: 2018-07-12 Published: 2018-07-10
Corresponding author: FENG Shan
About the authors: YUAN Zhong (born 1991), male, from Jingyan, Sichuan, master's student; research interests: rough sets, data mining. FENG Shan (born 1967), male, from Fengdu, Chongqing, professor, Ph.D.; research interests: rough sets, data mining.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61673285), the Sichuan Youth Science and Technology Foundation (2017JQ0046), and the Scientific Research Project of Sichuan Provincial Education Department (15ZB0029).

Abstract: Traditional distance-based methods cannot effectively handle symbolic attributes, and classical rough set methods cannot effectively handle numerical attributes. To address both problems, an improved Neighborhood Value Difference Metric (NVDM) method for outlier detection was proposed by exploiting the granulation features of neighborhood rough sets. Firstly, attribute values were normalized, and a Neighborhood Information System (NIS) was constructed from an optimized Heterogeneous Euclidean-Overlap Metric (HEOM) and a neighborhood radius with adaptive characteristics. Secondly, the Neighborhood Outlier Factor (NOF) of each data object was constructed from the NVDM. Finally, a Neighborhood Value Difference Metric-based Outlier Detection (NVDMOD) algorithm was designed and implemented; when computing the Single Attribute Neighborhood Cover (SANC), it replaces the traditional unordered one-by-one computation with ordered binary search and nearest-neighbor search. NVDMOD was compared on UCI standard data sets with three existing outlier detection algorithms: the NEighborhood outlier Detection (NED) algorithm, the DIStance-based outlier detection (DIS) algorithm, and the K-Nearest Neighbor (KNN) algorithm. The experimental results show that NVDMOD has better adaptability and effectiveness, and that it provides a more effective new approach to outlier detection on mixed-attribute data sets.

Key words: outlier detection, neighborhood rough set, Neighborhood Value Difference Metric (NVDM), mixed attribute, data mining
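
The first step of the abstract (normalization, HEOM, and neighborhoods found by ordered search) can be illustrated with a minimal Python sketch. This is not the authors' NVDMOD implementation. The min-max normalization, the function names (`min_max_normalize`, `heom`, `numeric_neighborhoods`), and the fixed `radius` argument are assumptions made for illustration. The abstract does not state the adaptive radius rule or the NVDM and NOF formulas, so they are left out. HEOM is used in its standard form: overlap distance for symbolic attributes and absolute difference for normalized numeric ones.

```python
from bisect import bisect_left, bisect_right
from math import sqrt


def min_max_normalize(values):
    """Scale one numeric attribute to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no distance information.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def heom(x, y, is_nominal):
    """Standard HEOM between two objects with numeric attributes in [0, 1].

    is_nominal[i] is True when attribute i is symbolic.
    A missing value (None) counts as distance 1 on that attribute.
    """
    total = 0.0
    for xv, yv, nominal in zip(x, y, is_nominal):
        if xv is None or yv is None:
            d = 1.0
        elif nominal:
            d = 0.0 if xv == yv else 1.0  # overlap metric
        else:
            d = abs(xv - yv)  # range is 1 after normalization
        total += d * d
    return sqrt(total)


def numeric_neighborhoods(values, radius):
    """Neighborhood of every object on ONE numeric attribute.

    Sort once, then locate each neighborhood's boundaries by binary
    search instead of comparing every pair of objects.
    Returns, per object, the sorted indices of objects within `radius`.
    `radius` is a plain parameter here (hypothetical stand-in for the
    adaptive radius mentioned in the abstract).
    """
    order = sorted(range(len(values)), key=values.__getitem__)
    sorted_vals = [values[i] for i in order]
    result = []
    for v in values:
        left = bisect_left(sorted_vals, v - radius)
        right = bisect_right(sorted_vals, v + radius)
        result.append(sorted(order[left:right]))
    return result


if __name__ == "__main__":
    vals = [0.5, 0.0, 0.9, 0.05, 0.1]
    for i, nb in enumerate(numeric_neighborhoods(vals, 0.2)):
        print(i, nb)
```

With `vals = [0.5, 0.0, 0.9, 0.05, 0.1]` and `radius = 0.2`, objects 1, 3 and 4 fall into each other's neighborhoods, while objects 0 and 2 have only themselves as neighbors. Sorting once and bisecting per object needs roughly n log n comparisons to locate the neighborhood boundaries on an attribute, instead of the n² comparisons of an unordered pairwise scan.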

CLC number: TP274

YUAN Zhong, FENG Shan. Outlier detection algorithm based on neighborhood value difference metric[J]. Journal of Computer Applications, 2018, 38(7): 1905-1909.

