Abstract: Traditional distance measures cannot effectively process data sets with symbolic attributes, and the classical rough set method cannot effectively process data sets with numerical attributes. To address both problems, an improved Neighborhood Value Difference Metric (NVDM) method for outlier detection was proposed that exploits the granulation features of the neighborhood rough set. Firstly, after the attribute values were normalized, a Neighborhood Information System (NIS) was constructed based on an optimized Heterogeneous Euclidean-Overlap Metric (HEOM) and an adaptive neighborhood radius. Secondly, the Neighborhood Outlier Factor (NOF) of each data object was constructed based on the NVDM. Finally, a Neighborhood Value Difference Metric-based Outlier Detection (NVDMOD) algorithm was designed and implemented. When computing the Single Attribute Neighborhood Cover (SANC), it replaces the traditional unordered one-by-one comparison with an ordered binary search and nearest-neighbor search. The NVDMOD algorithm was analyzed and compared with existing outlier detection algorithms, including the NEighborhood outlier Detection (NED) algorithm, the DIStance-based outlier detection (DIS) algorithm and the K-Nearest Neighbor (KNN) algorithm, on UCI standard data sets. The experimental results show that the NVDMOD algorithm has better adaptability and effectiveness than the compared algorithms, and that it provides a more effective new method for outlier detection on mixed attribute data sets.
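To make two of the building blocks named above more concrete, the following is a minimal sketch, not the implementation used in this work. It assumes the standard HEOM (overlap distance for symbolic attributes, range-normalized absolute difference for numerical ones) rather than the optimized variant, and a fixed radius rather than the adaptive one. The second function illustrates how a sorted order plus binary search can collect the neighbors of each value on a single numerical attribute without comparing every pair. The names `heom` and `numeric_neighborhoods` and all parameters are hypothetical.

```python
import math
from bisect import bisect_left, bisect_right


def heom(x, y, nominal, ranges):
    """Standard HEOM distance between two records (an assumption; the
    optimized variant from the abstract is not reproduced here).

    nominal: set of attribute indices treated as symbolic.
    ranges:  per-attribute (max - min) for numerical attributes;
             entries for symbolic attributes are never read.
    """
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if xa is None or ya is None:
            d = 1.0                      # missing value: maximal distance
        elif a in nominal:
            d = 0.0 if xa == ya else 1.0  # overlap metric
        else:
            d = abs(xa - ya) / ranges[a] if ranges[a] else 0.0
        total += d * d
    return math.sqrt(total)


def numeric_neighborhoods(values, radius):
    """For each value, return the indices of all values within `radius`.

    The values are sorted once; each neighborhood is then a contiguous
    slice of the sorted order, located with two binary searches.
    """
    order = sorted(range(len(values)), key=values.__getitem__)
    sorted_vals = [values[i] for i in order]
    result = []
    for v in values:
        lo = bisect_left(sorted_vals, v - radius)
        hi = bisect_right(sorted_vals, v + radius)
        result.append(sorted(order[lo:hi]))
    return result


if __name__ == "__main__":
    # One numerical attribute (range 4.0) and one symbolic attribute.
    print(heom((1.0, "a"), (3.0, "b"), nominal={1}, ranges=[4.0, None]))
    print(numeric_neighborhoods([0.1, 0.5, 0.15, 0.9], radius=0.1))
```

With n objects, sorting costs O(n log n) and each neighborhood lookup costs O(log n) plus the size of the neighborhood, compared with O(n) per object for an unordered scan.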