Journal of Computer Applications, 2018, Vol. 38, Issue (7): 1905-1909. DOI: 10.11772/j.issn.1001-9081.2017123028

• Data Science and Technology •

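  • Corresponding author: FENG Shan
  • About the authors: YUAN Zhong (born 1991), male, from Jingyan, Sichuan, master's student; research interests: rough sets, data mining. FENG Shan (born 1967), male, from Fengdu, Chongqing, professor, Ph.D.; research interests: rough sets, data mining.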
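  • Funding:
    National Natural Science Foundation of China (61673285); Sichuan Youth Science and Technology Foundation (2017JQ0046); Key Natural Science Foundation of the Sichuan Provincial Education Department (15ZB0029).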

Outlier detection algorithm based on neighborhood value difference metric

YUAN Zhong, FENG Shan   

  1. College of Mathematics and Software Science, Sichuan Normal University, Chengdu, Sichuan 610068, China
  • Received: 2017-12-25 Revised: 2018-02-07 Online: 2018-07-10 Published: 2018-07-12
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61673258), the Sichuan Youth Science and Technology Foundation (2017JQ0046), and the Scientific Research Project of Sichuan Provincial Education Department (15ZB0029).

Abstract: Traditional distance-based methods cannot effectively process symbolic attributes, and classical rough set methods cannot effectively process numerical attributes. To address both problems, an improved Neighborhood Value Difference Metric (NVDM) method was proposed for outlier detection by utilizing the granulation features of neighborhood rough sets. Firstly, attribute values were normalized, and a Neighborhood Information System (NIS) was constructed based on the optimized Heterogeneous Euclidean-Overlap Metric (HEOM) and a neighborhood radius with adaptive characteristics. Secondly, the Neighborhood Outlier Factor (NOF) of each data object was constructed based on the NVDM. Finally, a Neighborhood Value Difference Metric-based Outlier Detection (NVDMOD) algorithm was designed and implemented; in computing the Single Attribute Neighborhood Cover (SANC), it replaces the traditional unordered one-by-one computation with the ideas of ordered bisection and nearest-neighbor search. The NVDMOD algorithm was compared with existing outlier detection algorithms, including the NEighborhood outlier Detection (NED) algorithm, the DIStance-based outlier detection (DIS) algorithm and the K-Nearest Neighbor (KNN) algorithm, on UCI standard data sets. The experimental results show that the NVDMOD algorithm has better adaptability and effectiveness, and that it provides a more effective new method for outlier detection on data sets with mixed attributes.
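
The abstract names its building blocks without giving formulas, so the following Python sketch only illustrates the general ideas and is not the authors' NVDMOD implementation. It shows min-max normalization, the standard HEOM distance for mixed attributes (overlap distance for symbolic values, absolute difference for normalized numerical values, distance 1 for missing values), and one plausible reading of the ordered-bisection idea: sorting a numerical attribute once and locating each object's neighborhood with binary search. The function names and the fixed `radius` parameter are assumptions made for illustration; the paper's adaptive radius, its NVDM and its NOF are not reproduced here.

```python
from bisect import bisect_left, bisect_right
from math import sqrt


def min_max_normalize(column):
    """Scale a numerical column to [0, 1]; a constant column maps to zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def heom(x, y, is_numeric):
    """Standard HEOM distance between two objects with mixed attributes.

    Numerical values are assumed to be normalized to [0, 1] already, so the
    per-attribute range is 1. None marks a missing value (distance 1).
    """
    total = 0.0
    for a, b, numeric in zip(x, y, is_numeric):
        if a is None or b is None:
            d = 1.0
        elif numeric:
            d = abs(a - b)
        else:
            d = 0.0 if a == b else 1.0  # overlap metric for symbolic values
        total += d * d
    return sqrt(total)


def single_attribute_neighborhoods(values, radius):
    """Neighborhood of each object on one numerical attribute.

    Sorts the values once, then finds every object's neighbors within
    `radius` by binary search instead of comparing all pairs one by one.
    `radius` is a hypothetical fixed parameter, not the adaptive radius
    described in the paper.
    """
    order = sorted(range(len(values)), key=values.__getitem__)
    sorted_vals = [values[i] for i in order]
    neighborhoods = []
    for v in values:
        left = bisect_left(sorted_vals, v - radius)
        right = bisect_right(sorted_vals, v + radius)
        neighborhoods.append({order[k] for k in range(left, right)})
    return neighborhoods
```

For example, `single_attribute_neighborhoods([0.0, 0.1, 0.5, 0.55, 1.0], 0.2)` places the last object in a neighborhood containing only itself, which is the kind of isolation an outlier factor would then score. Sorting costs O(n log n) and each lookup O(log n), whereas pairwise comparison needs O(n^2) distance checks per attribute.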

Key words: outlier detection, neighborhood rough set, Neighborhood Value Difference Metric (NVDM), mixed attribute, data mining
