Journal of Computer Applications, 2020, Vol. 40, Issue (5): 1322-1328. DOI: 10.11772/j.issn.1001-9081.2019101708

• Data science and technology •

Outlier detection algorithm based on graph random walk

DU Xusheng1, YU Jiong1,2, YE Lele3, CHEN Jiaying2   

  1. School of Software, Xinjiang University, Urumqi, Xinjiang 830008, China
  2. College of Information Science and Engineering, Xinjiang University, Urumqi, Xinjiang 830046, China
  3. School of Software Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China
  • Received: 2019-10-10 Revised: 2019-12-12 Online: 2020-05-10 Published: 2020-05-15
  • Contact: YU Jiong, born in 1964, Ph. D., professor. His research interests include green computing, machine learning, and data mining.
  • About author:

    DU Xusheng, born in 1995, M. S. candidate. His research interests include machine learning and data mining.
    YU Jiong, born in 1964, Ph. D., professor. His research interests include green computing, machine learning, and data mining.
    YE Lele, born in 1993, M. S. candidate. His research interests include machine learning and data mining.
    CHEN Jiaying, born in 1988, Ph. D. candidate. Her research interests include recommender systems and data mining.
  • Supported by:

    This work is partially supported by the National Natural Science Foundation of China (61862060, 61462079, 61562086, 61562078).

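Outlier detection algorithm based on random walk on graphs

DU Xusheng1, YU Jiong1,2, YE Lele3, CHEN Jiaying2

  1. School of Software, Xinjiang University, Urumqi 830008
  2. College of Information Science and Engineering, Xinjiang University, Urumqi 830046
  3. School of Software Engineering, Xi’an Jiaotong University, Xi’an 710049
  • Corresponding author: YU Jiong (born 1964)
  • About the authors:

    DU Xusheng (born 1995), male, from Qingyang, Gansu; M. S. candidate; CCF member. Main research interests: machine learning, data mining.
    YU Jiong (born 1964), male, from Beijing; professor, doctoral supervisor, Ph. D. Main research interests: green computing, machine learning, data mining.
    YE Lele (born 1993), male, from Suizhou, Hubei; M. S. candidate. Main research interests: machine learning, data mining.
    CHEN Jiaying (born 1988), female, from Shawan, Xinjiang; Ph. D. candidate. Main research interests: recommender systems, data mining.
  • Funding:

    National Natural Science Foundation of China (61862060, 61462079, 61562086, 61562078).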

Abstract:

Outlier detection algorithms are widely used in fields such as network intrusion detection and computer-aided medical diagnosis. The Local Distance-Based Outlier Factor (LDOF), Cohesiveness-Based Outlier Factor (CBOF) and Local Outlier Factor (LOF) algorithms are classic outlier detection algorithms, but they suffer from long execution times and low detection rates on large-scale and high-dimensional datasets. To address these problems, an outlier detection algorithm Based on Graph Random Walk (BGRW) was proposed. Firstly, the number of iterations, the damping factor and the outlier degree of every object in the dataset were initialized. Then, the transition probability of the random walker between objects was derived from the Euclidean distance between the objects, and the outlier degree of every object in the dataset was calculated iteratively. Finally, the objects with the highest outlier degree were output as outliers. On real datasets from the UCI (University of California, Irvine) repository and on synthetic datasets with complex distributions, BGRW was compared with the LDOF, CBOF and LOF algorithms in terms of detection rate, execution time and false positive rate. The experimental results show that BGRW reduces execution time and false positive rate and achieves a higher detection rate than the compared algorithms.

Key words: data mining, outlier detection, Markov chain, random walk, Local Distance-Based Outlier Factor (LDOF), Cohesiveness-Based Outlier Factor (CBOF), Local Outlier Factor (LOF)
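
The four steps described in the abstract (initialization, distance-based transition probabilities, iterative scoring, selection of the top-scoring objects) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the transition formula or the update rule, so the sketch assumes that the walker moves from object i to object j with probability proportional to their Euclidean distance, and that scores follow a PageRank-style damped update. The function name `bgrw_sketch` and its default parameter values are hypothetical.

```python
import math


def bgrw_sketch(points, n_outliers, iterations=100, damping=0.85):
    """Rank objects by a damped random-walk score.

    Returns (indices of the n_outliers highest-scoring objects, all scores).
    """
    n = len(points)

    # Pairwise Euclidean distances between objects.
    dist = [[math.dist(p, q) for q in points] for p in points]

    # ASSUMPTION (not stated in the abstract): the transition probability
    # from i to j is the distance d(i, j) normalised over row i, so the
    # walker tends to jump towards objects that are far away.
    trans = []
    for row in dist:
        total = sum(row)
        if total > 0:
            trans.append([d / total for d in row])
        else:
            # All objects coincide: fall back to a uniform row.
            trans.append([1.0 / n] * n)

    # Step 1: every object starts with the same outlier degree.
    score = [1.0 / n] * n

    # Step 3: ASSUMED PageRank-style update with a damping factor.
    for _ in range(iterations):
        score = [
            (1.0 - damping) / n
            + damping * sum(score[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]

    # Step 4: objects with the highest outlier degree are reported.
    ranked = sorted(range(n), key=lambda i: score[i], reverse=True)
    return ranked[:n_outliers], score


if __name__ == "__main__":
    # Five clustered points plus one point far from the cluster.
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
    top, scores = bgrw_sketch(data, n_outliers=1)
    print(top, [round(s, 3) for s in scores])
```

Under these assumptions an isolated object receives a large share of every other object's transition probability, so its score grows across iterations while the scores of clustered objects stay low. The dense distance matrix makes this sketch quadratic in the number of objects, which says nothing about the cost of the published algorithm.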

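Abstract (translated from the Chinese version):

Outlier detection algorithms are widely used in fields such as network intrusion detection and computer-aided medical diagnosis. To address the long execution time and relatively low detection rate of the LDOF, CBOF and LOF algorithms on large-scale and high-dimensional datasets, an outlier detection algorithm Based on Graph Random Walk (BGRW) is proposed. First, the number of iterations, the damping factor and the outlier degree of every object in the dataset are initialized. Second, the transition probability of the walker between objects is derived from the Euclidean distance between the objects. Then, the outlier degree of every object in the dataset is computed iteratively. Finally, the objects with the highest outlier degree are identified as outliers and output. Experiments were conducted on real UCI datasets and on synthetic datasets with complex distributions, comparing BGRW with LDOF, CBOF and LOF in terms of execution time, detection rate and false positive rate. The experimental results show that BGRW effectively reduces execution time and outperforms the compared algorithms in both detection rate and false positive rate.

Key words: data mining, outlier detection, Markov chain, random walk, LDOF, CBOF, LOF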

CLC Number: