Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (6): 1705-1712.DOI: 10.11772/j.issn.1001-9081.2022060930

• The 37 CCF National Conference of Computer Applications (CCF NCCA 2022) • Previous Articles     Next Articles

Outlier detection algorithm based on hologram stationary distribution factor

Zhongping ZHANG1,2,3(), Xin GUO1, Yuting ZHANG1, Ruibo ZHANG4   

  1. 1.School of Information Science and Engineering,Yanshan University,Qinhuangdao Hebei 066004,China
    2.Key Laboratory for Computer Virtual Technology and System Integration of Hebei Province (Yanshan University),Qinhuangdao Hebei 066004,China
    3.Key Laboratory of Software Engineering of Hebei Province (Yanshan University),Qinhuangdao Hebei 066004,China
    4.School of International Education,Wuhan University of Technology,Wuhan Hubei 430070,China
  • Received:2022-06-28 Revised:2022-08-10 Accepted:2022-08-12 Online:2022-09-22 Published:2023-06-10
  • Contact: Zhongping ZHANG
  • About author:GUO Xin, born in 1997, M. S. candidate. Her research interests include data mining.
    ZHANG Yuting, born in 1996, M. S. His research interests include data mining.
    ZHANG Ruibo, born in 2002. His research interests include data mining, energy big data.
  • Supported by:
    National Natural Science Foundation of China(61972334)

基于全息图平稳分布因子的离群点检测算法

张忠平1,2,3(), 郭鑫1, 张玉停1, 张睿博4   

  1. 1.燕山大学 信息科学与工程学院, 河北 秦皇岛 066004
    2.河北省计算机虚拟技术与系统集成重点实验室(燕山大学), 河北 秦皇岛 066004
    3.河北省软件工程重点实验室(燕山大学), 河北 秦皇岛 066004
    4.武汉理工大学 国际教育学院, 武汉 430070
  • 通讯作者: 张忠平
  • 作者简介:张忠平(1972—),男,吉林松原人,教授,博士,CCF会员,主要研究方向:大数据、数据挖掘、半结构化数据;Email:979935240@qq.com
    郭鑫(1997—),女,山西长治人,硕士研究生,CCF会员,主要研究方向:数据挖掘
    张玉停(1996—),男,安徽阜阳人,硕士,主要研究方向:数据挖掘
    张睿博(2002—),男,吉林松原人,主要研究方向:数据挖掘、能源大数据。

Abstract:

Constructing the transition probability matrix for outlier detection by using traditional graph-based methods requires the use of the overall distribution of the data, and the local information of the data is easily ignored, resulting in the problem of low detection accuracy, and using the local information of the data may lead to “suspended link” problem. Aiming at these problems, an Outlier Detection algorithm based on Hologram Stationary Distribution Factor (HSDFOD) was proposed. Firstly, a local information graph was constructed by adaptively obtaining the set of neighbors of each data point through the similarity matrix. Then, a global information graph was constructed by the minimum spanning tree. Finally, the local information graph and the global information graph were integrated into a hologram to construct a transition probability matrix for Markov random walk, and the outliers were detected through the generated stationary distribution. On the synthetic datasets A1 to A4, HDFSOD has higher precision than SOD (Outlier Detection in axis-parallel Subspaces of high dimensional data), SUOD (accelerating large-Scale Unsupervised heterogeneous Outlier Detection), IForest (Isolation Forest) and HBOS (Histogram-Based Outlier Score); and AUC (Area Under Curve) also better than the four comparison algorithms generally. On the real datasets, the precision of HSDFOD is higher than 80%, and the AUC of HSDFOD is higher than those of SOD, SUOD, IForest and HBOS. It can be seen that the proposed algorithm has a good application prospect in outlier detection.

Key words: outlier, hologram, transition probability matrix, Markov random walk, stationary distribution factor

摘要:

使用传统的基于图的方法进行离群点检测构造转移概率矩阵需要使用数据的整体分布,容易忽略数据的局部信息,导致检测精度低,而使用数据的局部信息可能导致“悬空链接”的问题。针对这些问题,提出一个基于全息图平稳分布因子的离群点检测算法(HSDFOD)。首先,使用相似度矩阵自适应地获取每个数据点的邻居集合构造一个局部信息图;然后,引入最小生成树构造一个全局信息图;最后,利用局部信息图和全局信息图融合为一个全息图构造转移概率矩阵进行马尔可夫随机游走,并通过生成的平稳分布检测离群点。在人工数据集A1~A4上,HSDFOD的精确率均高于SOD(Outlier Detection in axis-parallel Subspaces of high dimensional data)、SUOD(accelerating large-Scale Unsupervised heterogeneous Outlier Detection)、IForest (Isolation Forest)和HBOS (Histogram-Based Outlier Score);曲线下面积(AUC)整体上也优于这4个对比算法。在真实数据集上,HSDFOD的精确率均高于80%,AUC均高于SOD、SUOD、IForest和HBOS。可见,所提算法在离群点检测上有较好的应用前景。

关键词: 离群点, 全息图, 转移概率矩阵, 马尔可夫随机游走, 平稳分布因子

CLC Number: