计算机应用 ›› 2021, Vol. 41 ›› Issue (6): 1679-1685.DOI: 10.11772/j.issn.1001-9081.2020091436

所属专题: 数据科学与技术

• 数据科学与技术 • 上一篇    下一篇

基于随机子空间的扩展隔离林算法

谢雨, 蒋瑜, 龙超奇   

  1. 成都信息工程大学 软件工程学院, 成都 610225
  • 收稿日期:2020-09-15 修回日期:2020-11-27 出版日期:2021-06-10 发布日期:2020-12-09
  • 通讯作者: 蒋瑜
  • 作者简介:谢雨(1996-),男,四川内江人,硕士研究生,主要研究方向:数据挖掘、智能计算、异常检测;蒋瑜(1980-),男,四川邻水人,副教授,硕士,主要研究方向:入侵检测、粗糙集、数据挖掘、智能计算;龙超奇(1996-),男,四川德阳人,硕士研究生,主要研究方向:数据挖掘、智能计算、小波聚类。

Extended isolation forest algorithm based on random subspace

XIE Yu, JIANG Yu, LONG Chaoqi   

  1. School of Software Engineering, Chengdu University of Information Technology, Chengdu Sichuan 610225, China
  • Received:2020-09-15 Revised:2020-11-27 Online:2021-06-10 Published:2020-12-09

摘要: 针对扩展隔离林(EIF)算法时间开销过大的问题,提出了一种基于随机子空间的扩展隔离林(RS-EIF)算法。首先,在原数据空间确定多个随机子空间;然后,在不同的随机子空间中通过计算每个节点的截距向量与斜率来构建扩展孤立树,并将多棵扩展孤立树集成为子空间扩展隔离林;最后,通过计算数据点在扩展隔离林中的平均遍历深度来确定数据点是否异常。在离群值检测数据库(ODDS)中的9个真实数据集与呈多元分布的7个人工数据集上的实验结果表明,所提RS-EIF算法对局部异常很敏感,相较EIF算法减少了约60%的时间开销;在样本数量较多的ODDS数据集上,该算法识别精度高出孤立森林(iForest)算法、轻型在线异常检测(LODA)算法和基于连接函数的异常检测(COPOD)算法2~12个百分点。RS-EIF算法在样本数量大的数据集中识别效率更高。

关键词: 异常检测, 随机子空间, 扩展隔离林算法, 扩展孤立树, 平均遍历深度

Abstract: Aiming at the problem of excessive time overhead of the Extended Isolation Forest (EIF) algorithm, a new algorithm named Extended Isolation Forest based on Random Subspace (RS-EIF) was proposed. Firstly, multiple random subspaces were determined in the original data space. Then, in each random subspace, the extended isolated tree was constructed by calculating the intercept vector and slope of each node, and multiple extended isolated trees were integrated into a subspace extended isolation forest. Finally, the average traversal depth of data point in the extended isolation forest was calculated to determine whether the data point was abnormal. Experimental results on 9 real datasets in Outliter Detection DataSet (ODDS) and 7 synthetic datasets with multivariate distribution show that, the RS-EIF algorithm is sensitive to local anomalies and reduces the time overhead by about 60% compared with the EIF algorithm; on the ODDS datasets with many samples, its recognition accuracy is 2 percentage points to 12 percentage points higher than those of the isolation Forest (iForest) algorithm, Lightweight On-line Detection of Anomalies (LODA) algorithm and COPula-based Outlier Detection (COPOD) algorithm. The RS-EIF algorithm has the higher recognition efficiency in the dataset with a large number of samples.

Key words: anomaly detection, random subspace, Extended Isolation Forest (EIF) algorithm, extended isolated tree, average traversal depth

中图分类号: