Extended isolation forest algorithm based on random subspace

doi:10.11772/j.issn.1001-9081.2020091436

Journal of Computer Applications, 2021, Vol. 41, Issue (6): 1679-1685. DOI: 10.11772/j.issn.1001-9081.2020091436

Special topic: Data science and technology

XIE Yu, JIANG Yu, LONG Chaoqi

School of Software Engineering, Chengdu University of Information Technology, Chengdu, Sichuan 610225, China

Received: 2020-09-15 Revised: 2020-11-27 Published: 2021-06-10 Online: 2020-12-09
Corresponding author: JIANG Yu
About the authors: XIE Yu (born 1996), male, from Neijiang, Sichuan, M.S. candidate; research interests: data mining, intelligent computing, anomaly detection. JIANG Yu (born 1980), male, from Linshui, Sichuan, associate professor, M.S.; research interests: intrusion detection, rough sets, data mining, intelligent computing. LONG Chaoqi (born 1996), male, from Deyang, Sichuan, M.S. candidate; research interests: data mining, intelligent computing, wavelet clustering.

Abstract

Abstract: To address the excessive time overhead of the Extended Isolation Forest (EIF) algorithm, an Extended Isolation Forest based on Random Subspace (RS-EIF) algorithm is proposed. First, multiple random subspaces are determined in the original data space. Then, in each random subspace, extended isolated trees are constructed by calculating the intercept vector and slope of each node, and these trees are integrated into a subspace extended isolation forest. Finally, the average traversal depth of each data point in the extended isolation forest is calculated to determine whether the point is anomalous. Experimental results on 9 real datasets from the Outlier Detection DataSets (ODDS) library and 7 synthetic datasets with multivariate distributions show that RS-EIF is sensitive to local anomalies and reduces time overhead by about 60% compared with EIF. On the ODDS datasets with many samples, its recognition accuracy is 2 to 12 percentage points higher than that of the Isolation Forest (iForest) algorithm, the Lightweight On-line Detector of Anomalies (LODA) algorithm, and the COPula-based Outlier Detection (COPOD) algorithm. RS-EIF achieves higher recognition efficiency on datasets with a large number of samples.

Key words: anomaly detection, random subspace, Extended Isolation Forest (EIF) algorithm, extended isolated tree, average traversal depth
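
The three steps summarized in the abstract (pick random subspaces, build extended isolated trees in them, score points by average traversal depth) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how subspaces are chosen or how hyperplanes are sampled, so the sketch assumes one random feature subset per tree, a Gaussian-distributed normal vector (the "slope"), and an intercept point drawn uniformly from the node's bounding box, as EIF is commonly described. The score normalization `2 ** (-E[h] / c(psi))` is borrowed from iForest. All names (`fit_rs_eif`, `anomaly_scores`, `subspace_dim`, and so on) are hypothetical.

```python
import numpy as np

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Average path length of an unsuccessful BST search over n points (iForest)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (np.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(X, depth, max_depth, rng):
    """Recursively split X with random hyperplanes until points are isolated."""
    n = X.shape[0]
    if depth >= max_depth or n <= 1:
        return {"size": n}  # leaf: remember how many points ended up here
    lo, hi = X.min(axis=0), X.max(axis=0)
    normal = rng.standard_normal(X.shape[1])  # assumed: Gaussian slope vector
    intercept = rng.uniform(lo, hi)           # assumed: point in bounding box
    left = (X - intercept) @ normal <= 0
    return {
        "normal": normal,
        "intercept": intercept,
        "left": build_tree(X[left], depth + 1, max_depth, rng),
        "right": build_tree(X[~left], depth + 1, max_depth, rng),
    }


def path_length(x, node, depth=0):
    """Depth at which x reaches a leaf, plus the leaf-size adjustment."""
    if "size" in node:
        return depth + c_factor(node["size"])
    side = "left" if (x - node["intercept"]) @ node["normal"] <= 0 else "right"
    return path_length(x, node[side], depth + 1)


def fit_rs_eif(X, n_trees=100, sample_size=256, subspace_dim=2, seed=0):
    """Build one extended isolated tree per randomly chosen feature subspace."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    psi = min(sample_size, n)
    k = min(subspace_dim, d)
    max_depth = int(np.ceil(np.log2(psi)))
    forest = []
    for _ in range(n_trees):
        features = rng.choice(d, size=k, replace=False)  # random subspace
        rows = rng.choice(n, size=psi, replace=False)    # subsample
        tree = build_tree(X[np.ix_(rows, features)], 0, max_depth, rng)
        forest.append((features, tree))
    return forest, psi


def anomaly_scores(X, forest, psi):
    """Scores in (0, 1); a shorter average traversal depth gives a higher score."""
    depths = np.array([[path_length(x[f], t) for f, t in forest] for x in X])
    return 2.0 ** (-depths.mean(axis=1) / c_factor(psi))
```

Because each split only touches `subspace_dim` features instead of all of them, every hyperplane test is cheaper than in full-dimensional EIF, which is one plausible reading of where the reported time savings come from. The value of `subspace_dim` here is an arbitrary default chosen for illustration.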

CLC number: TP181

XIE Yu, JIANG Yu, LONG Chaoqi. Extended isolation forest algorithm based on random subspace[J]. Journal of Computer Applications, 2021, 41(6): 1679-1685.

