Random forest based on double features and relaxation boundary for anomaly detection

doi:10.11772/j.issn.1001-9081.2018091966

Journal of Computer Applications, 2019, Vol. 39, Issue 4: 956-962. DOI: 10.11772/j.issn.1001-9081.2018091966

HU Miao^1,2, WANG Kaijun^1,2

1. College of Mathematics and Informatics, Fujian Normal University, Fuzhou Fujian 350117, China;
2. Digital Fujian Internet-of-Things Laboratory of Environmental Monitoring, Fujian Normal University, Fuzhou Fujian 350117, China

Received: 2018-09-25  Revised: 2018-11-06  Online: 2019-04-10  Published: 2019-04-10
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61672157) and the Natural Science Foundation of Fujian Province (2018J01778).

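Corresponding author: WANG Kaijun
About the authors: HU Miao (born 1994), male, from Taihe, Anhui, M.S. candidate; research interests: machine learning and data mining. WANG Kaijun (born 1965), male, from Fuzhou, Fujian, associate professor, Ph.D.; research interests: machine learning and data mining.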

Abstract

To address the low performance of existing random-forest-based anomaly detection algorithms, a random forest algorithm combining double features and a relaxation boundary is proposed for anomaly detection. First, while the binary decision trees of the random forest are built from normal-class data only, each tree node records the value ranges of two features (one range per feature), and these double-feature value ranges serve as the basis for judging anomalies. Second, during detection, a sample that falls outside the double-feature value ranges of a tree node is marked as a candidate anomaly; otherwise, it passes to the lower-level nodes of the tree and is compared against their double-feature value ranges, and it is marked as a candidate normal sample if no lower-level node exists. Finally, the decision mechanism of the random forest determines the class of the sample. Experimental results on five UCI datasets show that the proposed method outperforms existing random forest algorithms for anomaly detection, and that its overall performance is comparable to or better than that of Isolation Forest (iForest) and One-Class SVM (OCSVM) while remaining stable at a high level.
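The detection procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the relaxation boundary is computed, how nodes are split, or how votes are combined, so the sketch assumes a fractional margin (`relax`) added to each recorded range, a median split on the first of the two recorded features, bootstrap sampling per tree, and a simple majority vote. All function and parameter names are hypothetical.

```python
import random

def build_tree(rows, depth, rng, relax=0.1, min_size=4):
    """Build one tree from normal-class rows only (illustrative sketch)."""
    f1, f2 = rng.sample(range(len(rows[0])), 2)  # two features per node
    ranges = {}
    for f in (f1, f2):
        lo = min(r[f] for r in rows)
        hi = max(r[f] for r in rows)
        margin = relax * (hi - lo)  # ASSUMED form of the relaxation boundary
        ranges[f] = (lo - margin, hi + margin)
    node = {"ranges": ranges, "split": None, "left": None, "right": None}
    if depth > 0 and len(rows) >= min_size:
        values = sorted(r[f1] for r in rows)
        threshold = values[len(values) // 2]  # ASSUMED split rule: median of f1
        left = [r for r in rows if r[f1] < threshold]
        right = [r for r in rows if r[f1] >= threshold]
        if left and right:
            node["split"] = (f1, threshold)
            node["left"] = build_tree(left, depth - 1, rng, relax, min_size)
            node["right"] = build_tree(right, depth - 1, rng, relax, min_size)
    return node

def tree_flags_anomaly(node, x):
    """Return True if x leaves the recorded double-feature ranges at any node."""
    while True:
        for f, (lo, hi) in node["ranges"].items():
            if not lo <= x[f] <= hi:
                return True  # candidate anomaly
        if node["split"] is None:
            return False  # no lower node: candidate normal
        f, threshold = node["split"]
        node = node["left"] if x[f] < threshold else node["right"]

def build_forest(rows, n_trees=25, depth=4, seed=0):
    """Grow n_trees trees, each on a bootstrap sample (ASSUMED, as in bagging)."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]
        forest.append(build_tree(sample, depth, rng))
    return forest

def is_anomaly(forest, x):
    """ASSUMED decision rule: majority vote over the per-tree candidate labels."""
    votes = sum(tree_flags_anomaly(tree, x) for tree in forest)
    return votes > len(forest) / 2

# Usage: synthetic "normal" data in the unit cube, then a far-away query point.
data_rng = random.Random(1)
normal_rows = [tuple(data_rng.random() for _ in range(3)) for _ in range(200)]
forest = build_forest(normal_rows)
print(is_anomaly(forest, (5.0, 5.0, 5.0)))  # True: outside every root-node range
```

In this sketch a larger `relax` widens every recorded range, so fewer borderline normal samples are flagged, at the cost of missing anomalies that lie close to the normal data.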

Key words: anomaly detection, Random Forest (RF), double-feature filtering, relaxation boundary

CLC Number: TP311

HU Miao, WANG Kaijun. Random forest based on double features and relaxation boundary for anomaly detection[J]. Journal of Computer Applications, 2019, 39(4): 956-962.
