Random forest based on double features and relaxation boundary for anomaly detection

doi:10.11772/j.issn.1001-9081.2018091966

Journal of Computer Applications, 2019, Vol. 39, Issue 4: 956-962. DOI: 10.11772/j.issn.1001-9081.2018091966

HU Miao^1,2, WANG Kaijun^1,2

1. College of Mathematics and Informatics, Fujian Normal University, Fuzhou, Fujian 350117, China;
2. Digital Fujian Internet-of-Things Laboratory of Environmental Monitoring, Fujian Normal University, Fuzhou, Fujian 350117, China

Received: 2018-09-25 Revised: 2018-11-06 Online: 2019-04-10 Published: 2019-04-10
Corresponding author: WANG Kaijun
About the authors: HU Miao (born 1994), male, from Taihe, Anhui; master's student; research interests: machine learning and data mining. WANG Kaijun (born 1965), male, from Fuzhou, Fujian; associate professor, Ph.D.; research interests: machine learning and data mining.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61672157) and the Natural Science Foundation of Fujian Province (2018J01778).

Abstract

To address the limited performance of existing random-forest-based anomaly detection algorithms, a random forest algorithm combining double features and a relaxation boundary is proposed for anomaly detection. First, while the binary decision trees of the random forest are constructed from normal-class data only, the value ranges of two features (one range per feature) are recorded in each tree node, and these double-feature value ranges serve as the basis for judging anomalies. Second, during anomaly detection, a sample that does not satisfy the double-feature value ranges of a tree node is marked as a candidate anomaly; otherwise, the sample moves to a lower-level node and the comparison continues, and it is marked as a candidate normal sample if no lower-level node exists. Finally, the decision mechanism of the random forest algorithm determines the class of the sample. Experimental results on five UCI datasets show that the proposed method outperforms existing random forest algorithms for anomaly detection, and that its overall performance is comparable to or better than that of Isolation Forest (iForest) and One-Class Support Vector Machine (OCSVM) while remaining stable at a high level.

Keywords: anomaly detection, Random Forest (RF), double-feature filtering, relaxation boundary
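The procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how features are chosen, how nodes are split, how the boundary is relaxed, or how the forest combines tree outputs. The sketch therefore assumes a random choice of two features per node, a median split on the first of them, a range widened by a fixed fraction of its width (`slack`), bootstrap sampling, and a majority vote. All function and parameter names are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Row = Sequence[float]
Range = Tuple[float, float]


@dataclass
class Node:
    feats: Tuple[int, int]        # indices of the two features recorded in this node
    ranges: Tuple[Range, Range]   # relaxed [low, high] value range for each feature
    split: float = 0.0            # routing threshold on feats[0]
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def relaxed_range(values: List[float], slack: float) -> Range:
    """Widen the observed [min, max] by a fraction of its width (assumed rule)."""
    lo, hi = min(values), max(values)
    margin = slack * (hi - lo)
    return lo - margin, hi + margin


def build_tree(rows, rng, slack=0.1, min_size=5, max_depth=8, depth=0) -> Node:
    """Grow one tree from normal-class rows only."""
    f1, f2 = rng.sample(range(len(rows[0])), 2)   # two distinct features (assumption)
    node = Node((f1, f2), (relaxed_range([r[f1] for r in rows], slack),
                           relaxed_range([r[f2] for r in rows], slack)))
    if len(rows) < 2 * min_size or depth >= max_depth:
        return node
    split = sorted(r[f1] for r in rows)[len(rows) // 2]   # median split (assumption)
    left = [r for r in rows if r[f1] < split]
    right = [r for r in rows if r[f1] >= split]
    if not left:   # every value ties at the median, so keep this node as a leaf
        return node
    node.split = split
    node.left = build_tree(left, rng, slack, min_size, max_depth, depth + 1)
    node.right = build_tree(right, rng, slack, min_size, max_depth, depth + 1)
    return node


def tree_flags_anomaly(node: Node, x: Row) -> bool:
    """Return True for a candidate anomaly, False for a candidate normal sample."""
    while True:
        for f, (lo, hi) in zip(node.feats, node.ranges):
            if not lo <= x[f] <= hi:
                return True       # outside a recorded range: candidate anomaly
        if node.left is None:
            return False          # no lower-level node: candidate normal
        node = node.left if x[node.feats[0]] < node.split else node.right


def build_forest(rows, n_trees=25, seed=0, **tree_args) -> List[Node]:
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]   # bootstrap sample of the normal data
        forest.append(build_tree(sample, rng, **tree_args))
    return forest


def anomaly_score(forest: List[Node], x: Row) -> float:
    """Fraction of trees that mark x as a candidate anomaly."""
    return sum(tree_flags_anomaly(t, x) for t in forest) / len(forest)


def is_anomaly(forest: List[Node], x: Row, threshold: float = 0.5) -> bool:
    return anomaly_score(forest, x) > threshold   # majority vote (assumption)


if __name__ == "__main__":
    data_rng = random.Random(42)
    normal = [[data_rng.random() for _ in range(3)] for _ in range(200)]
    forest = build_forest(normal)
    print(is_anomaly(forest, [10.0, 10.0, 10.0]))   # point far outside the training ranges
    print(anomaly_score(forest, [0.5, 0.5, 0.5]))   # point inside the training region
```

In this sketch, a larger `slack` loosens every node's boundary, so fewer borderline samples are rejected early. The trade-off is that anomalies lying close to the normal region become harder to catch.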

CLC number: TP311

HU Miao, WANG Kaijun. Random forest based on double features and relaxation boundary for anomaly detection[J]. Journal of Computer Applications, 2019, 39(4): 956-962.


