Journal of Computer Applications, 2019, Vol. 39, Issue (4): 956-962. DOI: 10.11772/j.issn.1001-9081.2018091966

• Artificial Intelligence •

Anomaly detection using a random forest based on double features and a relaxation boundary

HU Miao1,2, WANG Kaijun1,2

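  1. College of Mathematics and Informatics, Fujian Normal University, Fuzhou 350117;
    2. Digital Fujian Internet-of-Things Laboratory of Environmental Monitoring, Fujian Normal University, Fuzhou 350117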
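  • Received: 2018-09-25 Revised: 2018-11-06 Published: 2019-04-10 Online: 2019-04-10
  • Corresponding author: WANG Kaijun
  • About the authors: HU Miao (born 1994), male, from Taihe, Anhui, master's student; main research interests: machine learning and data mining. WANG Kaijun (born 1965), male, from Fuzhou, Fujian, associate professor, Ph.D.; main research interests: machine learning and data mining.
  • Supported by:
    National Natural Science Foundation of China (61672157); Natural Science Foundation of Fujian Province (2018J01778).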

Random forest based on double features and relaxation boundary for anomaly detection

HU Miao1,2, WANG Kaijun1,2   

  1. College of Mathematics and Informatics, Fujian Normal University, Fuzhou Fujian 350117, China;
    2. Digital Fujian Internet-of-Things Laboratory of Environmental Monitoring, Fujian Normal University, Fuzhou Fujian 350117, China
  • Received:2018-09-25 Revised:2018-11-06 Online:2019-04-10 Published:2019-04-10
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61672157) and the Natural Science Foundation of Fujian Province (2018J01778).

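Abstract: To address the low performance of existing anomaly detection algorithms based on random forest, a random forest algorithm combining double features and a relaxation boundary is proposed for anomaly detection. First, while the classification decision trees of the random forest are built using only normal-class data, the value ranges of two features (one range per feature) are recorded in each node of the binary decision tree, and these double-feature value ranges serve as the basis for judging anomalies. Then, during anomaly detection, a sample that does not satisfy the double-feature value ranges in a decision tree node is marked as a candidate anomaly; otherwise, the sample enters the lower-level node of the tree and continues the comparison against the feature value ranges, and if there is no lower-level node it is marked as a candidate normal sample. Finally, the discrimination mechanism of the random forest algorithm decides the class of the sample. Anomaly detection experiments on five UCI datasets show that the proposed method outperforms existing random forest algorithms for anomaly detection, and that its overall performance is comparable to or better than that of isolation forest (iForest) and one-class support vector machine (OCSVM), while remaining stable at a high level.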

Key words: anomaly detection, random forest, double-feature filtering, relaxation boundary

Abstract: Aiming at the low performance of existing anomaly detection algorithms based on random forest, a random forest algorithm combining double features and a relaxation boundary was proposed for anomaly detection. Firstly, while the binary decision trees of the random forest were constructed with normal-class data only, the value ranges of two features (one range per feature) were recorded in each node of the binary decision tree, and these double-feature value ranges were used as the basis for judging anomalies. Secondly, during anomaly detection, if a sample did not satisfy the double-feature value ranges in a decision tree node, the sample was marked as a candidate anomaly; otherwise, the sample entered the lower node of the decision tree and the comparison continued with the double-feature value ranges stored there. The sample was marked as a candidate normal sample if there were no lower nodes. Finally, the discriminative mechanism of the random forest algorithm was used to decide the class of the sample. Experimental results on five UCI datasets show that the proposed method performs better than the existing random forest algorithms for anomaly detection, and that its comprehensive performance is equivalent to or better than that of isolation Forest (iForest) and One-Class SVM (OCSVM) while remaining stable at a high level.

Key words: anomaly detection, Random Forest (RF), double-feature filtering, relaxation boundary
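
The detection procedure summarized in the abstract can be illustrated with a rough Python sketch. This is not the authors' implementation: the abstract does not say how the two features are chosen, how nodes are split, how the boundary is relaxed, or how the forest combines tree outputs. The sketch therefore assumes random feature pairs, a median split on the first feature, a margin of `relax` times the observed range, and a simple majority vote. All names (`Node`, `build_tree`, `build_forest`, `is_anomaly`, `relax`, `threshold`) are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Sample = Sequence[float]
Range = Tuple[float, float]


@dataclass
class Node:
    feats: Tuple[int, int]            # indices of the two recorded features
    ranges: Tuple[Range, Range]       # relaxed [lo, hi] range for each feature
    split: float = 0.0                # split value on feats[0] (assumed rule)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(data: List[Sample], rng: random.Random, relax: float = 0.1,
               min_size: int = 4, depth: int = 0, max_depth: int = 8) -> Node:
    """Build one tree from normal-class samples only (needs >= 2 features)."""
    f1, f2 = rng.sample(range(len(data[0])), 2)  # assumption: random pair
    ranges = []
    for f in (f1, f2):
        lo = min(x[f] for x in data)
        hi = max(x[f] for x in data)
        margin = relax * (hi - lo)               # assumption: relaxed boundary
        ranges.append((lo - margin, hi + margin))
    node = Node((f1, f2), (ranges[0], ranges[1]))
    if len(data) < min_size or depth >= max_depth:
        return node                              # leaf
    values = sorted(x[f1] for x in data)
    node.split = values[len(values) // 2]        # assumption: median split
    left = [x for x in data if x[f1] < node.split]
    right = [x for x in data if x[f1] >= node.split]
    if left and right:
        node.left = build_tree(left, rng, relax, min_size, depth + 1, max_depth)
        node.right = build_tree(right, rng, relax, min_size, depth + 1, max_depth)
    return node


def tree_flags_anomaly(node: Node, x: Sample) -> bool:
    """True if x violates a double-feature range on its path (candidate anomaly)."""
    while True:
        for f, (lo, hi) in zip(node.feats, node.ranges):
            if not lo <= x[f] <= hi:
                return True                      # candidate anomaly
        if node.left is None or node.right is None:
            return False                         # no lower node: candidate normal
        node = node.left if x[node.feats[0]] < node.split else node.right


def build_forest(data: List[Sample], n_trees: int = 25, seed: int = 0,
                 relax: float = 0.1) -> List[Node]:
    """Train each tree on a bootstrap sample of the normal-class data."""
    rng = random.Random(seed)
    return [build_tree([rng.choice(data) for _ in data], rng, relax)
            for _ in range(n_trees)]


def is_anomaly(forest: List[Node], x: Sample, threshold: float = 0.5) -> bool:
    """Assumed decision rule: majority vote over the per-tree candidate labels."""
    votes = sum(tree_flags_anomaly(tree, x) for tree in forest)
    return votes / len(forest) > threshold
```

Under these assumptions, a point far outside the training data, such as (10, 10, 10) when every training feature lies in [0, 1], fails the range check at the root of every tree and is flagged by the vote. Points near the training data can still be flagged in deeper nodes, where ranges are computed from few samples. Widening `relax` is one way to trade that sensitivity off.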
