Journal of Computer Applications, 2019, Vol. 39, Issue (3): 623-628. DOI: 10.11772/j.issn.1001-9081.2018071513

Decision tree improvement method for imbalanced data

WANG Wei, XIE Yaobin, YIN Qing   

  1. State Key Laboratory of Mathematic Engineering and Advanced Computing (Information Engineering University), Zhengzhou Henan 450000, China
  • Received: 2018-07-23 Revised: 2018-09-06 Online: 2019-03-10 Published: 2019-03-11
  • Supported by:

    This work is partially supported by the National Natural Science Foundation of China (61802431).

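Improved decision tree methods for imbalanced data

WANG Wei, XIE Yaobin, YIN Qing

  1. State Key Laboratory of Mathematic Engineering and Advanced Computing (Strategic Support Force Information Engineering University), Zhengzhou 450000
  • Corresponding author: XIE Yaobin
  • About the authors: WANG Wei (born 1993), male, from Wenzhou, Zhejiang, master's student; research interests: industrial control security, machine learning. XIE Yaobin (born 1981), male, from Longhai, Fujian, associate professor, M.S.; research interests: industrial control security. YIN Qing (born 1968), female, from Xuzhou, Jiangsu, professor, Ph.D.; research interests: information security, formal verification, reverse engineering.
  • Funding:

    National Natural Science Foundation of China (61802431).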

Abstract:

To address the performance degradation of decision trees caused by the severe imbalance between abnormal and normal data in anomaly detection, three improved methods for the C4.5 decision tree are proposed: C4.5+δ, UDE (Uniform Distribution Entropy) and IDEF (Improved Distribution Entropy Function). Firstly, it is shown by derivation that the attribute selection criterion of C4.5 tends to favor attributes that produce imbalanced splits. Secondly, the reason why imbalanced splits reduce the detection accuracy for anomalies (the minority class) is analyzed. Thirdly, the attribute selection criterion of C4.5, the information gain ratio, is improved by introducing a relaxation factor, by introducing uniform distribution entropy, or by substituting the distribution entropy function. Finally, the three improved decision trees are verified on the WEKA platform with the NSL-KDD dataset. Experimental results show that all three proposed methods increase the accuracy of anomaly detection. Compared with C4.5, the minority-class detection accuracy (sensitivity) of C4.5+δ, UDE and IDEF on the KDDTest-21 dataset improves by 3.16, 3.02 and 3.12 percentage points respectively, and all three outperform methods that use Rényi entropy or Tsallis entropy as the splitting criterion. Furthermore, using the improved decision trees to detect anomalies in an industrial control system not only improves the recall of anomalies but also reduces the false positive rate.

Key words: imbalanced data, anomaly detection, decision tree, C4.5, information gain ratio
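
The abstract's first observation, that the gain ratio criterion of C4.5 tends to favor imbalanced splits, can be illustrated with a small sketch. The code below uses only the standard textbook definitions (information gain divided by the entropy of the branch sizes); it does not implement C4.5+δ, UDE or IDEF, whose formulas are not given on this page. The function names and the class counts are hypothetical, chosen purely for illustration.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def info_gain(parent, children):
    """Information gain of a split.

    parent:   class counts before the split, e.g. [normal, anomaly]
    children: one class-count list per branch of the split
    """
    n = sum(parent)
    weighted = sum(sum(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted


def gain_ratio(parent, children):
    """Standard gain ratio: information gain / entropy of the branch sizes."""
    split_info = entropy([sum(ch) for ch in children])
    return info_gain(parent, children) / split_info if split_info > 0 else 0.0


# Hypothetical node with 900 normal and 100 anomalous samples.
parent = [900, 100]
balanced = [[480, 20], [420, 80]]  # two branches of 500 samples each
skewed = [[895, 95], [5, 5]]       # branches of 990 and 10 samples

for name, split in (("balanced", balanced), ("skewed", skewed)):
    print(name, round(info_gain(parent, split), 4), round(gain_ratio(parent, split), 4))
```

In this made-up example the balanced split has the larger information gain, yet the skewed split receives the larger gain ratio, because the entropy of its branch sizes (the denominator) is close to zero. This is only the general effect the abstract refers to; the paper's own derivation may proceed differently.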

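Chinese abstract:

To address the degradation of decision tree performance caused by the severely imbalanced ratio of abnormal to normal data in anomaly detection, three improved methods for the C4.5 decision tree are proposed: C4.5+δ, uniform distribution entropy (UDE) and improved distribution entropy function (IDEF). First, it is derived that the attribute selection criterion of the C4.5 algorithm tends to select attributes with skewed splits. Then, the reason why skewed splits lower the detection accuracy for anomalies (the minority class) is analyzed. Next, the attribute selection criterion of C4.5, the information gain ratio, is improved by introducing a relaxation factor, by introducing uniform distribution entropy, or by replacing the distribution entropy function. Finally, the improved decision trees are validated using the WEKA platform and the NSL-KDD dataset. Experimental results show that all three improved methods raise anomaly detection accuracy. Compared with C4.5, the minority-class detection accuracy (sensitivity) of the C4.5+δ, UDE and IDEF algorithms on the KDDTest-21 dataset increases by 3.16, 3.02 and 3.12 percentage points respectively, in each case better than methods that use Rényi entropy or Tsallis entropy as the splitting criterion. In addition, using the three improved decision trees to detect anomalies in industrial control systems not only raises the recall of anomalies but also lowers the false alarm rate.

Chinese key words: imbalanced data, anomaly detection, decision tree, C4.5, information gain ratio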

CLC Number: