Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (3): 623-628. DOI: 10.11772/j.issn.1001-9081.2018071513

Decision tree improvement method for imbalanced data

WANG Wei, XIE Yaobin, YIN Qing   

  1. State Key Laboratory of Mathematic Engineering and Advanced Computing (Information Engineering University), Zhengzhou Henan 450000, China
  • Received: 2018-07-23 Revised: 2018-09-06 Online: 2019-03-11 Published: 2019-03-10
  • Supported by:

    This work is partially supported by the National Natural Science Foundation of China (61802431).


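  • Corresponding author: XIE Yaobin
  • About the authors: WANG Wei (born 1993), male, from Wenzhou, Zhejiang, M.S. candidate; research interests: industrial control system security and machine learning. XIE Yaobin (born 1981), male, from Longhai, Fujian, associate professor, M.S.; research interest: industrial control system security. YIN Qing (born 1968), female, from Xuzhou, Jiangsu, professor, Ph.D.; research interests: information security, formal verification and reverse analysis.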



To address the performance degradation of decision trees caused by the severe imbalance between abnormal and normal data in anomaly detection, three improvements to the C4.5 decision tree are proposed: C4.5+δ, UDE (Uniform Distribution Entropy) and IDEF (Improved Distribution Entropy Function). First, it is shown that the attribute selection criterion of C4.5 tends to favor attributes that produce imbalanced splits. Second, the reason why imbalanced splitting lowers the accuracy of anomaly (minority-class) detection is analyzed. Third, the attribute selection criterion of C4.5, the information gain ratio, is improved by introducing a relaxation factor, by introducing uniform distribution entropy, or by substituting the distribution entropy function. Finally, the three improved decision trees are evaluated on the WEKA platform with the NSL-KDD dataset. Experimental results show that all three proposed methods increase the accuracy of anomaly detection. Compared with C4.5, the accuracies of C4.5+δ, UDE and IDEF on the KDDTest-21 dataset improve by 3.16, 3.02 and 3.12 percentage points respectively, outperforming methods that use Rényi entropy or Tsallis entropy as the splitting criterion. Furthermore, applying the improved decision trees to anomaly detection in an industrial control system not only improves the recall of anomalies but also reduces the false positive rate.
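The claim that the C4.5 criterion favors imbalanced splits can be illustrated with the standard gain ratio definition (information gain divided by split information). The sketch below is only an illustration under that textbook definition; it does not reproduce the C4.5+δ, UDE or IDEF formulas, which the abstract does not state, and the function names are hypothetical.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def split_info(branch_sizes):
    """Split information: entropy of the branch-size distribution."""
    return entropy(branch_sizes)


def gain_ratio(parent_counts, branches):
    """Textbook C4.5 gain ratio.

    parent_counts: class counts before the split, e.g. [normal, abnormal].
    branches: one class-count list per branch of the candidate split.
    """
    total = sum(parent_counts)
    conditional = sum(sum(b) / total * entropy(b) for b in branches)
    gain = entropy(parent_counts) - conditional
    si = split_info([sum(b) for b in branches])
    return gain / si if si > 0 else 0.0


# The denominator shrinks as the split becomes more lopsided, so for the
# same information gain an imbalanced split receives a larger gain ratio.
balanced = split_info([50, 50])    # 1.0 bit
imbalanced = split_info([95, 5])   # roughly 0.29 bit
```

With equal information gain, the 95/5 split here would score roughly 3.5 times higher than the 50/50 split, which is the bias the improved criteria are meant to counteract.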

Key words: imbalanced data, anomaly detection, decision tree, C4.5, information gain ratio




