Journal of Computer Applications ›› 2015, Vol. 35 ›› Issue (12): 3348-3351. DOI: 10.11772/j.issn.1001-9081.2015.12.3348

• Network and communications •

Imbalanced network traffic classification method based on improved rotation forest algorithm

DING Yaojun

  1. School of Information Engineering, Gansu Institute of Political Science and Law, Lanzhou Gansu 730070, China
  • Received: 2015-05-18 Revised: 2015-07-19 Online: 2015-12-10 Published: 2015-12-10
  • Corresponding author: DING Yaojun
  • About the author: DING Yaojun (1980-), male, born in Xuchang, Henan, associate professor, Ph.D., CCF member. His research interests include network security and machine learning.
  • Supported by:
    Key Fund Project of Gansu Institute of Political Science and Law (GZF2014XZDLW15).

Abstract: To address the low classification accuracy on imbalanced network traffic, an improved rotation forest algorithm is proposed. It combines the rotation forest algorithm with the Bootstrap sampling of the Bagging algorithm and a base classifier selection algorithm that ranks classifiers by accuracy. First, the features of the original training set are partitioned into subsets, each subset is sampled with Bagging, and Principal Component Analysis (PCA) is applied to generate a matrix of principal component coefficients. Then, the original training set is transformed with this coefficient matrix to generate new training subsets; Bagging is applied again to these subsets to increase the diversity of the training sets, and the resulting subsets are used to train C4.5 base classifiers. Finally, the base classifiers are evaluated on the testing set, ranked by overall classification accuracy, and filtered so that only the more accurate classifiers are retained to produce a consensus classification result. Experiments were carried out on an imbalanced network traffic data set. Precision and recall were used to evaluate four algorithms (C4.5, Bagging, rotation forest, and the improved rotation forest), and model training time and testing time were used to evaluate their time efficiency. The experimental results show that the classification accuracy of the improved rotation forest algorithm is above 99.5% on the World Wide Web (WWW), Mail, Attack, and Peer-to-Peer (P2P) protocols, and that its recall is also higher than that of rotation forest, Bagging, and C4.5. The proposed algorithm can be used for network intrusion forensics, maintaining network security, and improving the quality of network service.

Key words: Principal Component Analysis (PCA), ensemble learning, imbalanced network traffic, rotation forest, decision tree
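
The final step described in the abstract (rank the base classifiers by overall accuracy, keep the better ones, and combine their outputs) can be illustrated with a minimal sketch, together with the per-class precision and recall used for evaluation. This is not the author's implementation. The function names, the `keep` parameter, the toy labels, and the use of a plurality vote to form the consensus result are all assumptions made for illustration, and the rotation (PCA) and C4.5 training steps are omitted: each base classifier is represented only by its list of predictions.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def select_top_classifiers(all_predictions, labels, keep):
    """Rank base classifiers by overall accuracy and keep the best `keep`.

    all_predictions[i] holds classifier i's predictions on the evaluation set.
    Returns the indices of the retained classifiers, most accurate first.
    `keep` is a hypothetical parameter; the abstract does not state a cutoff.
    """
    ranked = sorted(
        range(len(all_predictions)),
        key=lambda i: accuracy(all_predictions[i], labels),
        reverse=True,
    )
    return ranked[:keep]


def plurality_vote(all_predictions, selected):
    """Combine the selected classifiers by taking the most common label per sample."""
    n_samples = len(all_predictions[selected[0]])
    voted = []
    for j in range(n_samples):
        votes = Counter(all_predictions[i][j] for i in selected)
        voted.append(votes.most_common(1)[0][0])
    return voted


def precision_recall(predictions, labels, cls):
    """Per-class precision and recall, treating `cls` as the positive class."""
    tp = sum(p == cls and y == cls for p, y in zip(predictions, labels))
    predicted = sum(p == cls for p in predictions)
    actual = sum(y == cls for y in labels)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall


# Toy, made-up data: an imbalanced label set dominated by WWW flows.
labels = ["WWW", "WWW", "WWW", "WWW", "Mail", "P2P"]
base_predictions = [
    ["WWW", "WWW", "WWW", "WWW", "Mail", "WWW"],   # misses the P2P flow
    ["WWW", "WWW", "WWW", "WWW", "WWW", "P2P"],    # misses the Mail flow
    ["WWW", "WWW", "WWW", "Mail", "Mail", "P2P"],  # one WWW flow mislabeled
    ["Mail", "Mail", "WWW", "WWW", "WWW", "WWW"],  # weak classifier
]

selected = select_top_classifiers(base_predictions, labels, keep=3)
ensemble = plurality_vote(base_predictions, selected)
```

In this toy case the weak fourth classifier is filtered out, and the vote among the remaining three recovers the minority-class flows that each of them misses individually. Reporting `precision_recall` per protocol, rather than overall accuracy alone, is what exposes errors on minority classes in an imbalanced data set.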

CLC number: