Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (8): 2327-2333.DOI: 10.11772/j.issn.1001-9081.2019122241

• Network and communications •

Classification model for class imbalanced traffic data

LIU Dan1,2, YAO Lishuang1,2, WANG Yunfeng1,2, PEI Zuofei1,2   

  1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China;
    2. Chongqing Key Lab of Mobile Communications Technology (Chongqing University of Posts and Telecommunications), Chongqing 400065, China
  • Received:2020-01-07 Revised:2020-03-30 Online:2020-08-10 Published:2020-05-14
  • Supported by:
    This work is partially supported by the Program for Changjiang Scholars and Innovative Research Team in University (IRT_16R72).

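  • Corresponding author: LIU Dan (刘丹), born in 1995, female, from Mianyang, Sichuan, M.S. candidate. Her research interests include machine learning and network management. E-mail: 951737517@qq.com
  • About the authors: YAO Lishuang (姚立霜), born in 1994, female, from Chongqing, M.S. candidate; research interests: machine learning and network management. WANG Yunfeng (王云锋), born in 1992, male, from Anshan, Liaoning, M.S. candidate; research interests: machine learning and data mining. PEI Zuofei (裴作飞), born in 1994, male, from Yancheng, Jiangsu, M.S. candidate; research interests: machine learning and data mining.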

Abstract: In network traffic classification, traditional models classify minority classes poorly and cannot be updated frequently or in a timely manner. To address these problems, a network Traffic Classification Model based on Ensemble Learning (ELTCM) was proposed. First, to reduce the impact of class imbalance, feature metrics biased towards minority classes were defined according to the class distribution information, and weighted symmetric uncertainty and the Approximate Markov Blanket (AMB) were used to reduce the dimensionality of the network traffic features. Then, early concept drift detection was introduced to strengthen the model's ability to cope with traffic features that change as the network changes, and incremental learning was used to make model update training more flexible. Experimental results on real traffic datasets show that, compared with the Internet Traffic Classification based on C4.5 Decision Tree (DTITC) and the Classification Model for Concept Drift Detection based on Error Rate (ERCDD), the proposed ELTCM increases the average overall accuracy by 1.13% and 0.26% respectively, and its classification performance on every minority class is higher than that of both comparison models. ELTCM generalizes well and can effectively improve the classification performance on minority classes without sacrificing overall classification accuracy.

Key words: traffic classification, class imbalance, feature selection, incremental learning, ensemble learning
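
The feature-reduction step named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the minority-biased metric, so the inverse-class-frequency sample weights below are an assumption made only for illustration, and all function names are hypothetical. Symmetric uncertainty is taken in its standard form, SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), and the approximate Markov blanket test follows the criterion commonly used in FCBF-style selection: a kept feature F_i makes F_j redundant when SU(F_i, C) >= SU(F_j, C) and SU(F_i, F_j) >= SU(F_j, C). Features are assumed to be already discretized.

```python
import math
from collections import Counter


def entropy(values, weights=None):
    """Shannon entropy in bits of a discrete sequence, optionally sample-weighted."""
    if weights is None:
        weights = [1.0] * len(values)
    total = sum(weights)
    mass = Counter()
    for v, w in zip(values, weights):
        mass[v] += w
    return -sum((m / total) * math.log2(m / total) for m in mass.values() if m > 0)


def symmetric_uncertainty(x, y, weights=None):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)); 0 means independent, 1 means equivalent."""
    hx = entropy(x, weights)
    hy = entropy(y, weights)
    hxy = entropy(list(zip(x, y)), weights)
    if hx + hy == 0:
        return 0.0
    return 2.0 * (hx + hy - hxy) / (hx + hy)


def inverse_frequency_weights(labels):
    """ASSUMPTION (not from the paper): weight each sample by 1 / size of its class,
    so that every class contributes equal total mass to the entropy estimates."""
    counts = Counter(labels)
    return [1.0 / counts[c] for c in labels]


def select_features(features, labels, weights=None):
    """Rank features by SU with the class, then drop each feature for which an
    already-kept feature acts as an approximate Markov blanket."""
    relevance = {
        name: symmetric_uncertainty(col, labels, weights)
        for name, col in features.items()
    }
    ranked = sorted(
        (n for n in relevance if relevance[n] > 0),
        key=lambda n: relevance[n],
        reverse=True,
    )
    kept = []
    for name in ranked:
        # Kept features already have SU with the class >= relevance[name],
        # so only the feature-to-feature condition needs checking here.
        redundant = any(
            symmetric_uncertainty(features[k], features[name], weights) >= relevance[name]
            for k in kept
        )
        if not redundant:
            kept.append(name)
    return kept
```

Passing `weights=inverse_frequency_weights(labels)` to `select_features` makes a class with few flows count as much as a dominant class when relevance is estimated, which is one simple way to bias a metric towards minority classes; the weighting actually used by ELTCM may differ.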

