Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (6): 1709-1714. DOI: 10.11772/j.issn.1001-9081.2020091378

Special Issue: Cyberspace Security

• Cyber security •

Oversampling method for intrusion detection based on clustering and instance hardness

WANG Yao, SUN Guozi   

  1. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing Jiangsu 210023, China
  • Received: 2020-09-07 Revised: 2020-12-31 Online: 2021-06-10 Published: 2021-02-02
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61906099) and the Open Fund of Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Natural Resources (KF-2019-04-065).


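  • Corresponding author: SUN Guozi
  • About the authors: WANG Yao (born 1995), male, from Suqian, Jiangsu, M.S. candidate; research interests: cyberspace security, data mining. SUN Guozi (born 1972), male, from Tianchang, Anhui, professor, Ph.D.; research interests: cyberspace security, electronic data forensics.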

Abstract: To address the low detection rate of intrusion detection models caused by imbalanced network traffic data, a Clustering and instance Hardness-based Oversampling method for intrusion detection (CHO) was proposed. Firstly, the hardness value of each minority class sample was measured as the proportion of majority class samples among its nearest neighbors, and these values were taken as input. Secondly, the Canopy clustering approach was used to pre-cluster the minority data, and the resulting number of clusters was taken as the clustering parameter of the K-means++ clustering approach to cluster the data again. Then, the average hardness and the standard deviation of each cluster were calculated, the average hardness was taken as the "investigation cost" in the optimum allocation theory of statistics, and the amount of data to be generated in each cluster was determined by this theory. Finally, the "safe" regions within the clusters were identified according to the hardness values, and the specified amount of data was generated in these safe regions by interpolation. Comparative experiments against methods such as Synthetic Minority Oversampling TEchnique (SMOTE) were carried out on 6 public intrusion detection datasets. The proposed method achieved the optimal value of 1.33 on both Area Under Curve (AUC) and Geometric mean (G-mean), and its AUC was 1.6 percentage points higher on average than that of SMOTE on 4 of the 6 datasets. The experimental results show that the proposed method is well suited to imbalance problems in intrusion detection.
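The first and third steps of the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the brute-force neighbor search, the default `k`, and the cost floor are assumptions made for illustration, and the allocation rule shown is the textbook optimum allocation for stratified sampling (n_h proportional to N_h * S_h / sqrt(c_h)), which may differ in detail from the formula used in the paper.

```python
import math


def instance_hardness(X, y, minority_label, k=5):
    """Hardness of each minority sample: the fraction of majority-class
    points among its k nearest neighbours (brute-force search)."""
    hardness = {}
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority_label:
            continue
        # Distances from sample i to every other sample, nearest first.
        dists = sorted((math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        neighbours = [j for _, j in dists[:k]]
        majority = sum(1 for j in neighbours if y[j] != minority_label)
        hardness[i] = majority / len(neighbours)
    return hardness


def optimum_allocation(total, sizes, stds, costs, cost_floor=1e-6):
    """Textbook optimum allocation: the share of each cluster is
    proportional to N_h * S_h / sqrt(c_h).

    sizes: cluster sizes N_h; stds: per-cluster standard deviations S_h;
    costs: per-cluster "investigation cost" c_h (here, average hardness).
    cost_floor is an assumption that avoids division by zero.
    Rounded counts may not sum exactly to `total`.
    """
    weights = [
        n * s / math.sqrt(max(c, cost_floor))
        for n, s, c in zip(sizes, stds, costs)
    ]
    weight_sum = sum(weights)
    return [round(total * w / weight_sum) for w in weights]
```

Under this rule, a cluster with higher average hardness (higher "cost") receives fewer synthetic samples, while larger and more dispersed clusters receive more. The clustering and interpolation steps are omitted here.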

Key words: intrusion detection, imbalanced learning, oversampling method, instance hardness, optimum allocation

