Journal of Computer Applications, 2023, Vol. 43, Issue (11): 3510-3516. DOI: 10.11772/j.issn.1001-9081.2022111738

• Cyber security

Unsupervised log anomaly detection model based on CNN and Bi-LSTM

Chunyong YIN, Yangchun ZHANG

  1. School of Computer Science, Nanjing University of Information Science and Technology, Nanjing, Jiangsu 210044, China
  • Received: 2022-11-22 Revised: 2023-03-19 Accepted: 2023-03-23 Online: 2023-04-10 Published: 2023-11-10
  • Contact: Chunyong YIN
  • About the authors: YIN Chunyong, born in 1977, Ph.D., professor. His research interests include cyberspace security, big data mining, privacy protection, artificial intelligence, and new computing.
    ZHANG Yangchun, born in 1999, M.S. candidate. Her research interests include anomaly detection, deep learning, and log analysis.

基于CNN和Bi-LSTM的无监督日志异常检测模型

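尹春勇 (Chunyong YIN), 张杨春 (Yangchun ZHANG)

  1. School of Computer Science and School of Cyberspace Security, Nanjing University of Information Science and Technology, Nanjing 210044
  • Corresponding author: 尹春勇 (Chunyong YIN)
  • About the authors: 尹春勇 (YIN Chunyong, born 1977), male, from Weifang, Shandong; professor, doctoral supervisor, Ph.D. Main research interests: cyberspace security, big data mining, privacy protection, artificial intelligence, and new computing. Email: yinchunyong@hotmail.com
    张杨春 (ZHANG Yangchun, born 1999), female, from Nantong, Jiangsu; M.S. candidate. Main research interests: anomaly detection, deep learning, and log analysis.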

Abstract:

Logs record the specific status of a system during operation, and automated log anomaly detection is critical to network security. To address the low anomaly detection accuracy caused by log statements evolving over time, an unsupervised log anomaly detection model, LogCL, was proposed. Firstly, log parsing was used to convert semi-structured log data into structured log templates. Secondly, sessions and fixed windows were employed to divide log events into log sequences. Thirdly, quantitative features of the log sequences were extracted, natural language processing techniques were used to extract semantic features of the log templates, and the Term Frequency-Inverse Word Frequency (TF-IWF) algorithm was used to generate weighted sentence embedding vectors. Finally, the feature vectors were fed into a parallel model based on a Convolutional Neural Network (CNN) and a Bi-directional Long Short-Term Memory (Bi-LSTM) network for detection. Experimental results on two public real-world datasets show that the proposed model improves the anomaly detection F1-score by 3.6 and 2.3 percentage points, respectively, compared with the baseline model LogAnomaly. LogCL can therefore detect log anomalies effectively.

Key words: anomaly detection, deep learning, log analysis, word embedding, Convolutional Neural Network (CNN), Bi-directional Long Short-Term Memory (Bi-LSTM) network
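
The feature-extraction steps named in the abstract (fixed-window grouping, quantitative features, and TF-IWF weighted sentence embeddings) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no formulas, so the sketch assumes that quantitative features are per-window event counts, that TF is a word's frequency within one template, that IWF is the logarithm of the total word count over the word's corpus-wide count, and that the sentence vector is a weighted sum of word vectors. All function names and the toy word vectors are hypothetical.

```python
import math
from collections import Counter


def fixed_windows(events, size):
    """Split a list of log events into consecutive, non-overlapping windows."""
    return [events[i:i + size] for i in range(0, len(events), size)]


def count_vector(window, event_ids):
    """Assumed quantitative feature: how often each known event occurs in a window."""
    counts = Counter(window)
    return [counts[e] for e in event_ids]


def tf_iwf_weights(templates):
    """Return one {word: weight} dict per template.

    Assumed definitions (not taken from the paper):
      TF  = occurrences of the word in the template / words in the template
      IWF = log(total words in all templates / occurrences of the word overall)
    """
    tokenized = [t.lower().split() for t in templates]
    corpus_counts = Counter(w for tokens in tokenized for w in tokens)
    total_words = sum(corpus_counts.values())
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            w: (c / len(tokens)) * math.log(total_words / corpus_counts[w])
            for w, c in tf.items()
        })
    return weights


def sentence_embedding(weights, word_vectors):
    """Weighted sum of word vectors; words without a vector are skipped."""
    dim = len(next(iter(word_vectors.values())))
    embedding = [0.0] * dim
    for word, weight in weights.items():
        vector = word_vectors.get(word)
        if vector is None:
            continue
        for k in range(dim):
            embedding[k] += weight * vector[k]
    return embedding


# Toy usage with made-up templates and two-dimensional word vectors.
templates = ["Receiving block from", "Deleting block file"]
toy_vectors = {w: [float(len(w)), 1.0]
               for t in templates for w in t.lower().split()}
for template, w in zip(templates, tf_iwf_weights(templates)):
    print(template, sentence_embedding(w, toy_vectors))
print(count_vector(["E1", "E2", "E1"], ["E1", "E2", "E3"]))
```

Under these assumptions, a word that appears in many templates (such as "block" above) receives a smaller weight than a word that is rare across the corpus, so the sentence vector is dominated by the more distinctive words. In a real pipeline, the toy vectors would be replaced by pretrained word embeddings, and the resulting sequence and semantic features would be passed to the CNN and Bi-LSTM branches.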


CLC Number: