Unsupervised log anomaly detection model based on CNN and Bi-LSTM

doi:10.11772/j.issn.1001-9081.2022111738

Journal of Computer Applications, 2023, Vol. 43, Issue (11): 3510-3516. DOI: 10.11772/j.issn.1001-9081.2022111738

Special topic: Cyberspace security

Chunyong YIN, Yangchun ZHANG

School of Computer Science, School of Cyberspace Security, Nanjing University of Information Science and Technology, Nanjing Jiangsu 210044, China

Received: 2022-11-22 Revised: 2023-03-19 Accepted: 2023-03-23 Online: 2023-04-10 Published: 2023-11-10
Contact: Chunyong YIN
About the authors: YIN Chunyong, born in 1977 in Weifang, Shandong, Ph. D., professor, doctoral supervisor. His research interests include cyberspace security, big data mining, privacy protection, artificial intelligence, and new computing. yinchunyong@hotmail.com
ZHANG Yangchun, born in 1999 in Nantong, Jiangsu, M. S. candidate. Her research interests include anomaly detection, deep learning, and log analysis.

Abstract:

Logs record the specific state of a system at runtime, and automated log anomaly detection is critical to network security. To address the low anomaly detection accuracy caused by the evolution of log statements over time, an unsupervised log anomaly detection model named LogCL is proposed. First, log parsing converts semi-structured log data into structured log templates. Second, sessions and fixed windows are used to divide log events into log sequences. Third, count features of the log sequences are extracted, natural language processing techniques are used to extract semantic features from the log templates, and the Term Frequency-Inverse Word Frequency (TF-IWF) algorithm is used to generate weighted sentence embedding vectors. Finally, the feature vectors are fed into a parallel model based on a Convolutional Neural Network (CNN) and a Bi-directional Long Short-Term Memory (Bi-LSTM) network for detection. Experimental results on two public real-world datasets show that, compared with the baseline model LogAnomaly, the proposed model improves the anomaly detection F1-score by 3.6 and 2.3 percentage points respectively. LogCL can therefore detect anomalies in log data effectively.

Key words: anomaly detection, deep learning, log analysis, word embedding, Convolutional Neural Network (CNN), Bi-directional Long Short-Term Memory (Bi-LSTM) network
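
The weighting step described in the abstract can be illustrated with a small sketch. It assumes one common TF-IWF formulation (term frequency within a template multiplied by the logarithm of the total corpus word count over the word's corpus count); the exact formula used by LogCL is not given on this page. The function names `tf_iwf_weights` and `sentence_embedding`, the sample templates, and the two-dimensional word vectors are all hypothetical.

```python
import math
from collections import Counter


def tf_iwf_weights(templates):
    """Return one {word: TF-IWF weight} dict per log template.

    Assumed formulation (not confirmed by the source):
      TF  = count of the word in the template / words in the template
      IWF = log(total words in the corpus / count of the word in the corpus)
    """
    tokenized = [t.lower().split() for t in templates]
    corpus_counts = Counter(w for tokens in tokenized for w in tokens)
    total_words = sum(corpus_counts.values())
    weights = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(total_words / corpus_counts[word])
            for word, count in term_counts.items()
        })
    return weights


def sentence_embedding(word_weights, word_vectors, dim):
    """Weighted sum of word vectors; words without a vector are skipped."""
    embedding = [0.0] * dim
    for word, weight in word_weights.items():
        vector = word_vectors.get(word)
        if vector is None:
            continue
        for k in range(dim):
            embedding[k] += weight * vector[k]
    return embedding


# Hypothetical templates and toy 2-dimensional word vectors.
templates = ["receiving block", "deleting block", "block served"]
word_vectors = {
    "receiving": [1.0, 0.0],
    "deleting": [0.0, 1.0],
    "served": [1.0, 1.0],
    "block": [0.5, 0.5],
}
weights = tf_iwf_weights(templates)
embeddings = [sentence_embedding(w, word_vectors, dim=2) for w in weights]
```

In this toy corpus, `block` occurs in every template and so receives a lower weight than `receiving`, which is the effect the weighting is meant to have on uninformative words.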

CLC number: TP391.1

Chunyong YIN, Yangchun ZHANG. Unsupervised log anomaly detection model based on CNN and Bi-LSTM[J]. Journal of Computer Applications, 2023, 43(11): 3510-3516.

Dataset	Collection time/d	Size/GB	Total templates	Training sequences	Log messages	Anomalies	Training templates
HDFS	2	1.490	30	5 000	1 175 629	16 838（blocks）	15
BGL	215	0.708	378	7 500	4 747 963	348 460（logs）	185
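
How log events become sequences, as outlined in the abstract, can be sketched as follows. This is a minimal illustration, assuming that HDFS events are grouped into sessions by a block identifier (suggested by the "blocks" unit in the table) and that fixed windows are consecutive and non-overlapping. The window size, the sample events, and the function names are hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict


def session_windows(events):
    """Group (session_id, template_id) pairs by session, keeping event order."""
    sessions = defaultdict(list)
    for session_id, template_id in events:
        sessions[session_id].append(template_id)
    return dict(sessions)


def fixed_windows(template_ids, size):
    """Split a stream of template ids into consecutive windows of `size` events."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [template_ids[i:i + size] for i in range(0, len(template_ids), size)]


def count_vector(sequence, num_templates):
    """Count feature: how often each template id occurs in one sequence."""
    counts = Counter(sequence)
    return [counts.get(t, 0) for t in range(num_templates)]


# Hypothetical parsed events: (block id, template id).
hdfs_events = [("blk_1", 0), ("blk_2", 1), ("blk_1", 2), ("blk_1", 2)]
by_session = session_windows(hdfs_events)

# Hypothetical stream of template ids without a session key.
bgl_stream = [0, 1, 2, 3, 4]
windows = fixed_windows(bgl_stream, size=2)

session_counts = count_vector(by_session["blk_1"], num_templates=4)
```

Session grouping fits logs that carry a natural key per task, while fixed windows are a fallback for streams without such a key.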

Method	HDFS precision	HDFS recall	HDFS F1-score	BGL precision	BGL recall	BGL F1-score
LogCluster	0.993	0.371	0.540	0.955	0.640	0.766
ADR	0.931	0.929	0.925	0.937	1.000	0.967
OES	0.978	0.974	0.976	0.932	0.981	0.956
DeepLog	0.953	0.961	0.957	0.900	0.960	0.929
LogAnomaly	0.960	0.940	0.950	0.970	0.940	0.960
LogBERT	0.870	0.781	0.823	0.894	0.923	0.908
CNN-BiLSTM	0.980	0.830	0.900	0.949	0.930	0.939
LogRobust	0.980	1.000	0.999	0.912	0.964	0.937
LogCL	0.986	0.987	0.986	0.973	0.993	0.983
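
The relationship between the columns, and the percentage-point gains quoted in the abstract, can be checked with a short calculation. The sketch below uses the standard definition F1 = 2PR / (P + R) and copies the LogCL and LogAnomaly values from the table above; the variable and function names are illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1-score) for LogCL, copied from the table.
logcl = {"HDFS": (0.986, 0.987, 0.986), "BGL": (0.973, 0.993, 0.983)}
# Reported F1-score for the baseline LogAnomaly, copied from the table.
loganomaly_f1 = {"HDFS": 0.950, "BGL": 0.960}

recomputed = {}
gains = {}
for dataset, (precision, recall, reported_f1) in logcl.items():
    # F1 recomputed from the rounded precision and recall in the table.
    recomputed[dataset] = round(f1_score(precision, recall), 3)
    # Improvement over LogAnomaly in percentage points.
    gains[dataset] = round((reported_f1 - loganomaly_f1[dataset]) * 100, 1)

print(recomputed)
print(gains)
```

The gains are differences between F1-scores expressed in percentage points, not relative improvements.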

Training samples/10³	Templates	New template ratio/%	Precision	Recall	F1-score
3	13	56.7	0.941	1.000	0.969
4	15	50.0	0.953	0.992	0.972
5	15	50.0	0.986	0.987	0.986
6	16	46.7	0.985	0.984	0.984
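
The new-template ratio in this table appears to be the share of templates not seen during training. The sketch below reproduces the column under that assumption, and under the further assumption that the table refers to HDFS, whose total of 30 templates is listed in the dataset table; the function name is hypothetical.

```python
def new_template_ratio(total_templates, training_templates):
    """Percentage of templates that do not appear in the training data."""
    unseen = total_templates - training_templates
    return 100.0 * unseen / total_templates


HDFS_TOTAL_TEMPLATES = 30  # from the dataset table (assumed to apply here)

# Training samples (thousands) -> templates seen in training, from the table.
templates_seen = {3: 13, 4: 15, 5: 15, 6: 16}

ratios = {
    samples: round(new_template_ratio(HDFS_TOTAL_TEMPLATES, seen), 1)
    for samples, seen in templates_seen.items()
}
print(ratios)
```

Under these assumptions the computed values match the table, so roughly half of the templates met at test time are unseen in training.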

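Model	Learning type	Detection method	Semantic representation
LogCluster	Unsupervised	Clustering	No
ADR	Unsupervised	IM	No
OES	Supervised	SVM	Yes
DeepLog	Unsupervised	LSTM	No
LogAnomaly	Unsupervised	LSTM	Yes
LogBERT	Self-supervised	BERT	No
CNN-BiLSTM	Supervised	CNN, BiLSTM	Yes
LogRobust	Supervised	Bi-LSTM, attention mechanism	Yes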

Model	Precision	Recall	F1-score
A	0.964	0.928	0.945
B	0.975	0.982	0.977
C	0.968	0.990	0.979
LogCL	0.986	0.987	0.986
