Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (11): 3510-3516. DOI: 10.11772/j.issn.1001-9081.2022111738
Special topic: Cyberspace security
Received: 2022-11-22
Revised: 2023-03-19
Accepted: 2023-03-23
Online: 2023-04-10
Published: 2023-11-10
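Corresponding author: Chunyong YIN
About the author: YIN Chunyong, born in 1977 in Weifang, Shandong, Ph.D., professor and doctoral supervisor. His research interests include cyberspace security, big data mining, privacy protection, artificial intelligence, and new computing. E-mail: yinchunyong@hotmail.com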
Chunyong YIN, Yangchun ZHANG
Abstract:
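Logs record the specific state of a system at runtime, and automated log anomaly detection is critical to network security. To address the low detection accuracy caused by log statements evolving over time, an unsupervised log anomaly detection model named LogCL is proposed. First, log parsing converts semi-structured log data into structured log templates. Second, session windows and fixed windows divide the log events into log sequences. Third, quantitative features of the log sequences are extracted, semantic features of the log templates are extracted with natural language processing techniques, and the Term Frequency-Inverse Word Frequency (TF-IWF) algorithm is used to generate weighted sentence embedding vectors. Finally, the feature vectors are fed into a parallel model based on a Convolutional Neural Network (CNN) and a Bi-directional Long Short-Term Memory (Bi-LSTM) network for detection. Experimental results on two public real-world datasets show that the proposed model improves the anomaly detection F1-score over the baseline model LogAnomaly by 3.6 and 2.3 percentage points, respectively. LogCL can therefore detect anomalies in log data effectively.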
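The windowing, count-feature, and TF-IWF steps named in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the TF-IWF form (term frequency within a template multiplied by the log of total corpus words over the word's corpus count) is an assumed common definition, and the event IDs, tokenized templates, 2-dimensional word vectors, and all function names are hypothetical.

```python
import math
from collections import Counter


def fixed_windows(events, size):
    """Split an ordered list of event IDs into consecutive fixed-size windows."""
    return [events[i:i + size] for i in range(0, len(events), size)]


def count_vector(window, template_ids):
    """Quantitative feature: occurrences of each template ID in one window."""
    counts = Counter(window)
    return [counts[t] for t in template_ids]


def tf_iwf(templates):
    """Assumed TF-IWF: TF = count in template / template length,
    IWF = log(total words in corpus / corpus count of the word)."""
    corpus = Counter(w for words in templates.values() for w in words)
    total = sum(corpus.values())
    return {
        tid: {w: (c / len(words)) * math.log(total / corpus[w])
              for w, c in Counter(words).items()}
        for tid, words in templates.items()
    }


def sentence_embedding(weights, word_vectors):
    """Weighted sum of word vectors for one template."""
    dim = len(next(iter(word_vectors.values())))
    vec = [0.0] * dim
    for word, weight in weights.items():
        for k, value in enumerate(word_vectors[word]):
            vec[k] += weight * value
    return vec


# Hypothetical toy data (not taken from the HDFS or BGL datasets).
templates = {
    "E1": ["receiving", "block"],
    "E2": ["deleting", "block"],
    "E3": ["served", "block"],
}
word_vectors = {
    "receiving": [1.0, 0.0],
    "deleting": [0.0, 1.0],
    "served": [1.0, 1.0],
    "block": [0.5, 0.5],
}
events = ["E1", "E2", "E1", "E3", "E2"]

windows = fixed_windows(events, 2)
count_features = [count_vector(w, sorted(templates)) for w in windows]
weights = tf_iwf(templates)
embeddings = {tid: sentence_embedding(weights[tid], word_vectors)
              for tid in templates}
```

In this toy corpus the word "block" occurs in every template, so its IWF is low and it contributes less to each sentence vector than the rarer words do.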
Cite this article: Chunyong YIN, Yangchun ZHANG. Unsupervised log anomaly detection model based on CNN and Bi-LSTM[J]. Journal of Computer Applications, 2023, 43(11): 3510-3516.
Dataset | Collection time/d | Size/GB | Total templates | Training sequences | Log messages | Anomalies | Training templates |
---|---|---|---|---|---|---|---|
HDFS | 2 | 1.490 | 30 | 5 000 | 1 175 629 | 16 838 (blocks) | 15 |
BGL | 215 | 0.708 | 378 | 7 500 | 4 747 963 | 348 460 (logs) | 185 |
Tab. 1 Statistics of two datasets
Model | Category | Detection method | Semantic representation |
---|---|---|---|
LogCluster | Unsupervised | Clustering | No |
ADR | Unsupervised | IM | No |
OES | Supervised | SVM | Yes |
DeepLog | Unsupervised | LSTM | No |
LogAnomaly | Unsupervised | LSTM | Yes |
LogBERT | Self-supervised | BERT | No |
CNN-BiLSTM | Supervised | CNN, BiLSTM | Yes |
LogRobust | Supervised | Bi-LSTM, attention mechanism | Yes |
Tab. 2 Details of baseline models
Method | HDFS Precision | HDFS Recall | HDFS F1-score | BGL Precision | BGL Recall | BGL F1-score |
---|---|---|---|---|---|---|
LogCluster | 0.993 | 0.371 | 0.540 | 0.955 | 0.640 | 0.766 |
ADR | 0.931 | 0.929 | 0.925 | 0.937 | 1.000 | 0.967 |
OES | 0.978 | 0.974 | 0.976 | 0.932 | 0.981 | 0.956 |
DeepLog | 0.953 | 0.961 | 0.957 | 0.900 | 0.960 | 0.929 |
LogAnomaly | 0.960 | 0.940 | 0.950 | 0.970 | 0.940 | 0.960 |
LogBERT | 0.870 | 0.781 | 0.823 | 0.894 | 0.923 | 0.908 |
CNN-BiLSTM | 0.980 | 0.830 | 0.900 | 0.949 | 0.930 | 0.939 |
LogRobust | 0.980 | 1.000 | 0.999 | 0.912 | 0.964 | 0.937 |
LogCL | 0.986 | 0.987 | 0.986 | 0.973 | 0.993 | 0.983 |
Tab. 3 Experimental results on HDFS and BGL datasets
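As a worked check of the headline numbers, the short sketch below recomputes the LogCL F1-score from the reported precision and recall, then derives the gain over LogAnomaly in percentage points. It assumes the standard definition F1 = 2PR/(P + R); the function and variable names are illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) reported for LogCL in Tab. 3.
logcl = {"HDFS": (0.986, 0.987), "BGL": (0.973, 0.993)}
# F1-score reported for the baseline LogAnomaly in Tab. 3.
loganomaly_f1 = {"HDFS": 0.950, "BGL": 0.960}

results = {}
for dataset, (p, r) in logcl.items():
    f1 = round(f1_score(p, r), 3)
    # Gain expressed in percentage points, as in the abstract.
    gain = round((f1 - loganomaly_f1[dataset]) * 100, 1)
    results[dataset] = (f1, gain)
    print(dataset, f1, gain)
```

The recomputed values agree with the 3.6 and 2.3 percentage-point improvements stated in the abstract.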
Training samples/10³ | Templates | Share of new templates/% | Precision | Recall | F1-score |
---|---|---|---|---|---|
3 | 13 | 56.7 | 0.941 | 1.000 | 0.969 |
4 | 15 | 50.0 | 0.953 | 0.992 | 0.972 |
5 | 15 | 50.0 | 0.986 | 0.987 | 0.986 |
6 | 16 | 46.7 | 0.985 | 0.984 | 0.984 |
Tab. 4 Evaluation results of new logs
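The share of new templates in Tab. 4 appears consistent with the fraction of the 30 HDFS templates (Tab. 1) that were not seen during training. That reading is an inference from the numbers rather than something stated here, and the sketch below simply reproduces the arithmetic under that assumption; the names are illustrative.

```python
TOTAL_HDFS_TEMPLATES = 30  # "Total templates" for HDFS in Tab. 1


def new_template_share(trained_templates, total=TOTAL_HDFS_TEMPLATES):
    """Percentage of templates absent from the training set (assumed definition)."""
    return round((total - trained_templates) / total * 100, 1)


# Template counts seen in training, taken from Tab. 4.
shares = [new_template_share(n) for n in (13, 15, 15, 16)]
print(shares)
```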
Model | Precision | Recall | F1-score |
---|---|---|---|
A | 0.964 | 0.928 | 0.945 |
B | 0.975 | 0.982 | 0.977 |
C | 0.968 | 0.990 | 0.979 |
LogCL | 0.986 | 0.987 | 0.986 |
Tab. 5 Results of ablation experiments