Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (2): 352-356. DOI: 10.11772/j.issn.1001-9081.2017071786


Design and implementation of log parsing system based on machine learning

ZHONG Ya1,2, GUO Yuanbo1,2   

  1. Cyberspace Security College, Information Engineering University, Zhengzhou Henan 450001, China;
  2. State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou Henan 450001, China
  • Received: 2017-07-20 Revised: 2017-09-05 Online: 2018-02-10 Published: 2018-02-10
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61501515).

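  • Corresponding author: GUO Yuanbo
  • About the authors: ZHONG Ya (born 1995), female, from Yueyang, Hunan, is a master's student whose main research interest is information security. GUO Yuanbo (born 1975), male, from Zhouzhi, Shaanxi, is a professor and doctoral supervisor with a Ph.D. and a senior member of CCF; his main research interest is network attack and defense.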

Abstract: Existing log classification methods apply only to formatted logs, and their performance depends heavily on the structure of the log. To address this, the log parsing algorithm LogSig (Log Signature) was extended and improved using machine learning, and a log parsing system that integrates data processing and result analysis was designed and implemented. The system covers raw data preprocessing, log parsing, clustering analysis and evaluation, and scatter-plot display of the clustering results. It was tested on the open source firewall log data set from the VAST 2011 challenge. The experimental results show that the improved algorithm reaches an average accuracy of more than 85% when classifying log events; compared with the original LogSig algorithm, log parsing accuracy is improved by 50% and parsing time is only 25% of the original. The proposed algorithm can therefore parse multi-source unstructured log data efficiently and accurately in big data environments.

Key words: log parsing, machine learning, clustering, anomaly detection, LogSig (Log Signature) algorithm

