Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (1): 196-203. DOI: 10.11772/j.issn.1001-9081.2024010054

• Computer software technology •

Causal intervention-based root cause analysis method for microservice system faults

Jianli DING¹, Yufeng HE¹, Jing WANG²,³

  1. College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China
  2. College of Safety Science and Engineering, Civil Aviation University of China, Tianjin 300300, China
  3. Information Security Evaluation Center, Civil Aviation University of China, Tianjin 300300, China
  • Received: 2024-01-19  Revised: 2024-03-21  Accepted: 2024-03-21  Online: 2024-05-09  Published: 2025-01-10
  • Contact: Jing WANG
  • About the authors: DING Jianli, born in 1963, a native of Luoyang, Henan, Ph.D., professor, CCF member. His research interests include proactive disaster recovery of civil aviation information systems and civil aviation big data.
    HE Yufeng, born in 1996, a native of Chengdu, Sichuan, M.S. candidate. His research interests include intelligent operation and maintenance and root cause localization.
  • Supported by:
    National Natural Science Foundation of China (U2033205)

Abstract:

Existing fault root cause analysis methods lose causal relationships, analyze inefficiently in complex environments, and cannot handle fault types that are not reflected in machine indicators. To address these problems, a Causal Intervention-based Microservice system Fault Root Cause Analysis (CIMF-RCA) method is proposed. Firstly, call chains and microservices are filtered using the Markov assumption and call patterns, which shrinks the search space for intervention recognition and improves the efficiency of root cause analysis in complex environments. Secondly, unstructured log data are parsed and fused with machine indicator data so that the two can be analyzed jointly. Finally, a Causal Bayesian Network (CBN) and intervention data are introduced, and an improved intervention recognition algorithm and a divide-and-conquer approach to fault root cause analysis are proposed. Experimental results on Train-Ticket, a large-scale microservice benchmark platform, show that, compared with Root Cause Discovery (RCD), the best-performing of the compared methods, CIMF-RCA improves the average Top-5 accuracy by 26.33 percentage points and reduces the required time by 41.61%. For non-machine-indicator fault types, which RCD cannot identify, CIMF-RCA achieves a Top-5 accuracy of 77.00%. These results indicate that the proposed method can effectively analyze the root causes of faults in microservice systems.

Key words: microservice system, root cause analysis, intervention recognition, causal structure discovery, data fusion
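
The abstract reports results as Top-5 accuracy. The excerpt does not give the paper's exact definition, so the sketch below assumes the common one: the fraction of fault cases in which the true root cause appears among the first k ranked candidates. The function name `top_k_accuracy`, the service names, and the toy rankings are all hypothetical and are not taken from the paper or from Train-Ticket.

```python
from typing import Sequence


def top_k_accuracy(
    rankings: Sequence[Sequence[str]],
    true_causes: Sequence[str],
    k: int = 5,
) -> float:
    """Fraction of fault cases whose true root cause is within the first k candidates.

    Assumed definition (not confirmed by the source): each fault case has one
    true root cause, and each ranking lists candidates from most to least suspicious.
    """
    if len(rankings) != len(true_causes):
        raise ValueError("rankings and true_causes must have the same length")
    if not rankings:
        raise ValueError("at least one fault case is required")
    hits = sum(
        1 for ranked, truth in zip(rankings, true_causes) if truth in ranked[:k]
    )
    return hits / len(rankings)


# Hypothetical toy data: four fault cases, each with a ranked candidate list.
rankings = [
    ["svc-a", "svc-b", "svc-c"],
    ["svc-d", "svc-a", "svc-b"],
    ["svc-c", "svc-b", "svc-a"],
    ["svc-b", "svc-c", "svc-d"],
]
true_causes = ["svc-a", "svc-b", "svc-a", "svc-a"]

print(top_k_accuracy(rankings, true_causes, k=1))  # 0.25
print(top_k_accuracy(rankings, true_causes, k=3))  # 0.75
```

Under this definition, an improvement of 26.33 percentage points is an absolute difference between two such fractions, whereas the 41.61% reduction in time is a relative change.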

CLC Number: