Journal of Computer Applications (计算机应用) ›› 2015, Vol. 35 ›› Issue (10): 2784-2788. DOI: 10.11772/j.issn.1001-9081.2015.10.2784

• Papers of the 15th China Conference on Machine Learning (CCML2015) •

Real-time fault-tolerant technology for Hadoop based on heartbeat expired time mechanism

GUAN Guodong1, TENG Fei1,2, YANG Yan1

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu Sichuan 610031, China;
  2. State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing Jiangsu 210023, China
  • Received: 2015-06-16 Revised: 2015-07-04 Online: 2015-10-10 Published: 2015-10-14
  • Corresponding author: TENG Fei (born 1984), female, from Tai'an, Shandong; lecturer, Ph.D.; research interests: cloud computing, parallel computing, workflow scheduling; fteng@swjtu.edu.cn
  • About the authors: GUAN Guodong (born 1991), male, from Qianjiang, Hubei; M.S. candidate, CCF member; research interest: fault tolerance in cloud computing. YANG Yan (born 1964), female, from Hefei, Anhui; professor, doctoral supervisor, Ph.D., CCF senior member; research interests: data mining, computational intelligence, ensemble learning.
  • Supported by:
    National Natural Science Foundation of China (61202043, 61170111); Open Research Fund of the Key Laboratory of Network Intelligent Information Processing of Sichuan Provincial Universities (SZJJ2014-049).

Abstract: The node heartbeat timeout fault-tolerance mechanism shipped with official Hadoop is poorly suited to short jobs, and it ignores fairness in how expiry times are set across the nodes of a heterogeneous cluster. To address these problems, a fair heartbeat expiry-time fault-tolerance mechanism is proposed. First, a node-failure misjudgment loss model is built from the reliability and computing performance of each node, and a Fair MisJudgment Loss (FMJL) algorithm is proposed that meets the requirements of both long and short jobs. Then, a fair expiry-time mechanism based on the FMJL algorithm is designed and implemented. When a short job of about 345 s was run on Hadoop with the fair expiry-time mechanism and a TaskTracker node failed, job completion time was reduced by about 44% on average; compared with the self-adaptive expiry-time mechanism, completion time was reduced by about 23%. The experimental results show that the fair expiry-time mechanism shortens fault-handling time for short jobs without affecting the completion time of long jobs, and improves the real-time processing efficiency of a heterogeneous Hadoop cluster.
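
As a rough illustration of the heartbeat expiry idea the abstract builds on, the following Python sketch records the last heartbeat from each node and reports a node as failed once its own expiry interval has elapsed. It is not the FMJL algorithm and not Hadoop's JobTracker code: the class `HeartbeatMonitor`, the node names, and the per-node expiry values are all hypothetical. The abstract does not state how FMJL derives each node's expiry time, so here the values are simply supplied by the caller.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Toy per-node heartbeat expiry tracker (hypothetical, not Hadoop code)."""

    # Per-node expiry interval in seconds; how to choose these is left open.
    expiry_s: Dict[str, float]
    # Timestamp (seconds) of the last heartbeat seen from each node.
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        """Record a heartbeat from `node` at time `now`."""
        if node not in self.expiry_s:
            raise KeyError(f"unknown node: {node}")
        self.last_seen[node] = now

    def expired(self, now: float) -> List[str]:
        """Return nodes whose last heartbeat is older than their own interval."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.expiry_s[node]
        )


# Usage: two nodes with different (made-up) expiry intervals.
monitor = HeartbeatMonitor(expiry_s={"fast-node": 30.0, "slow-node": 120.0})
monitor.heartbeat("fast-node", now=0.0)
monitor.heartbeat("slow-node", now=0.0)
print(monitor.expired(now=60.0))   # only fast-node has exceeded its interval
print(monitor.expired(now=200.0))  # both nodes have exceeded their intervals
```

Keeping one interval per node, rather than a single cluster-wide value, is the point at which a policy such as FMJL could plug in its own per-node expiry times.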

Key words: cloud computing, heartbeat mechanism, fault tolerance, heterogeneous cluster, real-time performance

CLC number: