Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (8): 2396-2405. DOI: 10.11772/j.issn.1001-9081.2020101566

Special Issue: The 8th CCF Conference on Big Data (CCF Bigdata 2020)

• CCF Bigdata 2020 •

Low-latency cluster scheduling framework for large-scale short-time tasks

ZHAO Quan, TANG Xiaochun, ZHU Ziyu, MAO Anqi, LI Zhanhuai   

  1. School of Computer Science, Northwestern Polytechnical University, Xi'an Shaanxi 710129, China
  • Received: 2020-10-12  Revised: 2020-12-11  Online: 2021-08-10  Published: 2021-01-27
  • Supported by:
    This work is partially supported by the National Key Research and Development Program of China (2018YFB1003400).

大规模短时间任务的低延迟集群调度框架

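  • Corresponding author: TANG Xiaochun
  • About the authors: ZHAO Quan (赵全, born 1997), male, from Hengshui, Hebei, is a master's student whose main research interest is cluster resource management; TANG Xiaochun (汤小春, born 1969), male, from Hanzhong, Shaanxi, is an associate professor with a Ph.D. whose main research interests are graph data management, distributed computing, and cluster resource management; ZHU Ziyu (朱紫钰, born 1996), female, from Handan, Hebei, is a master's student whose main research interest is cluster resource management; MAO Anqi (毛安琪, born 1996), female, from Luoyang, Henan, is a master's student whose main research interest is cluster resource management; LI Zhanhuai (李战怀, born 1961), male, from Xianyang, Shaanxi, is a professor with a Ph.D. and a CCF member whose main research interests are massive data management and big data computing.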

Abstract: Large-scale data analysis environments often contain tasks with short durations and high parallelism. Scheduling these concurrent jobs under low-latency requirements is an active research topic. In some existing cluster resource management frameworks, centralized schedulers cannot meet the low-latency requirement because the master node becomes a bottleneck, while some distributed schedulers achieve low-latency task scheduling but fall short in optimal resource allocation and in handling resource allocation conflicts. To address the needs of large-scale real-time jobs, a distributed cluster resource scheduling framework was designed and implemented to meet the low-latency requirement of large-scale data processing. First, a two-stage scheduling framework and an optimized two-stage multi-path scheduling framework were proposed. Second, to address resource conflicts in two-stage multi-path scheduling, a task transfer mechanism based on load balancing was proposed to resolve load imbalance among computing nodes. Finally, the task scheduling framework for large-scale clusters was simulated and verified using an actual workload and a simulated scheduler. For the actual workload, the scheduling delay of the proposed framework stays within 12% of that of ideal scheduling. In the simulated environment, the framework reduces the delay of short-time tasks by more than 40% compared with the centralized scheduler.

Key words: low-latency, distributed scheduling, two-stage scheduling, load balancing, greedy scheduling

CLC Number: