计算机应用 ›› 2018, Vol. 38 ›› Issue (3): 699-706.DOI: 10.11772/j.issn.1001-9081.2017082125

• 先进计算 • 上一篇    下一篇

Storm环境下基于权重的任务调度算法

鲁亮1, 于炯1,2, 卞琛2, 英昌甜2,3, 师康利2, 蒲勇霖2   

  1. 1. 新疆大学 信息科学与工程学院, 乌鲁木齐 830046;
    2. 新疆大学 软件学院, 乌鲁木齐 830008;
    3. 新疆大学 电气工程学科博士后科研流动站, 乌鲁木齐 830047
  • 收稿日期:2017-08-31 修回日期:2017-11-07 出版日期:2018-03-10 发布日期:2018-03-07
  • 通讯作者: 鲁亮
  • 作者简介:鲁亮(1990-),男,湖南湘潭人,博士研究生,CCF会员,主要研究方向:分布式计算、内存计算;于炯(1964-),男,北京人,教授,博士生导师,博士,CCF会员,主要研究方向:网格计算、分布式计算;卞琛(1981-),男,江苏南京人,副教授,博士,CCF会员,主要研究方向:分布式计算、内存计算;英昌甜(1989-),女,新疆乌鲁木齐人,博士,主要研究方向:内存计算、绿色存储;师康利(1990-),女,河南许昌人,硕士研究生,主要研究方向:分布式计算、内存计算;蒲勇霖(1991-),男,山东淄博人,硕士研究生,主要研究方向:内存计算、绿色计算。
  • 基金资助:
    国家自然科学基金资助项目(61462079,61562086);新疆维吾尔自治区自然科学基金资助项目(2017D01A20);新疆维吾尔自治区高校科研计划项目(XJEDU2016S106);新疆维吾尔自治区研究生科研创新项目(XJGRI2016028)。

Task scheduling algorithm based on weight in Storm

LU Liang1, YU Jiong1,2, BIAN Chen2, YING Changtian2,3, SHI Kangli2, PU Yonglin2   

  1. 1. School of Information Science and Engineering, Xinjiang University, Urumqi Xinjiang 830046, China;
    2. School of Software, Xinjiang University, Urumqi Xinjiang 830008, China;
    3. Postdoctoral Research Station of Electrical Engineering, Xinjiang University, Urumqi Xinjiang 830047, China
  • Received:2017-08-31 Revised:2017-11-07 Online:2018-03-10 Published:2018-03-07
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61462079, 61562086), the Natural Science Foundation of Xinjiang Uygur Autonomous Region of China (2017D01A20), the Educational Research Program of Xinjiang Uygur Autonomous Region of China (XJEDU2016S106), the Graduate Research and Innovation Project of Xinjiang Uygur Autonomous Region of China (XJGRI2016028).

摘要: 大数据流式计算平台Apache Storm默认采用轮询的方式进行任务调度,未考虑到拓扑中各任务计算开销的差异以及任务之间不同类型的通信模式,在负载均衡和通信开销方面存在较大的优化空间。针对这一问题,提出一种Storm环境下基于权重的任务调度算法(TSAW-Storm)。该算法首先根据各任务的CPU资源占用情况以及任务间的数据流大小,分别确定拓扑的点权和边权;并利用最大化边权增益的思想,逐步构建起各工作节点中承载的任务集合,在保证集群负载均衡的同时,尽可能将边权较大的节点间数据流转化为节点内数据流,从而降低网络传输开销。实验结果表明,在包含有8个工作节点的WordCount基准测试中,TSAW-Storm的系统延迟和节点间数据流大小相比Storm默认调度算法分别降低了30.0%和32.9%,且各工作节点的CPU负载标准差仅为Storm默认调度算法的25.8%;此外,在与在线调度算法的对比实验中,TSAW-Storm在系统延迟、节点间数据流大小和CPU负载标准差方面分别降低了7.76%、11.8%和5.93%,且算法的执行开销明显降低,有效提高了Storm系统的运行效率。

关键词: 大数据, 流式计算, Storm, 权重, 任务调度, 负载均衡, 通信开销

Abstract: Apache Storm, a typical platform for big data stream computing, uses a round-robin scheduling algorithm as the default scheduler, which does not consider the fact that differences of computational and communication cost are ubiquitous among different tasks and different data streams in one topology. Hence optimization is needed in terms of load balance and communication cost. To solve this problem, a Task Scheduling Algorithm based on Weight in Storm (TSAW-Storm) was proposed. In the algorithm, CPU occupation was taken as the weight of a task in a specific topology, and similarly tuple rate between a pair of tasks was taken as the weight of a data stream. Then tasks were assigned to the most suitable work node gradually by maximizing the gain of weight of data streams via transforming inter-node data streams into intra-node ones as many as possible with load balance ensured in order to reduce network overhead. Experimental results show that TSAW-Storm can reduce latency and inter-node tuple rate by about 30.0% and 32.9% respectively, and standard deviation of CPU load of work nodes is only 25.8% when compared to Storm default scheduling algorithm in WordCount benchmark with 8 work nodes. Additionally, online scheduler is deployed in contrast experiment. Experimental results show that TSAW-Storm can reduce latency, inter-node tuple rate and standard deviation of CPU load by about 7.76%, 11.8% and 5.93% respectively, which needs only a bit of executive overhead compared to online scheduler. Therefore, the proposed algorithm can reduce communication cost as well as improve load balance effectively, which makes a great contribution to the efficient operation of Apache Storm.

Key words: big data, stream computing, Storm, weight, task scheduling, load balancing, communication cost

中图分类号: