计算机应用 ›› 2018, Vol. 38 ›› Issue (9): 2560-2567.DOI: 10.11772/j.issn.1001-9081.2017122910

• 先进计算 • 上一篇    下一篇

基于流网络的流式计算动态任务调度策略

李梓杨1, 于炯1,2, 卞琛2, 鲁亮1, 蒲勇霖2   

  1. 1. 新疆大学 信息科学与工程学院, 乌鲁木齐 830046;
    2. 新疆大学 软件学院, 乌鲁木齐 830008
  • 收稿日期:2017-12-13 修回日期:2018-02-07 出版日期:2018-09-10 发布日期:2018-09-06
  • 通讯作者: 于炯
  • 作者简介:李梓杨(1993—),男,新疆乌鲁木齐人,博士研究生,CCF会员,主要研究方向:云计算、分布式计算;于炯(1964—),男,北京人,教授,博士生导师,博士,CCF高级会员,主要研究方向:网络安全、网格计算、分布式计算;卞琛(1981—),男,江苏南京人,副教授,博士,CCF会员,主要研究方向:网络计算、分布式系统;鲁亮(1990—),男,新疆乌鲁木齐人,博士研究生,CCF会员,主要研究方向:云计算、分布式计算、内存计算;蒲勇霖(1991—),男,山东淄博人,硕士研究生,CCF会员,主要研究方向:绿色计算、分布式计算。
  • 基金资助:
    国家自然科学基金资助项目(61262088,61462079,61562086,61363083);国家科技部科技支撑项目(2015BAH02F01);新疆维吾尔自治区自然科学基金资助项目(2017D01A20);新疆维吾尔自治区高校科研计划项目(XJEDU2016S106)。

Dynamic task dispatching strategy for stream processing based on flow network

LI Ziyang1, YU Jiong1,2, BIAN Chen2, LU Liang1, PU Yonglin2   

  1. 1. School of Information Science and Engineering, Xinjiang University, Urumqi Xinjiang 830046, China;
    2. School of Software, Xinjiang University, Urumqi Xinjiang 830008, China
  • Received:2017-12-13 Revised:2018-02-07 Online:2018-09-10 Published:2018-09-06
  • Contact: 于炯
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61262088, 61462079, 61562086, 61363083), the Science and Technology Support Project of Ministry of National Science and Technology (2015BAH02F01), the Natural Science Foundation of Xinjiang Uygur Autonomous Region of China (2017D01A20), the Educational Research Program of Xinjiang Uygur Autonomous Region (XJEDU2016S106).

摘要: 针对大数据流式计算平台中输入数据流速急剧上升所导致的计算延迟升高问题,提出了基于流网络模型的动态调度策略,并将其应用于Flink数据流计算平台。首先,通过定义有向无环图(DAG)中每条边的容量和流量将其转化为流网络模型,并通过容量检测算法确定每条边的容量值;然后,通过最大流算法计算对应的增进网络和优化路径,从而在输入速率上升阶段提升集群的吞吐量,并通过评估时空代价论证了算法的可行性;最后,讨论了重要参数对算法执行效果的影响,并通过实验得出了在不同类型的作业中推荐的参数取值。经实验验证得出:所提算法与Flink平台现有的任务调度策略相比,在输入速率上升阶段对不同作业类型中集群吞吐量的优化比均高于16.12%。实验结果表明动态调度策略在满足任务延迟约束的前提下有效提高了集群的吞吐量。

关键词: 数据流, 任务调度, 流网络, 最大流, Apache Flink

Abstract: Concerning the problem that sharp increase of data input rate leads to the rising of computing latency which influences the real-time of computing in big data stream processing platform, a dynamic dispatching strategy based on flow network was proposed and applied to a data stream processing platform named Apache Flink. Firstly, a Directed Acyclic Graph (DAG) was transformed to a flow network by defining the capacity and flow of every edge and a capacity detection algorithm was used to ascertain the capacity value of every edge. Secondly, a maximum flow algorithm was used to acquire the improved network and the optimization path in order to promote the throughput of cluster when the data input rate is increasing; meanwhile the feasibility of the algorithm was proved by evaluating its time-space complexity. Finally, the influence of an important parameter on the algorithm execution was discussed and recommended parameter values of different types of jobs were obtained by experiments. The experimental results show that the throughput promoting rate of the strategy is higher than 16.12% during the increasing phases of the data input rate in different types of benchmarks compared with the original dispatching strategy of Apache Flink, so the dynamic dispatching strategy efficiently promotes the throughput of cluster under the premise of task latency constraint.

Key words: data stream, task scheduling, flow network, maximum flow, Apache Flink

中图分类号: