Dynamic task dispatching strategy for stream processing based on flow network

Journal of Computer Applications, 2018, Vol. 38, Issue (9): 2560-2567. DOI: 10.11772/j.issn.1001-9081.2017122910

LI Ziyang¹, YU Jiong¹,², BIAN Chen², LU Liang¹, PU Yonglin²

1. School of Information Science and Engineering, Xinjiang University, Urumqi Xinjiang 830046, China;
2. School of Software, Xinjiang University, Urumqi Xinjiang 830008, China

Received: 2017-12-13 Revised: 2018-02-07 Online: 2018-09-10 Published: 2018-09-06
Corresponding author: YU Jiong
About the authors: LI Ziyang (born 1993), male, from Urumqi, Xinjiang; Ph.D. candidate, CCF member; research interests: cloud computing, distributed computing. YU Jiong (born 1964), male, from Beijing; Ph.D., professor, doctoral supervisor, CCF senior member; research interests: network security, grid computing, distributed computing. BIAN Chen (born 1981), male, from Nanjing, Jiangsu; Ph.D., associate professor, CCF member; research interests: network computing, distributed systems. LU Liang (born 1990), male, from Urumqi, Xinjiang; Ph.D. candidate, CCF member; research interests: cloud computing, distributed computing, in-memory computing. PU Yonglin (born 1991), male, from Zibo, Shandong; M.S. candidate, CCF member; research interests: green computing, distributed computing.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61262088, 61462079, 61562086, 61363083), the Science and Technology Support Project of the Ministry of Science and Technology of China (2015BAH02F01), the Natural Science Foundation of Xinjiang Uygur Autonomous Region of China (2017D01A20), and the University Scientific Research Program of Xinjiang Uygur Autonomous Region (XJEDU2016S106).

Abstract: To address the rise in computing latency caused by a sharp increase in the input data rate on big data stream processing platforms, which degrades the real-time performance of computation, a dynamic dispatching strategy based on a flow network model was proposed and applied to the Apache Flink data stream processing platform. Firstly, a Directed Acyclic Graph (DAG) was transformed into a flow network by defining the capacity and flow of each edge, and a capacity detection algorithm was used to determine the capacity value of each edge. Secondly, a maximum flow algorithm was used to compute the corresponding improved network and optimization path, so as to raise cluster throughput while the input rate is rising; the feasibility of the algorithm was demonstrated by evaluating its time and space cost. Finally, the influence of an important parameter on the execution of the algorithm was discussed, and recommended parameter values for different types of jobs were obtained by experiments. The experimental results show that, compared with the existing task dispatching strategy of Apache Flink, the throughput optimization ratio of the proposed strategy exceeds 16.12% during the rising phases of the input rate for every job type tested, so the dynamic dispatching strategy effectively raises cluster throughput while satisfying the task latency constraint.

Key words: data stream, task scheduling, flow network, maximum flow, Apache Flink
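
The flow-network idea summarized in the abstract can be illustrated with a minimal sketch. It treats a small operator DAG as a flow network and computes its maximum flow with the standard Edmonds-Karp algorithm. The abstract does not say which maximum flow algorithm the authors use or how their capacity detection algorithm works, so Edmonds-Karp is a stand-in here, and the operator names (`source`, `map_a`, `map_b`, `sink`) and capacity values are hypothetical, chosen only for illustration.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow (a stand-in; not necessarily the paper's algorithm).

    capacity: dict mapping node -> {successor: edge capacity}.
    Returns (maximum flow value, residual network).
    """
    # residual[u][v] is the remaining capacity on edge u -> v.
    residual = {u: dict(edges) for u, edges in capacity.items()}
    # Add reverse edges with zero capacity so flow can be rerouted later.
    for u, edges in capacity.items():
        for v in edges:
            residual.setdefault(v, {}).setdefault(u, 0)

    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total, residual  # no augmenting path is left

        # Walk back from the sink to recover the path edges.
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]

        # Push as much flow as the tightest edge on the path allows.
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        total += bottleneck


# Hypothetical operator DAG; capacities are made-up tuple rates per edge.
capacity = {
    "source": {"map_a": 10, "map_b": 5},
    "map_a": {"sink": 7, "map_b": 4},
    "map_b": {"sink": 8},
}
flow, residual = max_flow(capacity, "source", "sink")
print(flow)  # 15
```

In this toy network the direct edge from `map_a` to `sink` saturates at 7, and the remaining 3 units reach the sink through `map_b`. Rerouting traffic along a path with spare capacity is the kind of optimization path the abstract refers to.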

CLC number: TP393.02

LI Ziyang, YU Jiong, BIAN Chen, LU Liang, PU Yonglin. Dynamic task dispatching strategy for stream processing based on flow network[J]. Journal of Computer Applications, 2018, 38(9): 2560-2567.
