Task scheduling strategy based on data stream classification in Heron

Journal of Computer Applications, 2019, Vol. 39, Issue (4): 1106-1116. DOI: 10.11772/j.issn.1001-9081.2018081848

ZHANG Yitian¹, YU Jiong¹,², LU Liang³, LI Ziyang²

1. Software College, Xinjiang University, Urumqi Xinjiang 830008, China;
2. College of Information Science and Engineering, Xinjiang University, Urumqi Xinjiang 830046, China;
3. School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China

Received: 2018-09-09 Revised: 2018-11-13 Online: 2019-04-10 Published: 2019-04-10
Corresponding author: YU Jiong
About the authors: ZHANG Yitian (born 1995), male, from Shangqiu, Henan, M.S. candidate, CCF member; research interests: cloud computing, real-time computing, distributed computing. YU Jiong (born 1964), male, from Beijing, professor, doctoral supervisor, Ph.D., CCF member; research interests: grid computing, distributed computing. LU Liang (born 1990), male, from Xiangtan, Hunan, lecturer, Ph.D., CCF member; research interests: cloud computing, distributed computing, in-memory computing. LI Ziyang (born 1993), male, from Urumqi, Xinjiang, Ph.D. candidate, CCF member; research interests: cloud computing, distributed computing.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61462079, 61562078, 61562086), the National Key Technology R&D Program of China (2015BAH02F01), the Natural Science Foundation of Xinjiang Uygur Autonomous Region of China (2017D01A20), and the Educational Research Program of Xinjiang Uygur Autonomous Region of China (XJEDU2016S106).

Abstract

Apache Heron, a new platform for big data stream processing, uses a round-robin algorithm for task scheduling by default. This ignores the runtime state of the topology and the impact that different communication modes among task instances have on system performance. To solve this problem, a task scheduling strategy based on Data Stream Classification in Heron (DSC-Heron) was proposed, comprising a data stream classification algorithm, a data stream cluster allocation algorithm and a data stream classification scheduling algorithm. Firstly, the instance allocation model of Heron was established to clarify how communication overhead differs among the communication modes of task instances. Secondly, based on the data stream classification model, data streams were classified according to the real-time size of the streams between task instances. Finally, the packing plan of Heron was constructed by taking interrelated high-frequency data streams as a whole as the basic scheduling unit, converting as many inter-node data streams as possible into intra-node ones while satisfying the resource constraints, so as to minimize communication overhead. SentenceWordCount, WordCount and FileWordCount topologies were run in a Heron cluster of 9 nodes. The results show that, compared with the Heron default scheduling strategy, DSC-Heron improves system complete latency, inter-node communication overhead and system throughput by 8.35%, 7.07% and 6.83% on average respectively; in terms of load balancing, the standard deviations of CPU usage and memory usage of the working nodes decrease by 41.44% and 41.23% on average respectively. The experimental results show that DSC-Heron improves the performance of the test topologies, with the most significant effect on the FileWordCount topology, which is closest to a real application scenario.

Key words: big data, stream computing, Apache Heron, task scheduling, data stream classification, communication overhead
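The three steps named in the abstract (classify streams by size, group interrelated high-frequency streams, place each group on one node under a resource constraint) can be illustrated with a minimal sketch. This is not the authors' DSC-Heron implementation: the fixed `threshold`, the per-node `capacity` counted in instance slots, the toy topology and all function names are assumptions made for illustration only.

```python
from collections import defaultdict

def classify_streams(traffic, threshold):
    """Split streams into high- and low-frequency sets by measured rate."""
    high = {edge for edge, rate in traffic.items() if rate >= threshold}
    return high, set(traffic) - high

def build_clusters(instances, high):
    """Group instances connected by high-frequency streams (union-find)."""
    parent = {i: i for i in instances}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in high:
        parent[find(a)] = find(b)
    clusters = defaultdict(list)
    for i in instances:
        clusters[find(i)].append(i)
    # Largest clusters first, so they are placed while nodes are still empty.
    return sorted(clusters.values(), key=len, reverse=True)

def allocate(clusters, nodes, capacity):
    """Put each cluster on the least-loaded node if it fits, else spill it."""
    load = {n: 0 for n in nodes}
    plan = {}
    for cluster in clusters:
        target = min(nodes, key=lambda n: load[n])
        # Keep the cluster together when possible, otherwise place one by one.
        groups = [cluster] if load[target] + len(cluster) <= capacity \
            else [[inst] for inst in cluster]
        for group in groups:
            node = min(nodes, key=lambda n: load[n])
            if load[node] + len(group) > capacity:
                raise ValueError("not enough capacity (assumed slot model)")
            for inst in group:
                plan[inst] = node
            load[node] += len(group)
    return plan

def round_robin(instances, nodes):
    """Baseline: assign instances to nodes in turn, ignoring traffic."""
    return {inst: nodes[i % len(nodes)] for i, inst in enumerate(instances)}

def inter_node_traffic(traffic, plan):
    """Sum the rates of streams whose endpoints sit on different nodes."""
    return sum(r for (a, b), r in traffic.items() if plan[a] != plan[b])

# Hypothetical topology: tuples/s between task instances (made-up numbers).
instances = ["s1", "b1", "c1", "s2", "b2", "c2"]
traffic = {("s1", "b1"): 900, ("s2", "b2"): 850, ("b1", "c1"): 700,
           ("b2", "c2"): 40, ("s1", "b2"): 30, ("b1", "c2"): 20}
nodes = ["n1", "n2"]

high, low = classify_streams(traffic, threshold=500)
plan = allocate(build_clusters(instances, high), nodes, capacity=3)
print(inter_node_traffic(traffic, plan))
print(inter_node_traffic(traffic, round_robin(instances, nodes)))
```

On this toy input the stream-aware plan leaves only the two low-rate streams (30 + 20) crossing nodes, whereas the round-robin baseline sends all three high-frequency streams across nodes. The numbers are illustrative and say nothing about the measured results reported in the abstract.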

CLC number: TP311

ZHANG Yitian, YU Jiong, LU Liang, LI Ziyang. Task scheduling strategy based on data stream classification in Heron[J]. Journal of Computer Applications, 2019, 39(4): 1106-1116.

