Journal of Computer Applications, 2019, Vol. 39, Issue (4): 1106-1116. DOI: 10.11772/j.issn.1001-9081.2018081848


Task scheduling strategy based on data stream classification in Heron

ZHANG Yitian1, YU Jiong1,2, LU Liang3, LI Ziyang2   

    1. Software College, Xinjiang University, Urumqi Xinjiang 830008, China;
    2. College of Information Science and Engineering, Xinjiang University, Urumqi Xinjiang 830046, China;
    3. School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China
  • Received: 2018-09-09 Revised: 2018-11-13 Online: 2019-04-10 Published: 2019-04-10
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61462079, 61562078, 61562086), the National Key Technology R&D Program of China (2015BAH02F01), the Natural Science Foundation of Xinjiang Uygur Autonomous Region of China (2017D01A20), and the Educational Research Program of Xinjiang Uygur Autonomous Region of China (XJEDU2016S106).

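  • Corresponding author: YU Jiong
  • About the authors: ZHANG Yitian (张译天, born 1995), male, from Shangqiu, Henan, M.S. candidate, CCF member; research interests: cloud computing, real-time computing, distributed computing. YU Jiong (于炯, born 1964), male, from Beijing, professor, doctoral supervisor, Ph.D., CCF member; research interests: grid computing, distributed computing. LU Liang (鲁亮, born 1990), male, from Xiangtan, Hunan, lecturer, Ph.D., CCF member; research interests: cloud computing, distributed computing, in-memory computing. LI Ziyang (李梓杨, born 1993), male, from Urumqi, Xinjiang, Ph.D. candidate, CCF member; research interests: cloud computing, distributed computing.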

Abstract: Apache Heron, a new platform for big data stream processing, uses a round-robin algorithm for task scheduling by default. This algorithm ignores both the runtime state of the topology and the effect that different communication modes among task instances have on system performance. To solve this problem, a task scheduling strategy based on Data Stream Classification in Heron (DSC-Heron) is proposed, comprising a data stream classification algorithm, a data stream cluster allocation algorithm and a data stream classification scheduling algorithm. Firstly, an instance allocation model of Heron is established to clarify how communication overhead differs among the communication modes of task instances. Secondly, based on a data stream classification model of Heron, data streams are classified according to the real-time data stream size between task instances. Finally, the packing plan of Heron is constructed with each group of interrelated high-frequency data streams as the basic scheduling unit; while satisfying the resource constraints, as many inter-node data streams as possible are turned into intra-node ones to minimize communication cost. SentenceWordCount, WordCount and FileWordCount topologies were run in a Heron cluster environment with 9 nodes. Compared with the Heron default scheduling strategy, DSC-Heron improves system complete latency, inter-node communication overhead and system throughput by 8.35%, 7.07% and 6.83% on average, respectively; in terms of load balancing, the standard deviations of CPU usage and memory usage of the working nodes decrease by 41.44% and 41.23% on average, respectively. The experimental results show that DSC-Heron improves the performance of the tested topologies, with the most significant optimization effect on the FileWordCount topology, which is close to a real application scenario.

Key words: big data, stream computing, Apache Heron, task scheduling, data stream classification, communication overhead
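
The three steps named in the abstract (classify streams by size, group interrelated high-frequency streams, pack each group onto one node where it fits) can be illustrated with a small sketch. This is not the authors' DSC-Heron implementation: the fixed rate threshold, the union-find grouping, the instance-count capacity limit, and every function and variable name below are assumptions made for illustration only, and the round-robin baseline is a simplified stand-in for Heron's default scheduler.

```python
from collections import defaultdict


def classify_streams(stream_rates, threshold):
    """Split streams into high- and low-frequency sets by measured rate.

    stream_rates maps (sender, receiver) -> tuples per second (hypothetical unit).
    """
    high = {pair for pair, rate in stream_rates.items() if rate >= threshold}
    low = set(stream_rates) - high
    return high, low


def cluster_instances(instances, high_streams):
    """Group instances that are linked by high-frequency streams (union-find)."""
    parent = {inst: inst for inst in instances}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in high_streams:
        parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for inst in instances:
        clusters[find(inst)].append(inst)
    # Largest clusters first, so they are placed while nodes are still empty.
    return sorted(clusters.values(), key=len, reverse=True)


def pack(clusters, num_nodes, capacity):
    """Put each cluster whole on the least-loaded node if it fits;
    otherwise spread its instances one by one over the least-loaded nodes."""
    nodes = [[] for _ in range(num_nodes)]
    for cluster in clusters:
        target = min(nodes, key=len)
        if len(target) + len(cluster) <= capacity:
            target.extend(cluster)
            continue
        for inst in cluster:
            least = min(nodes, key=len)
            if len(least) >= capacity:
                raise ValueError("not enough capacity for all instances")
            least.append(inst)
    return nodes


def round_robin(instances, num_nodes):
    """Simplified baseline: assign instances to nodes in turn."""
    nodes = [[] for _ in range(num_nodes)]
    for i, inst in enumerate(instances):
        nodes[i % num_nodes].append(inst)
    return nodes


def inter_node_rate(stream_rates, nodes):
    """Total rate of streams whose endpoints sit on different nodes."""
    where = {inst: n for n, members in enumerate(nodes) for inst in members}
    return sum(rate for (a, b), rate in stream_rates.items()
               if where[a] != where[b])
```

With a made-up five-instance topology, `pack(cluster_instances(instances, high), 2, 3)` can be compared against `round_robin(instances, 2)` by calling `inter_node_rate` on both plans; the sketch only shows why keeping high-frequency streams inside one node lowers inter-node traffic and says nothing about the measured results reported above.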

