Task scheduling strategy based on data stream classification in Heron
ZHANG Yitian1, YU Jiong1,2, LU Liang3, LI Ziyang2
1. Software College, Xinjiang University, Urumqi Xinjiang 830008, China; 2. College of Information Science and Engineering, Xinjiang University, Urumqi Xinjiang 830046, China; 3. School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China
Abstract:In a new platform for big data stream processing called Heron, the round-robin scheduling algorithm is usually used for task scheduling by default, which does not consider the topology runtime state and the impact of different communication modes among task instances on Heron's performance. To solve this problem, a task scheduling strategy based on Data Stream Classification in Heron (DSC-Heron) was proposed, including data stream classification algorithm, data stream cluster allocation algorithm and data stream classification scheduling algorithm. Firstly, the instance allocation model of Heron was established to clarify the difference in communication overhead among different communication modes of the task instances. Secondly, the data stream was classified according to the real-time data stream size between task instances based on the data stream classification model of Heron. Finally, the packing plan of Heron was constructed by using the interrelated high-frequency data streams as the basic scheduling unit to complete the scheduling to minimize the communication cost by transforming inter-node data streams into intra-node ones as many as possible. After running SentenceWordCount, WordCount and FileWordCount topologies in a Heron cluster environment with 9 nodes, the results show that compared with the Heron default scheduling strategy, DSC-Heron has 8.35%, 7.07% and 6.83% improvements in system complete latency, inter-node communication overhead and system throughput respectively; in the load balancing aspect, the standard deviations of CPU usage and memory usage of the working nodes are decreased by 41.44% and 41.23% respectively. All experimental results show that DSC-Heron can effectively improve the performance of the topologies, and has the most significant optimization effect on FileWordCount topology which is close to the real application scenario.
张译天, 于炯, 鲁亮, 李梓杨. 大数据流式计算框架Heron环境下的流分类任务调度策略[J]. 计算机应用, 2019, 39(4): 1106-1116.
ZHANG Yitian, YU Jiong, LU Liang, LI Ziyang. Task scheduling strategy based on data stream classification in Heron. Journal of Computer Applications, 2019, 39(4): 1106-1116.
[1] 孙大为.大数据流式计算:应用特征和技术挑战[J]. 大数据, 2015, 1(3):99-105. (SUN D W. Big data stream computing:features and challenges[J]. Big Data Research, 2015, 1(3):99-105.) [2] Seagate. Data age 2025[EB/OL].[2018-08-10]. https://www.seagate.com/files/www-content/our-story/trends/files/data-age-2025-white-paper-simplified-chinese.pdf. [3] 孙大为, 张广艳, 郑纬民.大数据流式计算:关键技术及系统实例[J]. 软件学报, 2014, 25(4):839-862. (SUN D W, ZHANG G Y, ZHENG W M. Big data stream computing:technologies and instances[J]. Journal of Software, 2014, 25(4):839-862.) [4] TOSHNIWAL A, TANEJA S, SHUKLA A, et al. Storm@Twitter[C]//Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. New York:ACM, 2014:147-156. [5] CARBONE P, EWEN S, HARIDI S, et al. Apache FlinkTM:stream and batch processing in a single engine[J]. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015, 36(4):28-38. [6] ANIELLO L, BALDONI R, QUERZONI L. Adaptive online scheduling in Storm[C]//Proceedings of the 7th ACM International Conference on Distributed Event-Based Systems. New York:ACM, 2013:207-218. [7] XU J L, CHEN Z H, TANG J, et al. T-Storm:traffic-aware online scheduling in Storm[C]//Proceedings of the 34th IEEE International Conference on Distributed Computing Systems. Piscataway, NJ:IEEE, 2014:535-544. [8] PENG B Y, HOSSEINI M, HONG Z H, et al. R-Storm:resource-aware scheduling in Storm[C]//Proceedings of the 16th Annual Middleware Conference. New York:ACM, 2015:149-161. [9] 鲁亮, 于炯, 卞琛, 等.大数据流式计算框架Storm的任务迁移策略[J]. 计算机研究与发展, 2018, 55(1):71-92. (LU L, YU J, BIAN C, et al. A task migration strategy in big data stream computing with Storm[J]. Journal of Computer Research and Development, 2018, 55(1):71-92.) [10] 李梓杨, 于炯, 卞琛, 等.基于流网络的流式计算动态任务调度策略[J]. 计算机应用, 2018, 38(9):2560-2567. (LI Z Y, YU J, BIAN C, et al. Dynamic task dispatching strategy for stream processing based on flow network[J]. Journal of Computer Applications, 2018, 38(9):2560-2567.) [11] de ASSUNCAO M D, da SILVA VEITH A, BUYYA R. Distributed data stream processing and edge computing:a survey on resource elasticity and future directions[J]. Journal of Network & Computer Applications, 2018, 103:1-17. [12] SHUKLA A, SIMMHAN Y. Model-driven scheduling for distributed stream processing systems[J]. Journal of Parallel & Distributed Computing, 2018, 117:98-114. [13] TRUONG T M, HARWOOD A, SINNOTT R O. Predicting the stability of large-scale distributed stream processing systems on the cloud[C]//Proceedings of the 7th International Conference on Cloud Computing and Services Science. Piscataway, NJ:IEEE, 2017:603-610. [14] SUN D, HUANG R. A stable online scheduling strategy for real-time stream computing over fluctuating big data streams[J]. IEEE Access, 2016, 4:8593-8607. [15] KULKARNI S, BHAGAT N, FU M, et al. Twitter Heron:stream processing at scale[C]//Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. New York:ACM, 2015:239-250. [16] FU M, MITTAL S, KEDIGEHALLI V, et al. Streaming@Twitter[J]. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2015, 38(4):15-27. [17] FU M, AGRAWAL A, FLORATOU A, et al. Twitter Heron:towards extensible streaming engines[C]//Proceedings of the 2017 IEEE 33rd International Conference on Data Engineering. Piscataway, NJ:IEEE, 2017:35-44. [18] Apache. Apache Aurora[EB/OL].[2018-08-10]. http://aurora.apache.org. [19] HINDMAN B, KONWINSKI A, ZAHARIA M, et al. Mesos:a platform for fine-grained resource sharing in the data center[C]//Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation. Berkeley:USENIX Association, 2010:429-483. [20] VAVILAPALLI V K, MURTHY A C, AGARWAL S, et al. Apache Hadoop YARN:yet another resource negotiator[C]//Proceedings of the 4th Annual Symposium on Cloud Computing. New York:ACM, 2013:5. [21] KREPS J, NARKHEDE N, RAO J. Kafka:a distributed messaging system for log processing[EB/OL].[2018-05-10]. http://pages.cs.wisc.edu/~akella/CS744/F17/838-CloudPapers/Kafka.pdf. [22] Apache. Apache DistributedLog[EB/OL].[2018-05-10]. http://bookkeeper.apache.org/distributedlog/. [23] Twitter. Implementing a custom scheduler[EB/OL].[2018-05-10]. https://apache.github.io/incubator-heron/docs/contributors/custom-scheduler/. [24] KULKARNI S. Apache/incubator-heron[EB/OL].[2018-05-16]. https://github.com/apache/incubator-heron. [25] KAMBURUGAMUVE S, RAMASAMY K, SWANY M, et al. Low latency stream processing:Apache Heron with Infiniband & Intel Omni-Path[C]//Proceedings of the 10th International Conference on Utility and Cloud Computing. New York:ACM, 2017:101-110.