Journal of Computer Applications, 2020, Vol. 40, Issue (10): 2923-2928. DOI: 10.11772/j.issn.1001-9081.2020030348

• Data Science and Technology •

面向大数据任务的调度方法

李孜颖, 石振国   

  1. 南通大学 信息科学技术学院, 江苏 南通 226001
  • 收稿日期:2020-03-24 修回日期:2020-05-08 出版日期:2020-10-10 发布日期:2020-05-18
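  • Corresponding author: SHI Zhenguo
  • About the authors: LI Ziying (b. 1996), female, from Nanjing, Jiangsu, M.S. candidate; main research interests: big data, artificial intelligence. SHI Zhenguo (b. 1963), male, from Nantong, Jiangsu, associate professor, Ph.D., CCF member; main research interests: artificial intelligence, machine learning.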
  • 基金资助:
    江苏省自然科学基金资助项目(18KJB520041);南通市科技项目(JC2018132);南京航空航天大学高安全系统的软件开发与验证技术工业和信息化部重点实验室开放基金资助项目(NJ2018014)。

Scheduling method for big data tasks

LI Ziying, SHI Zhenguo   

  1. School of Information Science and Technology, Nantong University, Nantong Jiangsu 226001, China
  • Received:2020-03-24 Revised:2020-05-08 Online:2020-10-10 Published:2020-05-18
  • Supported by:
    This work is partially supported by the Natural Science Foundation of Jiangsu Province (18KJB520041), the Science and Technology Project of Nantong City (JC2018132), and the Open Project of the Key Laboratory of Safety-Critical Software of the Ministry of Industry and Information Technology at Nanjing University of Aeronautics and Astronautics (NJ2018014).

摘要: 针对在大数据的处理过程中,对大数据任务的划分和资源分配缺乏合理性的问题,提出一种面向大数据任务的调度方法。该方法首先引入了调度理论用于处理大数据任务,帮助建立合理的大数据任务管理体系并规范大数据任务处理流程;然后,基于大数据任务的本质对数据集进行分析处理,引入决策表进行属性约简,以减小大数据分析任务的数据量和提高大数据分析效率;最后,采用模糊综合评价方法,将模糊综合评价的结果作为对任务调度的依据,以提高任务资源分配合理性。在UCI(University of California Irvine)数据集上进行测试,实验结果表明,该调度算法在平均预测准确度上比朴素贝叶斯(NB)算法高7.42个百分点,比误差反向传播(BP)算法高5.16个百分点,比均方根传递(RMSProp)算法高3.74个百分点。而对于特征数较多的数据集,所提算法在预测精度上较其他算法有显著提高。所提算法在平均调度长度比(SLR)上较HCPFS(Heterogeneous Critical Path First Synthesis)算法和HIPLTS(Heterogeneous Improved Priority List for Task Scheduling)算法分别下降了12.14%和4.56%,在平均加速比上分别提升了7.14%和42.56%,表明该算法能有效提高大数据系统中任务调度的效率。综合比较分析,所提方法具有较高的预测精度,且高效可靠。

关键词: 大数据, 任务调度, 决策表, 属性约简, 模糊综合评价

Abstract: To address the lack of rationality in task division and resource allocation during big data processing, a scheduling method for big data tasks was proposed. First, scheduling theory was introduced to handle big data tasks, in order to establish a reasonable management system for them and standardize the big data task processing flow. Then, based on the nature of big data tasks, the datasets were analyzed and processed, and a decision table was introduced to perform attribute reduction, so as to reduce the data volume of big data analysis tasks and improve analysis efficiency. Finally, the fuzzy comprehensive evaluation method was adopted, and its result was used as the basis for task scheduling, thereby improving the rationality of task resource allocation. Experimental results on University of California Irvine (UCI) datasets show that the average prediction accuracy of the proposed scheduling algorithm is 7.42 percentage points higher than that of the Naive Bayes (NB) algorithm, 5.16 percentage points higher than that of the error Back Propagation (BP) algorithm, and 3.74 percentage points higher than that of the Root Mean Square Propagation (RMSProp) algorithm. For datasets with a large number of features, the prediction accuracy of the proposed algorithm is significantly higher than those of the other algorithms. Compared with the Heterogeneous Critical Path First Synthesis (HCPFS) algorithm and the Heterogeneous Improved Priority List for Task Scheduling (HIPLTS) algorithm, the proposed algorithm decreases the average Scheduling Length Ratio (SLR) by 12.14% and 4.56% respectively, and increases the average speedup ratio by 7.14% and 42.56% respectively, showing that the algorithm can effectively improve the efficiency of task scheduling in big data systems. Comprehensive analysis shows that the proposed method achieves high prediction accuracy and is efficient and reliable.

Key words: big data, task scheduling, decision table, attribute reduction, fuzzy comprehensive evaluation
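
The abstract names fuzzy comprehensive evaluation as the basis for task scheduling but does not give the factors, weights, or composition operator used. The following is only a minimal sketch of the generic technique, assuming the common weighted-average operator B = W · R and the maximum-membership principle; the factor weights, the membership matrix, and the function names are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def fuzzy_comprehensive_evaluation(
    weights: Sequence[float],
    membership: Sequence[Sequence[float]],
) -> List[float]:
    """Combine factor weights W with a membership matrix R (factors x grades).

    Uses the weighted-average operator B = W . R and normalizes the result
    so the grade scores sum to 1. This operator choice is an assumption.
    """
    if len(weights) != len(membership):
        raise ValueError("one weight is required per factor (row of R)")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    w = [x / total_weight for x in weights]  # normalize the weight vector

    n_grades = len(membership[0])
    scores = [
        sum(w[i] * membership[i][j] for i in range(len(w)))
        for j in range(n_grades)
    ]
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else scores


def best_grade(scores: Sequence[float]) -> int:
    """Maximum-membership principle: index of the grade with the top score."""
    return max(range(len(scores)), key=lambda j: scores[j])


# Hypothetical task: three factors (e.g. data volume, urgency, resource need)
# rated against three priority grades (high, medium, low).
weights = [0.5, 0.3, 0.2]
membership = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
]
scores = fuzzy_comprehensive_evaluation(weights, membership)
grade = best_grade(scores)
```

In a scheduler built along these lines, the resulting grade (or the full score vector) could be used to order tasks or to size their resource allocation; how the paper maps evaluation results to scheduling decisions is not stated in the abstract.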

CLC number: