Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (4): 1078-1083. DOI: 10.11772/j.issn.1001-9081.2017092358

• Advanced Computing •

Intermediate data communication optimization for the MapReduce computing model

CAO Yunpeng1,2, WANG Haifeng1,2

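  1. College of Information Science and Engineering, Linyi University, Linyi, Shandong 276002, China;
    2. Institute of Linyi University, Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Linyi, Shandong 276002, China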
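  • Received: 2017-09-29 Revised: 2017-12-01 Online: 2018-04-10 Published: 2018-04-09
  • Corresponding author: WANG Haifeng
  • About the authors: CAO Yunpeng (born 1967), male, from Linyi, Shandong, associate professor, M.S.; main research interests: big data and distributed computing. WANG Haifeng (born 1976), male, from Linyi, Shandong, professor, Ph.D., CCF member; main research interests: big data and distributed computing.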
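  • Funding:
    Natural Science Foundation of Shandong Province (ZR2017MF050, ZR2015FL014); Higher Educational Science and Technology Program of Shandong Province (J17KA049); Independent Innovation and Achievements Transformation Special Project of Shandong Province (2014ZZCX02702); Key Research and Development Project of Shandong Province (2016GGX109001).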

Communication optimization for intermediate data of MapReduce computing model

CAO Yunpeng1,2, WANG Haifeng1,2   

  1. College of Information Science and Engineering, Linyi University, Linyi, Shandong 276002, China;
    2. Institute of Linyi University, Shandong Provincial Key Laboratory of Network Based Intelligent Computing, Linyi, Shandong 276000, China
  • Received: 2017-09-29 Revised: 2017-12-01 Online: 2018-04-10 Published: 2018-04-09
  • Supported by:
    This work is partially supported by the Natural Science Foundation of Shandong Province (ZR2017MF050, ZR2015FL014), the Higher Educational Science and Technology Program of Shandong Province (J17KA049), the Independent Innovation and Achievements Transformation Special Project of Shandong Province (2014ZZCX02702), and the Key Research and Development Project of Shandong Province (2016GGX109001).

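Abstract: The MapReduce computing model generates massive intermediate data at the end of the Map phase, which causes a large volume of data communication across rack switches. To address this problem, an intermediate data communication optimization method for Map-intensive jobs is proposed. First, features are extracted from the pre-run scheduling information of MapReduce jobs, and the data communication activity is quantified. Then, a naive Bayes classification model is used for classification prediction, with the running data of historical jobs serving as training samples. Finally, according to the predicted job classes, communication-active jobs are mapped together onto the same rack, so that improved communication locality relieves the performance bottleneck. Experimental results show that the proposed scheme is clearly effective for jobs with a dense Shuffle sub-process, improving computing performance by 4% to 5%; in addition, in multi-user scenarios it reduces intermediate data communication latency by 4.1%. The proposed method can effectively reduce communication latency in big data computation and improve the computing performance of heterogeneous clusters.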

Keywords: MapReduce computing model, big data processing, communication optimization, intermediate data, machine learning

Abstract: In the MapReduce computing model, a large amount of intermediate data is generated after the Map phase, which causes heavy data communication across rack switches. To address this problem, an intermediate data communication optimization method was proposed for Map-intensive jobs. Firstly, features were extracted from the pre-run scheduling information of MapReduce jobs, and the data communication activity was quantified. Then, a naive Bayes classification model, trained on the running data of historical jobs, was used for classification prediction. Finally, according to the predicted classes, jobs with active intermediate data communication were mapped into the same rack to improve communication locality. The experimental results show that the proposed scheme has a clear effect on shuffle-intensive jobs, improving computing performance by 4% to 5%. In a multi-user, multi-job environment, the communication latency of intermediate data can be reduced by 4.1%. The proposed method can effectively reduce the communication latency in big data processing and improve the computing performance of heterogeneous clusters.

Key words: MapReduce computing model, big data processing, communication optimization, intermediate data, machine learning
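
The three steps summarized in the abstract (feature extraction, naive Bayes prediction, rack-local placement) can be illustrated with a minimal sketch. This is not the authors' implementation: the feature names (input size, number of map tasks, number of reduce tasks), the "active"/"inactive" labels, the toy history, the rack names, and the round-robin fallback are all assumptions made for illustration, and rack capacity is ignored.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.total = len(labels)
        # value_counts[(label, feature_index)][value] -> occurrences
        self.value_counts = defaultdict(Counter)
        self.values = defaultdict(set)  # distinct values seen per feature
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[(label, i)][value] += 1
                self.values[i].add(value)
        return self

    def predict(self, row):
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / self.total)  # log prior
            for i, value in enumerate(row):
                count = self.value_counts[(label, i)][value]
                score += math.log((count + 1) / (n + len(self.values[i])))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def place_jobs(jobs, model, hot_rack="rack-0", other_racks=("rack-1", "rack-2")):
    """Map jobs predicted 'active' to one rack; spread the rest round-robin."""
    placement, k = {}, 0
    for name, features in jobs.items():
        if model.predict(features) == "active":
            placement[name] = hot_rack
        else:
            placement[name] = other_racks[k % len(other_racks)]
            k += 1
    return placement


# Hypothetical history: (input size, map tasks, reduce tasks) -> communication class
history = [
    (("large", "many", "many"), "active"),
    (("large", "many", "few"), "active"),
    (("large", "few", "many"), "active"),
    (("small", "few", "few"), "inactive"),
    (("small", "many", "few"), "inactive"),
    (("small", "few", "many"), "inactive"),
]
model = NaiveBayes().fit([f for f, _ in history], [c for _, c in history])

# Hypothetical pending jobs described by pre-run scheduling features
jobs = {
    "sort": ("large", "many", "many"),
    "join": ("large", "many", "few"),
    "grep": ("small", "few", "few"),
}
placement = place_jobs(jobs, model)
```

Here the two jobs predicted to be communication-active ("sort" and "join") land on the same rack, so their shuffle traffic would stay behind one rack switch, while "grep" is placed elsewhere. A real scheduler would also have to respect slot availability and data locality on each rack.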

CLC number: