
Improvement of MapReduce Shuffle Performance

  1. Hubei University
  • Received: 2016-05-24 Revised: 2016-07-04 Online: 2016-07-04
  • Corresponding author: Xiong Qian (熊倩)

Abstract: MapReduce is a programming model and a core component of Hadoop, and it largely determines Hadoop's performance and efficiency when processing big data. To address the time the Reduce side spends copying large volumes of result data from the Map side over the network, this paper proposes merging the temporary result data produced by all Map tasks of the same job on a Map node, replacing the original MapReduce mechanism, which merges the result data of each Map task separately. The scheme reduces the amount of output data on each Map node, which substantially reduces the data transferred across the cluster and shortens the time the Reduce side spends copying Map output. As a result, the overall execution time of a MapReduce job decreases and MapReduce execution performance improves.

Key words: Hadoop, MapReduce, Shuffle, performance

CLC number:
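
The difference between merging each Map task's output separately and merging the output of all Map tasks of one job on a node can be illustrated with a small simulation. This is a minimal sketch, not the paper's implementation or Hadoop code: the function names (`combine`, `per_task_combine`, `node_level_combine`), the word-count style summing combiner, and the sample data are all assumptions made for illustration, and partitioning by Reduce task is ignored.

```python
from collections import Counter


def combine(pairs):
    """Sum the values of identical keys (a word-count style combiner)."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def per_task_combine(task_outputs):
    """Baseline: combine the output of each Map task on its own."""
    return [combine(pairs) for pairs in task_outputs]


def node_level_combine(task_outputs):
    """Sketch of the proposed idea: combine across all Map tasks of one job on a node."""
    merged = [pair for pairs in task_outputs for pair in pairs]
    return combine(merged)


# Hypothetical raw output of three Map tasks of the same job on one node.
node_tasks = [
    [("hadoop", 1), ("shuffle", 1), ("hadoop", 1)],
    [("hadoop", 1), ("map", 1)],
    [("shuffle", 1), ("map", 1), ("map", 1)],
]

baseline = per_task_combine(node_tasks)
proposed = node_level_combine(node_tasks)

# Records this node would have to send to the Reduce side in each case.
baseline_records = sum(len(task) for task in baseline)
proposed_records = len(proposed)

print("per-task combine:  ", baseline_records, "records")
print("node-level combine:", proposed_records, "records")
```

In this toy input the same keys recur across tasks, so the node-level merge leaves fewer records to copy while the per-key totals stay the same. The sketch only works because the summing combiner is associative and commutative; the actual saving on a cluster depends on how often keys repeat across the Map tasks of a node.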