Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (12): 3401-3405. DOI: 10.11772/j.issn.1001-9081.2017.12.3401

• Advanced Computing •

Improved Spark Shuffle memory allocation algorithm

HOU Weifan, FAN Wei, ZHANG Yuxiang

  1. College of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China
  • Received: 2017-05-09  Revised: 2017-07-24  Online: 2017-12-10  Published: 2017-12-18
  • Corresponding author: HOU Weifan
  • About the authors: HOU Weifan (1992-), male, born in Ulanhot, Inner Mongolia, M.S. candidate; research interests: intelligent algorithms and data mining. FAN Wei (1968-), male, born in Qianxian, Shaanxi, professor, Ph.D., CCF member; research interests: intelligent information processing and software engineering. ZHANG Yuxiang (1975-), male, born in Wuzhai, Shanxi, associate professor, Ph.D., CCF member; research interests: network data analysis and distributed networks.
  • Supported by:
    the National Natural Science Foundation of China (U1533104).

Abstract: Shuffle performance is an important indicator of cluster performance for big data frameworks. Spark's own Shuffle memory allocation algorithm tries to allocate memory evenly to every Task in the memory pool, but experiments show that the imbalance of memory demands across Tasks leads to wasted memory and low running efficiency. To address this problem, an improved Spark Shuffle memory allocation algorithm was proposed. Based on the amount of memory a Task requests and its historical run-time data, Tasks are divided into two categories by memory demand: small memory-demand Tasks are handled with "split" processing, while memory for large memory-demand Tasks is allocated according to the number of times a Task has spilled to disk and its waiting time after spilling. By making full use of the free memory in the memory pool, the algorithm can adaptively adjust Task memory allocation when Task memory demands are unbalanced due to data skew. The experimental results show that, compared with the original algorithm, the improved algorithm reduces the spill rate of Tasks, decreases Task turnaround time, and improves the running performance of the cluster.
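To make the allocation policy concrete, below is a minimal, self-contained Scala sketch of the idea described in the abstract. It is not the authors' implementation, nor Spark's internal memory manager code; the classification threshold (half of the even per-Task share), the linear weighting of spill count and post-spill waiting time, and names such as TaskStats, spillCount, and waitAfterSpillMs are illustrative assumptions.

```scala
// Hypothetical sketch of the allocation policy described in the abstract.
// Thresholds, weights, and field names are assumptions, not values from the paper.
object ShuffleMemorySketch {

  /** Per-Task bookkeeping assumed to be available from the memory request and historical run data. */
  final case class TaskStats(requestedBytes: Long, spillCount: Int, waitAfterSpillMs: Long)

  /** Even split used by the default policy: each of n active Tasks gets at most poolSize / n bytes. */
  def evenShare(poolSize: Long, activeTasks: Int): Long =
    if (activeTasks == 0) poolSize else poolSize / activeTasks

  /** Improved share: small-demand Tasks are capped at what they ask for ("split" handling),
    * and the memory they free is granted to large-demand Tasks weighted by their spill behaviour. */
  def improvedShare(task: TaskStats, all: Seq[TaskStats], poolSize: Long): Long = {
    val even = evenShare(poolSize, all.size)
    val (small, large) = all.partition(_.requestedBytes <= even / 2)

    if (task.requestedBytes <= even / 2) {
      task.requestedBytes                                   // small-demand Task: no more than it requested
    } else {
      val freed = small.map(t => even - t.requestedBytes).sum  // memory released by small-demand Tasks
      val weight = (t: TaskStats) => 1.0 + t.spillCount + t.waitAfterSpillMs / 1000.0
      val totalWeight = large.map(weight).sum
      val bonus = if (totalWeight == 0.0) 0L else (freed * weight(task) / totalWeight).toLong
      even + bonus                                          // even share plus a spill-weighted bonus
    }
  }

  def main(args: Array[String]): Unit = {
    val pool = 1L << 30                                     // 1 GiB execution memory pool
    val tasks = Seq(
      TaskStats(requestedBytes = 64L << 20,  spillCount = 0, waitAfterSpillMs = 0),    // light Task
      TaskStats(requestedBytes = 512L << 20, spillCount = 3, waitAfterSpillMs = 4000), // skewed Task
      TaskStats(requestedBytes = 600L << 20, spillCount = 1, waitAfterSpillMs = 500)
    )
    tasks.foreach { t =>
      println(s"request=${t.requestedBytes} even=${evenShare(pool, tasks.size)} improved=${improvedShare(t, tasks, pool)}")
    }
  }
}
```

Under these assumptions, memory that small-demand Tasks leave unused is redistributed to large-demand Tasks in proportion to how often and how long they have been spilling, which mirrors the adaptive adjustment described above.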

Key words: Apache Spark, Shuffle, adaptive, memory allocation, running performance

CLC Number: