Partitioning and mapping algorithm for in-memory computing framework based on iterative filling

doi:10.11772/j.issn.1001-9081.2017.03.647

Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (3): 647-653.DOI: 10.11772/j.issn.1001-9081.2017.03.647

Previous Articles Next Articles

Partitioning and mapping algorithm for in-memory computing framework based on iterative filling

BIAN Chen, YU Jiong, XIU Weirong, YING Changtian, QIAN Yurong

School of Information Science and Engineering, Xinjiang University, Urumqi Xinjiang 830046, China

Received:2016-09-26 Revised:2016-10-17 Online:2017-03-22 Published:2017-03-10
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61262088, 61462079, 61363083, 61562086), the Educational Research Program of Xinjiang Uygur Autonomous Region(XJEDU2016S106).

基于迭代填充的内存计算框架分区映射算法

卞琛, 于炯, 修位蓉, 英昌甜, 钱育蓉

新疆大学信息科学与工程学院, 乌鲁木齐 830046

通讯作者: 卞琛
作者简介:卞琛(1981-),男,江苏南京人,副教授,博士研究生,CCF会员,主要研究方向:网络计算、分布式系统;于炯(1964-),男,北京人,教授,博士,CCF高级会员,主要研究方向:网格计算、高性能计算;修位蓉(1979-),女,重庆人,讲师,硕士,主要研究方向:数据挖掘、分布式应用;英昌甜(1989-),女,新疆乌鲁木齐人,博士研究生,主要研究方向:大数据存储、内存计算;钱育蓉(1979-),女,新疆乌鲁木齐人,副教授,博士,CCF会员,主要研究方向:云计算、图形图像处理。
基金资助:
国家自然科学基金资助项目（61262088，61462079，61363083，61562086）；新疆维吾尔自治区高校科研计划项目（XJEDU2016S106）。

Abstract

Abstract: Focusing on the issue that the only one Hash/Range partitioning strategy in Spark usually results in unbalanced data load at Reduce phase and increases job duration sharply, an Iterative Filling data Partitioning and Mapping algorithm (IFPM) which include several innovative approaches was proposed. First of all, according to the analysis of job execute scheme of Spark, the job efficiency model and partition mapping model were established, the definitions of job execute timespan and allocation incline degree were given. Moreover, the Extendible Partitioning Algorithm (EPA) and Iterative Mapping Algorithm (IMA) were proposed, which reserved partial data into extend region by one-to-many partition function at Map phase. Data in extended region would be mapped by extra iterative allocation until the approximate data distribution was obtained, and the adaptive mapping function was executed by awareness of calculated data size at Reduce phase to revise the unbalanced data load in original region allocation. Experimental results demonstrate that for any distribution of the data, IFPM promotes the rationality of data load allocation from Map phase to Reduce phase and optimize the job efficiency of in-memory computing framework.

Key words: in-memory computing, load balance, extendible partitioning, iterative mapping

摘要： 针对内存计算框架Spark在作业Shuffle阶段一次分区产生的数据倾斜问题，提出一种内存计算框架的迭代填充分区映射算法（IFPM）。首先，分析Spark作业的执行机制，建立作业效率模型和分区映射模型，给出作业执行时间和分配倾斜度的定义，证明这些定义与作业执行效率的因果逻辑关系；然后，根据模型和定义求解，设计扩展式数据分区算法（EPA）和迭代式分区映射算法（IMA），在Map端建立一对多分区函数，并通过分区函数将部分数据填入扩展区内，在数据分布局部感知后再执行扩展区迭代式的多轮数据分配，根据Reduce端已分配数据量建立适应性的扩展区映射规则，对原生区的数据倾斜进行逐步修正，以此保障数据分配的均衡性。实验结果表明，在不同源数据分布条件下，算法均提高了作业Shuffle过程分区映射合理性，缩减了宽依赖Stage的同步时间，提高了作业执行效率。

关键词: 内存计算, 数据均衡, 扩展式分区, 迭代式映射

CLC Number:

TP393.09

BIAN Chen, YU Jiong, XIU Weirong, YING Changtian, QIAN Yurong. Partitioning and mapping algorithm for in-memory computing framework based on iterative filling[J]. Journal of Computer Applications, 2017, 37(3): 647-653.

卞琛, 于炯, 修位蓉, 英昌甜, 钱育蓉. 基于迭代填充的内存计算框架分区映射算法[J]. 计算机应用, 2017, 37(3): 647-653.

References

[1] STRANDE S M, CICOTTI P, SINKOVITS R S, et al. Gordon:design, performance, and experiences deploying and supporting a data intensive supercomputer[C]//Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment:Bridging from the Extreme to the Campus and Beyond. New York:ACM, 2012:Article No. 3.
[2] BRONEVETSKY G, MOODY A. Scalable I/O systems via node-local storage:approaching 1 TB/sec file I/O, LLNL-TR-415791[R]. Livermore, CA:Lawrence Livermore National Laboratory, 2009:1-6.
[3] ZAHARIA M, CHOWDHURY M, DAS T, et al. Fast and interactive analytics over Hadoop data with Spark[J]. Login, 2012, 37(4):45-51.
[4] Apache Spark. Spark overview[EB/OL].[2015-03-18]. http://spark.apache.org.
[5] ZAHARIA M, CHOWDHURY M, DAS T, et al. Resilient distributed datasets:a fault-tolerant abstraction for in-memory cluster computing[C]//Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. Berkeley, CA:USENIX Association, 2012:2.
[6] LIN X, WANG P, WU B. Log analysis in cloud computing environment with Hadoop and Spark[C]//Proceedings of the 5th IEEE International Conference on Broadband Network and Multimedia Technology. Piscataway, NJ:IEEE, 2013:273-276.
[7] DONG X, XIE Y, MURALIMANOHAR N, et al. Hybrid checkpointing using emerging nonvolatile memories for future exascale systems[J]. ACM Transactions on Architecture and Code Optimization, 2011, 8(2):510-521.
[8] 慈轶为,张展,左德承,等.可扩展的多周期检查点设置[J].软件学报,2010,21(2):218-230.(CI Y W, ZHANG Z, ZUO D C, et al. Scalable time-based multi-cycle checkpointing[J]. Journal of Software, 2010, 21(2):218-230.)
[9] DEAN J, GHEMAWAT S. MapReduce:simplified data processing on large clusters[C]//Proceedings of the 6th Conference on Symposium on Opearting Systems Design and Implementation. Berkeley, CA:USENIX Association, 2004,6:10.
[10] KWON Y, BALAZINSKA M, HOWE B, et al. A study of skew in MapReduce application[EB/OL].[2016-03-18]. https://www.researchgate.net/publication/228941278_A_Study_of_Skew_in_MapReduce_Applications.
[11] KWON Y, BALAZINSKA M, HOWE B, et al. Skew-resistant parallel processing of feature-extracting scientific user-defined functions[C]//Proceedings of the 1st ACM Symposium on Cloud Computing. New York:ACM, 2010:75-86.
[12] 王卓,陈群,李战怀,等.基于增量式分区策略的MapReduce数据均衡方法[J].计算机学报,2016,39(1):19-35.(WANG Z, CHEN Q, LI Z H, et al. An incremental partitioning strategy for data balance on MapReduce[J]. Chinese Journal of Computers, 2016, 39(1):19-35.)
[13] KWON Y, BALAZINSKA M, HOWE B, et al. SkewTune:mitigating skew in MapReduce applications[C]//Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. New York:ACM, 2012:25-36.
[14] YAN W, XUE Y, MALIN B. Scalable and robust key group size estimation for reducer load balancing in MapReduce[C]//Proceedings of the 2013 IEEE International Conference on Big Data. Piscataway, NJ:IEEE, 2013:156-162.
[15] RAMAKRISHNAN S R, SWART G, URMANOV A, et al. Balancing reducer skew in MapReduce workloads using progressive sampling[C]//Proceedings of the 3rd ACM Symposium on Cloud Computing. New York:ACM, 2012:Article No. 16.
[16] GUFLER B, AUGSTEN N, REISER A, et al. Handing data skew in MapReduce[C]//Proceedings of the 1st International Conference on Cloud Computing and Services Science. Berlin:Springer, 2011:574-583.
[17] GUFLER B, AUGSTEN N, REISER A, et al. Load balancing in MapReduce based on scalable cardinality estimates[C]//Proceedings of the 2012 IEEE 28th International Conference on Data Engineering. Washington, DC:IEEE Computer Society, 2012:522-533.
[18] KOLB L, THOR A, RAHM E. Load balancing for MapReduce-based entity resolution[C]//Proceedings of the 2012 IEEE 28th International Conference on Data Engineering. Washington, DC:IEEE Computer Society, 2012:618-629.
[19] KOLB L, THOR A, RAHM E, et al. Block-based load balancing for entity resolution with MapReduce[C]//Proceedings of the 20th ACM International Conference on Information and Knowledge Management. New York:ACM, 2011:2397-2400.
[20] RACHA S C. Load balancing Map-Reduce communications for efficient executions of applications in a cloud[D]. Bangalore, India:Indian Institute of Science, 2012:12-16.
[21] IBRAHIM S, JIN H, LU L, et al. Handling partitioning skew in MapReduce using LEEN[J]. Peer-to-Peer Networking and Applications, 2013, 6(4):409-424.
[22] JURE L. Stanford network analysis project[EB/OL].[2015-03-18]. http://snap.stanford.edu.

Partitioning and mapping algorithm for in-memory computing framework based on iterative filling

基于迭代填充的内存计算框架分区映射算法

PDF

Knowledge

Abstract

Cite this article

share this article

References

Related Articles 15

Recommended Articles

Metrics

[1]	Qi GAO, Na LYU, Jingcheng MIAO. Wireless virtual network embedding algorithm based on load balance [J]. Journal of Computer Applications, 2022, 42(10): 3148-3153.
[2]	FAN Zifu, ZHOU Kaiheng, YAO Jie. Game theory based SDN master controller reselection mechanism [J]. Journal of Computer Applications, 2018, 38(3): 776-779.
[3]	XU Lijie. Broadcasting energy-efficiency optimization algorithms for asymmetric duty-cycled sensor network [J]. Journal of Computer Applications, 2018, 38(11): 3275-3281.
[4]	LUO Xiaoxia, SI Fengwei, LUO Xiangyu. Effects of large-scale graph structural feature on partitioning quality [J]. Journal of Computer Applications, 2018, 38(1): 1-5.
[5]	ZHANG Jian, WANG Yang, LIU Dandan. Yac:yet another distributed consensus algorithm [J]. Journal of Computer Applications, 2017, 37(9): 2524-2530.
[6]	LI Zhuhong, ZHAO Canming, ZHOU Fang, ZHANG Xinming. Efficient and load balanced open shortest path first protocol in electric power communication network [J]. Journal of Computer Applications, 2017, 37(7): 1873-1876.
[7]	WANG Yefu, ZENG Guosun, DING Chunling. Load-aware optimal placement for heterogeneous controllers in software-defined networking based on unbalanced graph partitioning [J]. Journal of Computer Applications, 2016, 36(4): 883-888.
[8]	WANG Guilan, ZHOU Guoliang, SA Churila, ZHU Yongli. Parallel fuzzy C-means clustering algorithm in Spark [J]. Journal of Computer Applications, 2016, 36(2): 342-347.
[9]	ZHAO Canming, LI Zhuhong, YAN Fan, ZHANG Xinming. Load balanced routing protocol in electric power communication networks [J]. Journal of Computer Applications, 2016, 36(11): 3028-3032.
[10]	SHEN Li, LI Jie, ZHU Huayong. Task coordination and workload balance method for multi-robot based on trading-tree [J]. Journal of Computer Applications, 2016, 36(11): 3127-3130.
[11]	SUN Xuemei, ZHANG Xinzhong, WANG Yaning, ZHANG Tianyuan. Improvement on RaSMaLai in wireless sensor networks [J]. Journal of Computer Applications, 2015, 35(9): 2436-2439.
[12]	CHEN Feiyu, WANG Binqiang, WANG Wenbo, WANG Zhiming. Progressive auction based switch migration mechanism in software defined network [J]. Journal of Computer Applications, 2015, 35(8): 2118-2123.
[13]	ZHENG Hongxing, WU Yue, TU Chuang, LIU Jinping. Quay crane allocation and scheduling joint optimization model for single ship [J]. Journal of Computer Applications, 2015, 35(1): 247-251.
[14]	CAI Song ZHANG Jianming CHEN Jiming PAN Jingui. Load balancing technology based on naive Bayes algorithm in cloud computing environment [J]. Journal of Computer Applications, 2014, 34(2): 360-364.
[15]	FENG Yong YIN Jiena XU Hongyan. Distributed rules mining algorithm with load balance based on vertical FP-tree [J]. Journal of Computer Applications, 2014, 34(2): 396-400.