Journal of Computer Applications ›› 2013, Vol. 33 ›› Issue (09): 2486-2489. DOI: 10.11772/j.issn.1001-9081.2013.09.2486

• Database Technology •

MapReduce-based programming model for importing big tables into Hadoop

CHEN Jirong, LE Jiajin

  1. School of Computer Science and Technology, Donghua University, Shanghai 201620, China
  • Received: 2013-03-18  Revised: 2013-04-20  Published: 2013-09-01  Online: 2013-10-18
  • Corresponding author: CHEN Jirong
  • About the authors: CHEN Jirong (1971-), male, born in Shucheng, Anhui, lecturer, Ph. D., research interests: big data platforms in the Hadoop ecosystem;
    LE Jiajin (1951-), male, born in Shanghai, professor, Ph. D. supervisor, research interests: data engineering.
  • Supported by:

    National Science and Technology Major Project of Core Electronic Devices, High-end Generic Chips and Basic Software

Programming model based on MapReduce for importing big tables into HDFS

CHEN Jirong, LE Jiajin

  1. School of Computer Science and Technology, Donghua University, Shanghai 201620, China
  • Received: 2013-03-18  Revised: 2013-04-20  Online: 2013-10-18  Published: 2013-09-01
  • Contact: CHEN Jirong

Abstract: To address the two main problems of instability and low efficiency that Sqoop exhibits when importing big tables, a new MapReduce-based programming model for big-table import was designed and implemented. The model's splitting algorithm for a big table is as follows: the table's total record count is divided by the number of mappers to obtain a step length, which yields the starting row and the range length (equal to the step) of the SQL query statement for each split, so that every mapper carries exactly the same import workload. The model's map approach is as follows: the key of the key-value pair passed to the map function is the SQL statement corresponding to a split, and the query itself is executed inside the map function, so each mapper in the model calls the map function only once. Comparison experiments show that two big tables with the same number of records take essentially the same time to import regardless of how their record ranges are distributed, and that importing the same table with different splitting fields also takes the same time; moreover, for the same big table, the model's import efficiency is significantly higher than that of Sqoop.
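To illustrate the splitting algorithm described above, the following Java sketch constructs one SQL query per mapper from a total record count and a mapper count. It is a minimal sketch, not the authors' implementation: the table name, the MySQL-style LIMIT paging syntax, and the handling of remainder rows in the last split are all assumptions added for illustration.

import java.util.ArrayList;
import java.util.List;

public class SplitQueryBuilder {
    // Build one SQL statement per mapper: the step is the total record count
    // divided by the mapper count, and each split covers [start, start + span).
    public static List<String> buildSplitQueries(long totalRecords, int numMappers, String table) {
        long step = totalRecords / numMappers;            // range length of every split
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < numMappers; i++) {
            long start = i * step;
            // Assumption: the last split absorbs any remainder rows so none are lost.
            long span = (i == numMappers - 1) ? (totalRecords - start) : step;
            // Assumption: MySQL-style "LIMIT offset, count" paging, for illustration only.
            queries.add("SELECT * FROM " + table + " LIMIT " + start + ", " + span);
        }
        return queries;
    }
}

Each of these statements would then become the key of the single record handed to one mapper, as described in the map approach above.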

Keywords: programming model, Hadoop, MapReduce, Hadoop Distributed File System (HDFS), Sqoop

Abstract: To solve the problems of instability and inefficiency when data from a relational database system are transferred into the Hadoop Distributed File System (HDFS) using Sqoop, the authors proposed and implemented a new programming model based on the MapReduce framework. The algorithm for splitting a big table in this model was as follows: first, a step was calculated by dividing the total number of rows by the number of mappers; then a SQL statement corresponding to each split was constructed from a starting row index and a span equal to that step, which guaranteed that every mapper task issued an identical SQL workload. In the map phase, each mapper called the map function only once, with a single key-value pair in which the key was the SQL statement corresponding to a split and the value was null. The comparison experiments show that, for two different big tables with the same number of records, the importing times were approximately identical regardless of the record distribution, and when two different splitting fields were used on one big table, the importing times were also the same. Meanwhile, when the two approaches were applied to the same big table, the importing efficiency of the proposed model was significantly higher than that of Sqoop.
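The map-once pattern described in the abstract can be sketched as the Hadoop mapper below. This is only an illustration under stated assumptions: a custom InputFormat (not shown) is assumed to deliver exactly one (SQL statement, null) record per split, and the JDBC URL, credentials, and tab-separated output format are hypothetical placeholders rather than the authors' code.

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SqlImportMapper extends Mapper<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void map(Text key, NullWritable value, Context context)
            throws IOException, InterruptedException {
        String sql = key.toString();   // the whole split is one SQL statement
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://dbhost:3306/testdb", "user", "password"); // placeholder connection
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            ResultSetMetaData meta = rs.getMetaData();
            int cols = meta.getColumnCount();
            StringBuilder line = new StringBuilder();
            while (rs.next()) {
                line.setLength(0);
                for (int c = 1; c <= cols; c++) {
                    if (c > 1) line.append('\t');
                    line.append(rs.getString(c));
                }
                // Each database row becomes one line of the HDFS output file.
                context.write(new Text(line.toString()), NullWritable.get());
            }
        } catch (SQLException e) {
            throw new IOException("Query failed for split: " + sql, e);
        }
    }
}

Because the map function is entered only once per mapper, each task issues exactly one query whose range was fixed by the split construction, which is what keeps the per-mapper workload identical.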

Key words: programming model, Hadoop, MapReduce, Hadoop Distributed File System (HDFS), Sqoop

CLC number: