Journal of Computer Applications ›› 2013, Vol. 33 ›› Issue (09): 2486-2489. DOI: 10.11772/j.issn.1001-9081.2013.09.2486

• Database Technology •

MapReduce-based programming model for importing big tables into Hadoop

CHEN Jirong, LE Jiajin

  1. School of Computer Science and Technology, Donghua University, Shanghai 201620, China
  • Received: 2013-03-18  Revised: 2013-04-20  Published: 2013-09-01  Online: 2013-10-18
  • Corresponding author: CHEN Jirong
  • About the authors: CHEN Jirong (1971-), male, born in Shucheng, Anhui, lecturer, Ph. D., research interests: big data platforms in the Hadoop ecosystem;
    LE Jiajin (1951-), male, born in Shanghai, professor, Ph. D. supervisor, research interests: data engineering.
  • Supported by:

    National Science and Technology Major Project of Core Electronic Devices, High-end Generic Chips and Basic Software

Programming model based on MapReduce for importing big tables into HDFS

CHEN Jirong, LE Jiajin

  1. School of Computer Science and Technology, Donghua University, Shanghai 201620, China
  • Received: 2013-03-18  Revised: 2013-04-20  Online: 2013-10-18  Published: 2013-09-01
  • Contact: CHEN Jirong

Abstract: To address the two main problems of instability and low efficiency that Sqoop exhibits when importing big tables, a new MapReduce-based programming model for big-table import was designed and implemented. The model's splitting algorithm for a big table is as follows: the table's total record count is divided by the number of mappers to obtain a step length, which yields the starting row and the range length (equal to the step) of the SQL query statement for each split, so that every mapper carries exactly the same import workload. The model's map approach is as follows: the key of the key-value pair passed to the map function is the SQL statement corresponding to a split, and the query itself is executed inside the map function, so each mapper in the model calls the map function only once. Comparison experiments show that two big tables with the same number of records take essentially the same time to import regardless of how their record ranges are distributed, and that importing the same table with different splitting fields also takes the same time; moreover, for the same big table, the model's import efficiency is significantly higher than that of Sqoop.
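To illustrate the splitting algorithm described above, the following Java sketch constructs one SQL query per mapper from a total record count and a mapper count. It is a minimal sketch, not the authors' implementation: the table name, the MySQL-style LIMIT paging syntax, and the handling of remainder rows in the last split are all assumptions added for illustration.

import java.util.ArrayList;
import java.util.List;

public class SplitQueryBuilder {
    // Build one SQL statement per mapper: the step is the total record count
    // divided by the mapper count, and each split covers [start, start + span).
    public static List<String> buildSplitQueries(long totalRecords, int numMappers, String table) {
        long step = totalRecords / numMappers;            // range length of every split
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < numMappers; i++) {
            long start = i * step;
            // Assumption: the last split absorbs any remainder rows so none are lost.
            long span = (i == numMappers - 1) ? (totalRecords - start) : step;
            // Assumption: MySQL-style "LIMIT offset, count" paging, for illustration only.
            queries.add("SELECT * FROM " + table + " LIMIT " + start + ", " + span);
        }
        return queries;
    }
}

Each of these statements would then become the key of the single record handed to one mapper, as described in the map approach above.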

Keywords: programming model, Hadoop, MapReduce, Hadoop Distributed File System (HDFS), Sqoop

Abstract: To solve the problems of instability and inefficiency when data from a relational database system are transferred into the Hadoop Distributed File System (HDFS) using Sqoop, the authors proposed and implemented a new programming model based on the MapReduce framework. The algorithm for splitting a big table in this model was as follows: first, a step was calculated by dividing the total number of rows by the number of mappers; then a SQL statement corresponding to each split was constructed from a starting row index and a span equal to that step, which guaranteed that every mapper task issued an identical SQL workload. In the map phase, each mapper called the map function only once, with a single key-value pair in which the key was the SQL statement corresponding to a split and the value was null. The comparison experiments show that, for two different big tables with the same number of records, the importing times were approximately identical regardless of the record distribution, and when two different splitting fields were used on one big table, the importing times were also the same. Meanwhile, when the two approaches were applied to the same big table, the importing efficiency of the proposed model was significantly higher than that of Sqoop.
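The map-once pattern described in the abstract can be sketched as the Hadoop mapper below. This is only an illustration under stated assumptions: a custom InputFormat (not shown) is assumed to deliver exactly one (SQL statement, null) record per split, and the JDBC URL, credentials, and tab-separated output format are hypothetical placeholders rather than the authors' code.

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SqlImportMapper extends Mapper<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void map(Text key, NullWritable value, Context context)
            throws IOException, InterruptedException {
        String sql = key.toString();   // the whole split is one SQL statement
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://dbhost:3306/testdb", "user", "password"); // placeholder connection
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            ResultSetMetaData meta = rs.getMetaData();
            int cols = meta.getColumnCount();
            StringBuilder line = new StringBuilder();
            while (rs.next()) {
                line.setLength(0);
                for (int c = 1; c <= cols; c++) {
                    if (c > 1) line.append('\t');
                    line.append(rs.getString(c));
                }
                // Each database row becomes one line of the HDFS output file.
                context.write(new Text(line.toString()), NullWritable.get());
            }
        } catch (SQLException e) {
            throw new IOException("Query failed for split: " + sql, e);
        }
    }
}

Because the map function is entered only once per mapper, each task issues exactly one query whose range was fixed by the split construction, which is what keeps the per-mapper workload identical.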

Key words: programming model, Hadoop, MapReduce, Hadoop Distributed File System (HDFS), Sqoop

CLC number: