Journal of Computer Applications, 2016, Vol. 36, Issue (2): 330-335. DOI: 10.11772/j.issn.1001-9081.2016.02.0330


Distributed deduplication storage system based on Hadoop platform

LIU Qing, FU Yinjin, NI Guiqiang, MEI Jianmin   

  1. College of Command Information System, PLA University of Science and Technology, Nanjing, Jiangsu 210007, China
  • Received: 2015-09-15; Revised: 2015-09-22; Online: 2016-02-10; Published: 2016-02-03

  • Corresponding author: FU Yinjin (born 1984), male, from Xiangxiang, Hunan; lecturer, Ph.D., CCF member; research interests: distributed storage systems, big data management.
  • About the authors: LIU Qing (born 1990), female, from Handan, Hebei; M.S. candidate, CCF member; research interests: network storage, data disaster recovery. NI Guiqiang (born 1966), male, from Huzhou, Zhejiang; professor, Ph.D. supervisor, Ph.D.; research interests: network management, network storage. MEI Jianmin (born 1990), male, from Xiantao, Hubei; M.S. candidate, CCF member; research interests: solid-state storage, big data management.
  • Supported by:
    The National High Technology Research and Development Program (863 Program) of China (2012AA01A509, 2012AA01A510); the National Natural Science Foundation of China (61402518).

Abstract: To address the extensive data redundancy in data centers, where backup data in particular wastes a large amount of storage space, a deduplication prototype based on the Hadoop platform was proposed. Deduplication, which detects and eliminates redundant data within a given data set, can greatly reduce storage consumption and improve storage space utilization. Using two big data management tools, the Hadoop Distributed File System (HDFS) and the non-relational database HBase, a scalable distributed deduplication storage system was designed and implemented. In this system, the MapReduce parallel programming framework performs deduplication in parallel, HDFS stores the deduplicated data, and a chunk fingerprint index table is maintained in HBase for efficient fingerprint lookup. The system was tested on a set of virtual machine image files. The results demonstrate that the Hadoop-based distributed deduplication system achieves high throughput and good scalability while maintaining a high deduplication ratio.

Key words: deduplication, distributed storage, Hadoop, HBase, Hadoop Distributed File System (HDFS)
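To make the pipeline described in the abstract concrete, the sketch below shows what one stage of such a system could look like: a MapReduce mapper that fingerprints chunks, uses an HBase table as the distributed fingerprint index, and stores only previously unseen chunks in HDFS. This is a minimal illustration of the general technique, not the authors' actual implementation; the class name DedupMapper, the table name fp_index, the column family "m", the chunk-store path /dedup/chunks, the choice of SHA-1, and the assumed chunking InputFormat are all hypothetical.

import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Hypothetical mapper for a Hadoop-based dedup pipeline. Each input record is
 * assumed to be one chunk of a file (a custom chunking InputFormat, not shown,
 * would produce these). Emits (chunk offset, fingerprint) pairs so a reducer
 * could assemble the file recipe that maps offsets to stored chunks.
 */
public class DedupMapper extends Mapper<LongWritable, BytesWritable, Text, Text> {

    private static final byte[] FAMILY = Bytes.toBytes("m");       // assumed column family
    private static final byte[] QUALIFIER = Bytes.toBytes("path");

    private Connection connection;
    private Table index;      // HBase fingerprint index, row key = SHA-1 digest
    private FileSystem fs;    // HDFS, where unique chunks are stored

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = HBaseConfiguration.create(context.getConfiguration());
        connection = ConnectionFactory.createConnection(conf);
        index = connection.getTable(TableName.valueOf("fp_index")); // assumed table name
        fs = FileSystem.get(context.getConfiguration());
    }

    @Override
    protected void map(LongWritable offset, BytesWritable chunk, Context context)
            throws IOException, InterruptedException {
        byte[] data = Arrays.copyOf(chunk.getBytes(), chunk.getLength());

        // 1. Fingerprint the chunk with SHA-1, a common choice in dedup systems.
        byte[] digest;
        try {
            digest = MessageDigest.getInstance("SHA-1").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IOException(e);
        }
        String fp = Bytes.toHex(digest);
        Path chunkPath = new Path("/dedup/chunks/" + fp);  // assumed chunk-store layout

        // 2. Atomically claim the fingerprint in the index: the put is applied
        //    only if no cell exists yet, so exactly one mapper wins per chunk
        //    even when several mappers see the same chunk concurrently.
        Put claim = new Put(digest);
        claim.addColumn(FAMILY, QUALIFIER, Bytes.toBytes(chunkPath.toString()));
        boolean isNew = index.checkAndPut(digest, FAMILY, QUALIFIER, null, claim);

        // 3. Only the winner writes the chunk payload to HDFS.
        if (isNew) {
            try (FSDataOutputStream out = fs.create(chunkPath, true)) {
                out.write(data);
            }
        }

        // 4. Duplicate or not, record which fingerprint this offset maps to.
        context.write(new Text(offset.toString()), new Text(fp));
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        index.close();
        connection.close();
    }
}

Note that this sketch keeps only the happy path: because the index entry is published before the chunk payload is written, a mapper crash between steps 2 and 3 could leave a dangling index entry, so a production design would write the chunk first and publish the index entry last, or add a repair pass.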
