Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (2): 330-335. DOI: 10.11772/j.issn.1001-9081.2016.02.0330

• The 3rd CCF Conference on Big Data (CCF BigData 2015) •

Distributed deduplication storage system based on Hadoop platform

LIU Qing, FU Yinjin, NI Guiqiang, MEI Jianmin

  1. College of Command Information System, PLA University of Science and Technology, Nanjing Jiangsu 210007, China
  • Received: 2015-09-15 Revised: 2015-09-22 Online: 2016-02-10 Published: 2016-02-03
  • Corresponding author: FU Yinjin (born 1984), male, from Xiangxiang, Hunan; lecturer, Ph.D., CCF member; research interests: distributed storage systems, big data management.
  • About the authors: LIU Qing (born 1990), female, from Handan, Hebei; M.S. candidate, CCF member; research interests: network storage, data disaster recovery. NI Guiqiang (born 1966), male, from Huzhou, Zhejiang; professor, Ph.D. supervisor, Ph.D.; research interests: network management, network storage. MEI Jianmin (born 1990), male, from Xiantao, Hubei; M.S. candidate, CCF member; research interests: solid-state storage, big data management.
  • Supported by:
    National High Technology Research and Development Program (863 Program) of China (2012AA01A509, 2012AA01A510); National Natural Science Foundation of China (61402518).

Abstract: To address the large amount of redundant data in data centers, and in particular the storage space wasted by backup data, a distributed deduplication solution based on the Hadoop platform was proposed. Deduplication detects and eliminates redundant data within a given data set, which can greatly reduce storage consumption and improve storage space utilization. Using two data management tools of the Hadoop platform, the Hadoop Distributed File System (HDFS) and the non-relational database HBase, a scalable distributed deduplication storage system was designed and implemented. In this system, the MapReduce parallel programming framework performs distributed parallel deduplication, HDFS stores the data remaining after deduplication, and an index table built in HBase enables efficient chunk fingerprint lookup. The system was tested with virtual machine image file sets; the results demonstrate that the Hadoop-based distributed deduplication system achieves high throughput and good scalability while guaranteeing a high deduplication ratio.
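The abstract outlines a clear division of labor: MapReduce parallelizes deduplication, HDFS stores the unique chunk data, and HBase holds the chunk fingerprint index. As a rough illustration of the HBase-backed index lookup, the following is a minimal Java sketch; the table name chunk_index, column family cf, column hdfs_path, and the use of SHA-1 fingerprints are assumptions for illustration only, since the abstract does not specify the system's schema, chunking strategy, or hash function.

```java
import java.security.MessageDigest;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ChunkIndex {
    private static final TableName TABLE = TableName.valueOf("chunk_index"); // hypothetical table name
    private static final byte[] CF  = Bytes.toBytes("cf");                   // hypothetical column family
    private static final byte[] COL = Bytes.toBytes("hdfs_path");            // hypothetical column

    // Compute a SHA-1 fingerprint for one data chunk (hash choice is an assumption).
    static byte[] fingerprint(byte[] chunk) throws Exception {
        return MessageDigest.getInstance("SHA-1").digest(chunk);
    }

    // Returns true if the chunk is new and was indexed; false if it is a duplicate.
    static boolean indexChunk(Table table, byte[] chunk, String storedPath) throws Exception {
        byte[] fp = fingerprint(chunk);
        if (table.exists(new Get(fp))) {
            return false;                // duplicate: only a reference needs to be recorded
        }
        Put put = new Put(fp);           // row key = chunk fingerprint
        put.addColumn(CF, COL, Bytes.toBytes(storedPath));
        table.put(put);                  // new chunk: record where its data lives in HDFS
        return true;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TABLE)) {
            byte[] chunk = Bytes.toBytes("example chunk payload");
            System.out.println(indexChunk(table, chunk, "/dedup/chunks/000001")
                    ? "new chunk stored" : "duplicate chunk skipped");
        }
    }
}
```

Keying the HBase table by fingerprint means duplicate detection is a single row existence check, and the index is automatically partitioned across region servers, which is consistent with the scalability the abstract claims; inside a MapReduce job, the same lookup would run in each map or reduce task in parallel.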

Key words: deduplication, distributed storage, Hadoop, HBase, Hadoop Distributed File System (HDFS)
