Distributed deduplication storage system based on Hadoop platform

doi:10.11772/j.issn.1001-9081.2016.02.0330

Journal of Computer Applications, 2016, Vol. 36, Issue (2): 330-335. DOI: 10.11772/j.issn.1001-9081.2016.02.0330


LIU Qing, FU Yinjin, NI Guiqiang, MEI Jianmin

College of Command Information System, PLA University of Science and Technology, Nanjing Jiangsu 210007, China

Received: 2015-09-15; Revised: 2015-09-22; Online: 2016-02-03; Published: 2016-02-10


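Corresponding author: FU Yinjin (born 1984), male, from Xiangxiang, Hunan; lecturer, Ph.D., CCF member. Research interests: distributed storage systems, big data management.
About the authors: LIU Qing (born 1990), female, from Handan, Hebei; M.S. candidate, CCF member. Research interests: network storage, data disaster recovery. NI Guiqiang (born 1966), male, from Huzhou, Zhejiang; professor, doctoral supervisor, Ph.D. Research interests: network management, network storage. MEI Jianmin (born 1990), male, from Xiantao, Hubei; M.S. candidate, CCF member. Research interests: solid-state storage, big data management.
Supported by: the National High Technology Research and Development Program (863 Program) of China (2012AA01A509, 2012AA01A510); the National Natural Science Foundation of China (61402518).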

Abstract

Data centers hold large amounts of redundant data, and backup data in particular wastes a great deal of storage space. To address this, a deduplication prototype based on the Hadoop platform is proposed. Deduplication detects and eliminates redundant data within a given data set, which can greatly reduce the storage capacity required and improve storage space utilization. Using two big data management tools, the Hadoop Distributed File System (HDFS) and the non-relational database HBase, a scalable distributed deduplication storage system was designed and implemented. In this system, the MapReduce parallel programming framework performs deduplication in parallel, HDFS stores the data that remains after deduplication, and an index table in HBase supports efficient chunk fingerprint lookup. The system was tested with virtual machine image file sets. The results show that the Hadoop-based distributed deduplication system achieves high throughput and good scalability while maintaining a high deduplication rate.

Key words: deduplication, distributed storage, Hadoop, HBase, Hadoop Distributed File System (HDFS)
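
The division of labor described in the abstract (a fingerprint index consulted before any chunk is stored) can be illustrated with a minimal single-process sketch. Everything here is an assumption made for illustration: the abstract does not state the chunking method, chunk size, or hash function, so the sketch uses fixed-size 4 KB chunks and SHA-1, a plain dictionary stands in for the HBase index table, and a list stands in for chunk data kept in HDFS. The class and method names (`DedupStore`, `put`, `get`) are hypothetical and are not taken from the paper.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; the abstract does not state one


def fingerprint(chunk: bytes) -> str:
    """Hex SHA-1 digest used as the chunk fingerprint (an assumed choice)."""
    return hashlib.sha1(chunk).hexdigest()


class DedupStore:
    """Toy in-memory stand-in for the storage path outlined in the abstract."""

    def __init__(self, chunk_size: int = CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.index = {}    # fingerprint -> chunk position; stands in for the HBase index table
        self.chunks = []   # unique chunk payloads; stands in for chunk data in HDFS
        self.recipes = {}  # file name -> ordered list of fingerprints
        self.logical_bytes = 0

    def put(self, name: str, data: bytes) -> None:
        """Split data into chunks and store only chunks with unseen fingerprints."""
        recipe = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            fp = fingerprint(chunk)
            if fp not in self.index:  # index lookup decides whether to store
                self.index[fp] = len(self.chunks)
                self.chunks.append(chunk)
            recipe.append(fp)
        self.recipes[name] = recipe
        self.logical_bytes += len(data)

    def get(self, name: str) -> bytes:
        """Rebuild a file from its recipe of fingerprints."""
        return b"".join(self.chunks[self.index[fp]] for fp in self.recipes[name])

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.chunks)

    def dedup_ratio(self) -> float:
        """Logical bytes written divided by physical bytes kept."""
        stored = self.stored_bytes()
        return self.logical_bytes / stored if stored else 1.0


if __name__ == "__main__":
    # Two synthetic "images" that differ only in their last chunk.
    base = b"".join(bytes([i]) * CHUNK_SIZE for i in range(8))
    image_a = base
    image_b = base[:-CHUNK_SIZE] + b"\xff" * CHUNK_SIZE

    store = DedupStore()
    store.put("a.img", image_a)
    store.put("b.img", image_b)
    print("unique chunks:", len(store.chunks))
    print("dedup ratio:", round(store.dedup_ratio(), 2))
```

In a distributed setting, the fingerprinting loop is the part that would be spread across parallel tasks, while the index lookup becomes a query against a shared table. The sketch does not model that parallelism or any of the Hadoop APIs.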


CLC Number: TP309.3

LIU Qing, FU Yinjin, NI Guiqiang, MEI Jianmin. Distributed deduplication storage system based on Hadoop platform[J]. Journal of Computer Applications, 2016, 36(2): 330-335.

