Journal of Computer Applications, 2014, Vol. 34, Issue (11): 3091-3095. DOI: 10.11772/j.issn.1001-9081.2014.11.3091

Optimization of small files storage and accessing on Hadoop distributed file system

LI Tie, YAN Cairong, HUANG Yongfeng, SONG Yalong

  1. School of Computer Science and Technology, Donghua University, Shanghai 201620, China
  • Received: 2014-07-18 Revised: 2014-07-30 Online: 2014-11-01 Published: 2014-12-01
  • Contact: YAN Cairong

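  • About the authors: LI Tie (李铁, born 1989), male, native of Yongzhou, Hunan, M.S. candidate; research interests: distributed storage and distributed computing. YAN Cairong (燕彩蓉, born 1978), female, native of Xiantao, Hubei, Ph.D., associate professor; research interests: parallel computing, distributed computing, and big data processing. HUANG Yongfeng (黄永锋, born 1971), male, native of Tai'an, Shandong, Ph.D., associate professor; research interests: data mining, machine learning, and image processing. SONG Yalong (宋亚龙, born 1988), male, native of Hebi, Henan, M.S. candidate; research interests: distributed storage and distributed computing.
  • Supported by: the National Natural Science Foundation of China; the Fundamental Research Funds for the Central Universities; the Natural Science Foundation of Shanghai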

Abstract:

To improve the efficiency of small-file processing in the Hadoop Distributed File System (HDFS), an approach named SmartFS is proposed. SmartFS analyzes the file access log to learn users' access behavior and builds a probability model of file associations. A merging algorithm based on these associations packs related small files into large files, which are then stored on HDFS. When a file is read from HDFS, a prefetching algorithm based on the same associations loads the related files in advance to speed up access, and a prefetch-based cache replacement algorithm manages the cache space to raise the cache hit rate. Experimental results show that SmartFS reduces the metadata space used by the NameNode in HDFS, reduces the number of interactions between users and HDFS, and improves both the storage efficiency and the access speed of small files.
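
The abstract does not give the exact form of the file-association probability model, so the following is only a minimal sketch of one plausible reading: estimate, from an access log split into sessions, how often file b is accessed immediately after file a, and treat files above a threshold as prefetch (or merge) candidates. The function names, the session format, the file names, and the 0.5 threshold are all assumptions made for illustration and are not taken from SmartFS.

```python
from collections import Counter, defaultdict


def association_probabilities(sessions):
    """Estimate P(next file = b | current file = a) from ordered access sessions."""
    follow_counts = defaultdict(Counter)  # a -> how often each file directly follows a
    access_counts = Counter()             # a -> number of times a had a successor
    for session in sessions:
        for current, following in zip(session, session[1:]):
            if current != following:      # ignore immediate re-reads of the same file
                follow_counts[current][following] += 1
                access_counts[current] += 1
    return {
        a: {b: n / access_counts[a] for b, n in followers.items()}
        for a, followers in follow_counts.items()
    }


def prefetch_candidates(probabilities, accessed, threshold=0.5):
    """Return files whose estimated association with `accessed` meets the threshold."""
    followers = probabilities.get(accessed, {})
    # Highest probability first; ties broken by file name for a stable order.
    ranked = sorted(followers.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, p in ranked if p >= threshold]


# Hypothetical access log, already split into per-user sessions.
sessions = [
    ["a.txt", "b.txt", "c.txt"],
    ["a.txt", "b.txt"],
    ["a.txt", "c.txt"],
]
probabilities = association_probabilities(sessions)
candidates = prefetch_candidates(probabilities, "a.txt")
```

In this toy log, b.txt follows a.txt in two of three sessions, so it is the only file that clears the assumed threshold for a.txt. The same association scores could also drive the choice of which small files to merge into one large file, but the merging and cache replacement policies are not specified in enough detail here to sketch.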
