Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (9): 2465-2471. DOI: 10.11772/j.issn.1001-9081.2016.09.2465

• Big Data •

Survey on big data storage framework and algorithm

YANG Junjie1,2, LIAO Zhuofan1,2, FENG Chaochao3

  1. School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha, Hunan 410114, China;
  2. Hunan Provincial Key Laboratory of Intelligent Processing of Big Data on Transportation (Changsha University of Science and Technology), Changsha, Hunan 410114, China;
  3. School of Computer, National University of Defense Technology, Changsha, Hunan 410073, China
  • Received: 2016-03-01 Revised: 2016-05-01 Online: 2016-09-10 Published: 2016-09-08
  • Corresponding author: LIAO Zhuofan
  • About the authors: YANG Junjie (born 1992), male, from Luoyang, Henan, M.S. candidate; main research interest: optimized big data placement. LIAO Zhuofan (born 1981), female, from Xiangtan, Hunan, lecturer, Ph.D., CCF member; main research interests: mobile Internet, optimized big data placement, wireless sensor networks. FENG Chaochao (born 1982), male, from Kunming, Yunnan, assistant research fellow, Ph.D., CCF member; main research interest: high-performance microprocessor design.
  • Supported by:
    National Natural Science Foundation of China (61402056, 61303066); Scientific Research Fund of Hunan Provincial Education Department (14C0030).

Abstract: As demand for big data computing grows, cluster processing speed must improve rapidly, yet the processing performance of existing big data frameworks is increasingly unable to keep up with that demand. Because cluster storage is distributed, the placement of data has become one of the factors affecting cluster processing performance. First, the structure of current distributed file storage systems is introduced. Then, recent research at home and abroad on big data storage algorithms is summarized according to optimization goals such as reducing network load, load balancing, lowering energy consumption, and high fault tolerance, and the advantages and remaining problems of existing algorithms are analyzed and compared. Finally, the challenges and future research directions in designing big data storage architectures and optimization algorithms are discussed.

Key words: big data, data placement, distributed file system, MapReduce, Hadoop
