Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (7): 1883-1887. DOI: 10.11772/j.issn.1001-9081.2017.07.1883


Analysis of factors affecting efficiency of data distributed parallel application in cloud environment

MA Shengjun, CHEN Wanghu, YU Maoyi, LI Jinrong, JIA Wenbo   

  1. College of Computer Science and Engineering, Northwest Normal University, Lanzhou Gansu 730070, China
  • Received: 2017-01-16 Revised: 2017-03-11 Online: 2017-07-10 Published: 2017-07-18
  • Supported by:
    This work is supported by the National Natural Science Foundation of China (61462076).


马生俊, 陈旺虎, 俞茂义, 李金溶, 郏文博   

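  • Corresponding author: MA Shengjun
  • About the authors: MA Shengjun, born in 1989 in Guanghe, Gansu, male, M.S. candidate, research interests: big data and cloud computing; CHEN Wanghu, born in 1973 in Jingning, Gansu, male, Ph.D., professor, CCF member, research interests: big data and cloud computing; YU Maoyi, born in 1991 in Tongling, Anhui, male, M.S. candidate, research interests: big data and cloud computing; LI Jinrong, born in 1989 in Feicheng, Shandong, female, M.S. candidate, research interests: big data and cloud computing; JIA Wenbo, born in 1992 in Fengxian, Jiangsu, male, M.S. candidate, research interests: big data and cloud computing.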

Abstract: Data distributed parallel applications such as MapReduce are widely used in cloud environments. To address the low execution efficiency and high cost of such applications, a case study of Hadoop is presented. Firstly, an analysis of how such applications execute showed that the data volume, the number of nodes, and the number of tasks are the main factors affecting their execution efficiency. Secondly, the impact of each of these factors on execution efficiency was explored. Finally, two rules were derived from a set of experiments. For a given volume of data, merely increasing the number of nodes does not markedly improve the execution efficiency of a data distributed parallel application; instead, it greatly increases the execution cost. When the number of tasks is close to the number of nodes, however, such an application achieves higher efficiency at lower cost. These conclusions can help users optimize their data distributed parallel applications and estimate the computing resources they need to rent in a cloud environment.

Key words: cloud environment, data distributed parallel application, MapReduce, efficiency, cost
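
The relationship between the number of tasks, the number of nodes, and the rental cost can be illustrated with a deliberately simple sketch. It is not the model or the experimental setup used in the paper: the function name `estimate_run`, the default durations, and the assumptions that each node runs one task at a time and that every rented node is billed for the whole run are all hypothetical, chosen only to make the reasoning concrete.

```python
import math


def estimate_run(num_tasks, num_nodes, task_seconds=60.0, startup_seconds=10.0):
    """Toy wave model (an illustrative assumption, not the paper's model).

    Each node is assumed to run one task at a time, so the tasks execute
    in ceil(num_tasks / num_nodes) sequential waves.
    Returns (elapsed seconds, billed node-seconds).
    """
    if num_tasks < 1 or num_nodes < 1:
        raise ValueError("num_tasks and num_nodes must be positive")
    waves = math.ceil(num_tasks / num_nodes)
    elapsed = startup_seconds + waves * task_seconds
    # Every rented node is billed for the whole run, whether busy or idle.
    node_seconds = num_nodes * elapsed
    return elapsed, node_seconds


if __name__ == "__main__":
    # Fixed amount of work (8 tasks), varying cluster size.
    for nodes in (2, 4, 8, 16):
        elapsed, cost = estimate_run(num_tasks=8, num_nodes=nodes)
        print(f"nodes={nodes:2d}  elapsed={elapsed:6.1f} s  cost={cost:7.1f} node-seconds")
```

Under these assumptions, once the number of nodes reaches the number of tasks the elapsed time stops improving, while the billed node-seconds keep growing with every additional node. Real Hadoop jobs also involve shuffle, scheduling, and data-locality effects that this sketch ignores.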

