Journal of Computer Applications ›› 2013, Vol. 33 ›› Issue (10): 2792-2795.

• Advanced Computing •

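Nutch crawling optimization from view of Hadoop

ZHOU Shilong, CHEN Xingshu, LUO Yonggang

  1. College of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China
  • Received: 2013-03-25 Revised: 2013-05-03 Published: 2013-10-01 Online: 2013-11-01
  • Corresponding author: CHEN Xingshu
  • About the authors: ZHOU Shilong (born 1988), male, from Pingding, Shanxi, M.S. candidate, CCF member; research interests: cloud computing security, large-scale data processing in cloud computing. CHEN Xingshu (born 1968), female, from Chengdu, Sichuan, Ph.D., professor and doctoral supervisor; research interests: information security, computer networks, cloud computing. LUO Yonggang (born 1980), male, from Qiannan, Guizhou, Ph.D. candidate; research interests: information security, cloud computing security.
  • Supported by:
    Support Program project of the Ministry of Science and Technology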

Abstract: Nutch crawling performance was optimized by tuning the configuration parameters of Nutch MapReduce jobs. First, the Nutch crawling process was examined from the Hadoop perspective, and on that basis the workflow characteristics of Nutch MapReduce jobs were analyzed in detail. Next, the MapReduce jobs were profiled continuously during crawling to generate tuned parameters, which were applied to the next job of the same type. Finally, an appropriate profiling interval was selected to balance cluster environmental error against profiling load, which improved the optimization result. Experiments show that Nutch crawling performance improved by 5% to 14%, and that a profiling interval of 5 gave the best optimization result.

Key words: Nutch, Hadoop, MapReduce, workflow, performance optimization
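
The feedback loop described in the abstract (profile a job, derive tuned parameters, apply them to the next job of the same type, and profile only at a chosen interval) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class `IntervalTuner`, the hook `profile_and_tune`, the toy tuning rule, and the reading of the interval as "profile every k-th run of each job type" are all assumptions made for illustration; `mapred.reduce.tasks` stands in for whichever job parameters are actually tuned.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Config = Dict[str, int]


@dataclass
class IntervalTuner:
    """Hypothetical tuner: profiles every `interval`-th run of each job type
    and reuses the most recent tuned configuration in between."""
    interval: int
    profile_and_tune: Callable[[str, Config], Config]  # assumed profiler hook
    default_config: Config
    _configs: Dict[str, Config] = field(default_factory=dict)
    _runs: Dict[str, int] = field(default_factory=dict)

    def config_for(self, job_type: str) -> Config:
        # Latest tuned configuration for this job type, else the defaults.
        return dict(self._configs.get(job_type, self.default_config))

    def run(self, job_type: str, execute: Callable[[str, Config], None]) -> bool:
        """Run one job; return True if this run was profiled."""
        count = self._runs.get(job_type, 0)
        config = self.config_for(job_type)
        profiled = count % self.interval == 0
        execute(job_type, config)
        if profiled:
            # Tuned parameters take effect on the next job of the same type.
            self._configs[job_type] = self.profile_and_tune(job_type, config)
        self._runs[job_type] = count + 1
        return profiled


def toy_tune(job_type: str, config: Config) -> Config:
    # Illustrative rule only: add one reduce task per profiled run, up to 8.
    tuned = dict(config)
    tuned["mapred.reduce.tasks"] = min(8, config["mapred.reduce.tasks"] + 1)
    return tuned


tuner = IntervalTuner(interval=2, profile_and_tune=toy_tune,
                      default_config={"mapred.reduce.tasks": 1})
used: List[int] = []
profiled_flags = [
    tuner.run("fetch", lambda t, c: used.append(c["mapred.reduce.tasks"]))
    for _ in range(4)
]
print(used)            # [1, 2, 2, 3]
print(profiled_flags)  # [True, False, True, False]
```

In this sketch a larger interval means fewer profiled runs and therefore less profiling load, but tuned values are refreshed less often; that is the trade-off the abstract refers to when it selects the interval.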

CLC number: