Journal of Computer Applications ›› 2013, Vol. 33 ›› Issue (10): 2792-2795.
• Advanced computing •
ZHOU Shilong,CHEN Xingshu,LUO Yonggang
Abstract: Nutch crawling performance was optimized by tuning the configuration parameters of Nutch MapReduce jobs. First, the Nutch crawling process was examined from the viewpoint of Hadoop, and on that basis the workflow characteristics of Nutch MapReduce jobs were analyzed in detail. Next, the MapReduce jobs were profiled continuously during crawling to generate tuned configurations, which were applied to the next job of the same type. An appropriate profiling interval was selected to balance cluster environmental error against profiling load, which improved the optimization result. Experimental results show that Nutch crawling performance improved by 5% to 14% over the original configuration, and that a profiling interval of 5 gave the best optimization result.
Key words: Nutch, Hadoop, MapReduce, workflow, performance optimization
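The feedback loop summarized in the abstract (profile a job, derive a tuned configuration, apply it to the next job of the same type, and repeat at a fixed interval) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `IntervalTuner`, the `fake_profile` hook, and the `sort_buffer_mb` key are hypothetical names introduced only for illustration, and a real profiler would derive parameters from measured job statistics.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

Config = Dict[str, int]


@dataclass
class IntervalTuner:
    """Profile every `interval`-th run of each job type and reuse the result."""
    interval: int
    profile: Callable[[str, Config], Config]  # hypothetical profiler hook
    default: Config
    tuned: Dict[str, Config] = field(default_factory=dict)
    runs: Dict[str, int] = field(default_factory=dict)

    def next_config(self, job_type: str) -> Config:
        """Return the configuration the next job of this type would use."""
        return dict(self.tuned.get(job_type, self.default))

    def run(self, job_type: str) -> Tuple[Config, bool]:
        """Simulate one job run; return the config used and whether it was profiled."""
        config = self.next_config(job_type)
        count = self.runs.get(job_type, 0)
        profiled = count % self.interval == 0
        self.runs[job_type] = count + 1
        if profiled:
            # The tuned result only affects later runs of the same job type.
            self.tuned[job_type] = self.profile(job_type, config)
        return config, profiled


def fake_profile(job_type: str, config: Config) -> Config:
    """Stand-in profiler (assumption): pretend a larger buffer is always better."""
    return {**config, "sort_buffer_mb": config["sort_buffer_mb"] + 10}
```

With `interval=5`, the first and sixth runs of a job type are profiled, and the four runs in between reuse the most recent tuned configuration without paying the profiling cost. A larger interval lowers profiling load but lets the tuned values drift further from the current cluster state, which is the trade-off the abstract describes.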
CLC Number: TP393.4
ZHOU Shilong, CHEN Xingshu, LUO Yonggang. Nutch crawling optimization from view of Hadoop[J]. Journal of Computer Applications, 2013, 33(10): 2792-2795.
URL: https://www.joca.cn/EN/
https://www.joca.cn/EN/Y2013/V33/I10/2792