Journal of Computer Applications
Wen-jie XU, Qing-kui CHEN
徐文杰 陈庆奎
Abstract: This paper presents the architecture of a parallel Web crawler system. An incremental crawling strategy is adopted to make the updating of massive amounts of Web data more efficient. Because the crawlers in the cluster differ in capability, a cosine vector parallel crawling model is introduced to match crawling tasks to crawler capabilities, so that every crawler in the cluster is fully used. Crawling task vectors and crawler vectors are defined, and parallel crawling algorithms are designed on that basis. The results confirm that the system adapts its task distribution well and is effective at maintaining the "freshness" of the Web repository.
Key words: Web data mining, parallel crawler, incremental update strategy, cosine vector
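Abstract (translated from the Chinese): The overall structure of a parallel Web crawler system is introduced, together with an incremental-update crawling strategy. While improving the efficiency of updating massive Web data, the work takes into account that the crawlers in the cluster differ in capability. To make full use of that capability, a vector measurement technique is proposed that solves the problem of matching crawling tasks to crawler capabilities. Crawling task vectors and crawler vectors are defined, and the related parallel algorithms are given on that basis. Practice shows that the system has good allocation adaptability and can, on that basis, incrementally improve the freshness of the page repository.
Keywords (translated from the Chinese): Web data mining, parallel crawler, incremental update strategy, cosine vector method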
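The abstract does not give the definitions of the crawling task vector or the crawler vector, so the following is only a minimal sketch of how cosine similarity could match tasks to crawlers. The vector dimensions (bandwidth, CPU, storage), the names `cosine_similarity` and `assign_tasks`, and all sample values are assumptions made for illustration, not the authors' definitions or algorithm.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat as no match
    return dot / (norm_a * norm_b)


def assign_tasks(task_vectors, crawler_vectors):
    """Map each task name to the crawler whose vector is most similar."""
    assignment = {}
    for task, tvec in task_vectors.items():
        assignment[task] = max(
            crawler_vectors,
            key=lambda name: cosine_similarity(tvec, crawler_vectors[name]),
        )
    return assignment


# Hypothetical dimensions: (bandwidth, CPU, storage).
# Task vectors describe demand; crawler vectors describe capability.
tasks = {
    "news_site": [0.9, 0.2, 0.1],
    "archive_site": [0.1, 0.3, 0.9],
}
crawlers = {
    "crawler_a": [0.8, 0.3, 0.2],
    "crawler_b": [0.2, 0.2, 0.8],
}

assignment = assign_tasks(tasks, crawlers)
print(assignment)
```

This sketch assigns each task greedily to its best-matching crawler and does not balance load across the cluster. A real scheduler would presumably also track how much work each crawler already holds.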
CLC Number: TP391
Wen-jie XU, Qing-kui CHEN. Parallel Web crawler system with increment update[J]. Journal of Computer Applications.
徐文杰 陈庆奎. 增量更新并行Web爬虫系统[J]. 计算机应用.
URL: http://www.joca.cn/EN/Y2009/V29/I4/1117