Journal of Computer Applications


Parallel Web crawler system with increment update

Wen-jie XU, Qing-kui CHEN

  • Received: 2008-10-07 Revised: 2008-12-09 Online: 2009-04-01 Published: 2009-04-01
  • Contact: Wen-jie XU

  1. School of Computer Engineering, University of Shanghai for Science and Technology

Abstract: This paper describes the architecture of a parallel Web crawler system. An incremental crawling strategy is applied to make updating massive amounts of Web data more efficient. Because the crawlers in a cluster differ in capability, a cosine vector parallel crawling model is introduced to match crawling tasks to crawler capabilities so that every crawler in the cluster is fully used. The crawling task vector and the crawler vector are defined, and the corresponding parallel crawling algorithms are designed on that basis. The results show that the system adapts well in task distribution and incrementally improves the "freshness" of the Web repository.

Key words: Web data mining, parallel crawler, incremental update strategy, cosine vector
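
The matching idea behind the cosine vector model can be illustrated with a minimal sketch. The abstract does not give the paper's vector definitions or algorithms, so everything here is an assumption for illustration: the three vector components (bandwidth, CPU, storage), the function names `cosine` and `assign_tasks`, and the greedy rule of sending each task to the most similar crawler.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction; treat as no match
    return dot / (norm_u * norm_v)


def assign_tasks(tasks, crawlers):
    """Map each task id to the crawler id whose vector is most similar.

    tasks and crawlers are dicts of id -> vector. The components are
    hypothetical (bandwidth, CPU, storage), not the paper's definition.
    """
    assignment = {}
    for task_id, task_vec in tasks.items():
        best = max(crawlers, key=lambda cid: cosine(task_vec, crawlers[cid]))
        assignment[task_id] = best
    return assignment


# Hypothetical capability and demand vectors: (bandwidth, CPU, storage)
crawlers = {"c1": [8, 2, 1], "c2": [1, 2, 8]}
tasks = {"t1": [9, 1, 1], "t2": [1, 1, 9]}
print(assign_tasks(tasks, crawlers))
```

Cosine similarity compares the direction of two vectors rather than their magnitude, so a task is matched to the crawler whose capability profile has the most similar shape to the task's demands. A real scheduler would also need to account for the current load on each crawler, which this sketch ignores.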

