Journal of Computer Applications, 2011, Vol. 31, Issue 12: 3264-3267.

• Database Technology •

Wikipedia-based focused crawling with page segmentation

XIONG Zhong-yang (熊忠阳), SHI Yan (史艳), ZHANG Yu-fang (张玉芳)

  1. College of Computer Science, Chongqing University, Chongqing 400044, China
  • Received: 2011-06-20 Revised: 2011-08-11 Online: 2011-12-12 Published: 2011-12-01
  • Contact: SHI Yan
  • Supported by:
    Graduate Science and Technology Innovation Fund of the Central Universities (individual project)

Abstract: To address the shortcomings and limitations of traditional focused crawling strategies, a focused crawling strategy based on Wikipedia and page segmentation is proposed. A topic vector is built from Wikipedia's category tree and topic description documents and is used to describe the topic. After a web page is downloaded, page segmentation is applied to filter out noise links. When the priority of candidate links is computed, block relevance is taken into account to compensate for the limited information carried by anchor text. The size of the topic vector space is varied to test how the level of detail of the topic description affects crawling performance. Experimental results show that the strategy is effective and scalable and that, within certain limits, the more detailed the topic description, the more relevant the collected web pages are to the topic.

Key words: focused crawling, Wikipedia, topic description, page segmentation, relevance computation
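
The abstract does not give the scoring formula, so the following is only a minimal sketch of how block relevance might be combined with anchor-text relevance when ranking candidate links. Everything specific here is an assumption made for illustration: the cosine similarity over whitespace-split term counts, the linear mix controlled by the hypothetical `anchor_weight` parameter, the hypothetical `min_block_relevance` noise threshold, and the example URLs. The topic vector is supplied as a plain dictionary of term weights; in the paper it is derived from Wikipedia's category tree and topic description documents.

```python
import math
from collections import Counter


def term_vector(text):
    """Term-frequency vector from lowercased, whitespace-split text (assumed tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_links(topic, blocks, anchor_weight=0.5, min_block_relevance=0.0):
    """Rank candidate links found in the segmented blocks of one page.

    topic  -- mapping term -> weight (the topic vector)
    blocks -- list of (block_text, [(url, anchor_text), ...])

    Links in blocks whose relevance is at or below min_block_relevance are
    treated as noise and dropped. The linear combination is an assumption,
    not the formula used in the paper.
    """
    ranked = []
    for block_text, links in blocks:
        block_rel = cosine(topic, term_vector(block_text))
        if block_rel <= min_block_relevance:
            continue  # noise block: skip all of its links
        for url, anchor_text in links:
            anchor_rel = cosine(topic, term_vector(anchor_text))
            priority = anchor_weight * anchor_rel + (1.0 - anchor_weight) * block_rel
            ranked.append((url, priority))
    # Highest priority first, as a crawl frontier would consume them.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


if __name__ == "__main__":
    topic = {"crawler": 1.0, "web": 1.0}
    blocks = [
        ("web crawler design notes", [("http://a.example/1", "read more")]),
        ("buy cheap shoes", [("http://b.example/ad", "web crawler")]),
    ]
    for url, priority in rank_links(topic, blocks):
        print(url, round(priority, 4))
```

In this toy input the first link has an uninformative anchor ("read more") but still receives a nonzero priority from its surrounding block, while the second link is discarded despite a topical anchor because its block is unrelated to the topic. That mirrors the two roles the abstract assigns to page segmentation: filtering noise links and supplementing sparse anchor text.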