Journal of Computer Applications ›› 2005, Vol. 25 ›› Issue (09): 1965-1969. DOI: 10.3724/SP.J.1087.2005.01965

• Web and Databases •

Survey on the research of focused crawling technique

ZHOU Li-zhu, LIN Ling

  1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  • Online: 2005-09-01 Published: 2011-04-11
  • Supported by:

    National Natural Science Foundation of China (60173008)

Abstract: The rapid growth of the Internet poses a great challenge for finding and discovering information on the World Wide Web. For the topic-specific or domain-specific queries that most users pose, traditional general-purpose search engines often fail to return satisfactory result pages. Topic-oriented focused crawlers were proposed to overcome this shortcoming, and focused crawling has since become one of the active research areas concerning the Web. This paper surveys that research. It introduces the basic concepts of the focused crawler and outlines how it works. Following the current state of research, it then systematically reviews and analyzes the key techniques of focused crawling: the description of crawl targets, webpage analysis algorithms, and strategies for searching the Web. How to crawl relevant data and information according to different requirements is discussed in detail, and three representative architectures of focused crawler systems are analyzed. Finally, some directions for future research are identified, including crawling for data analysis and data mining, the description and definition of topics, the discovery of relevant resources, Web data cleaning, and the extension of the search space.

Key words: focused crawler, information retrieval, link analysis, text retrieval, data extraction, collaborative crawling, ontology, metasearch
