Wikipedia-based focused crawling with page segmentation

Journal of Computer Applications ›› 2011, Vol. 31 ›› Issue (12): 3264-3267.

XIONG Zhong-yang, SHI Yan, ZHANG Yu-fang

College of Computer Science, Chongqing University, Chongqing 400044, China

Received: 2011-06-20 Revised: 2011-08-11 Online: 2011-12-12 Published: 2011-12-01
Corresponding author: SHI Yan
Funding: Supported by the Central Universities Graduate Science and Technology Innovation Fund (individual project)

Abstract

Abstract: To address the shortcomings and limitations of traditional focused crawling strategies, a focused crawling strategy based on Wikipedia and page segmentation is proposed. A topic vector, obtained from Wikipedia's category tree and topic description documents, is used to describe the topic. After a web page is downloaded, page segmentation is applied to filter out noise links. When the priority of candidate links is computed, block relevance is taken into account to compensate for the limited information carried by anchor text. The size of the topic vector space is varied to examine how the level of detail of the topic description affects crawling performance. Experimental results show that the strategy is effective and that, within a certain limit, the more detailed the topic description, the more relevant the collected web pages are.

Key words: focused crawling, Wikipedia, topic description, page segmentation, relevance computation
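
The abstract does not give the formulas used, so the following is only a minimal sketch of how block relevance might be combined with anchor-text relevance when ranking candidate links. Cosine similarity, the bag-of-words tokenizer, the weight `alpha`, the cutoff `block_threshold`, and all function names are illustrative assumptions, not the authors' method.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercased bag-of-words term frequencies (a simplifying assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_priorities(blocks, topic_vector, alpha=0.5, block_threshold=0.1):
    """Rank candidate links found in the blocks of one segmented page.

    blocks: list of (block_text, [(url, anchor_text), ...]).
    topic_vector: mapping term -> weight, standing in for the topic
        vector that the paper derives from Wikipedia.
    alpha and block_threshold are hypothetical tuning parameters.
    """
    scored = []
    for block_text, links in blocks:
        block_rel = cosine(term_vector(block_text), topic_vector)
        if block_rel < block_threshold:
            continue  # links in off-topic blocks are treated as noise
        for url, anchor in links:
            anchor_rel = cosine(term_vector(anchor), topic_vector)
            # Block relevance compensates for short, uninformative anchors.
            priority = alpha * anchor_rel + (1 - alpha) * block_rel
            scored.append((priority, url))
    return sorted(scored, reverse=True)
```

In this sketch a link with an uninformative anchor still receives a nonzero priority when its enclosing block is on topic, while links in blocks scoring below the cutoff are dropped before any priority is computed.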

XIONG Zhong-yang, SHI Yan, ZHANG Yu-fang. Wikipedia-based focused crawling with page segmentation[J]. Journal of Computer Applications, 2011, 31(12): 3264-3267.
