Journal of Computer Applications ›› 2011, Vol. 31 ›› Issue (12): 3264-3267.
• Database Technology •
熊忠阳,史艳,张玉芳
XIONG Zhong-yang, SHI Yan, ZHANG Yu-fang
Abstract: To address the shortcomings and limitations of traditional focused crawling strategies, a focused crawling strategy based on Wikipedia and page segmentation is proposed. The topic is described by a topic vector built from the Wikipedia category tree and topic description documents. After a web page is downloaded, page segmentation is applied to filter out noise links. Block relevance is taken into account when computing the priority of candidate links, which compensates for the limited information carried by anchor text. The size of the topic vector space is varied to test how the level of detail of the topic description affects crawling performance. Experimental results show that the method is effective and scalable and that, within a certain limit, the more detailed the topic description, the more relevant the collected web pages are to the topic.
Key words: focused crawling, Wikipedia, topic description, page segmentation, relevance computation
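The link-scoring idea in the abstract can be illustrated with a minimal sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: cosine similarity as the relevance measure, a linear mix of anchor-text and block relevance controlled by a hypothetical weight `alpha`, a hypothetical `noise_threshold` for discarding blocks, and a hand-written `topic` dictionary standing in for the vector that the paper derives from Wikipedia. Page segmentation itself is not shown; the blocks are assumed to be given.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercased bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_priority(topic, anchor_text, block_text, alpha=0.5):
    """Assumed scoring rule: weighted mix of anchor and block relevance."""
    anchor_rel = cosine(term_vector(anchor_text), topic)
    block_rel = cosine(term_vector(block_text), topic)
    return alpha * anchor_rel + (1 - alpha) * block_rel


def rank_links(topic, blocks, noise_threshold=0.05, alpha=0.5):
    """Rank links from segmented blocks, best first.

    blocks: list of (block_text, [(url, anchor_text), ...]).
    Blocks whose relevance falls below noise_threshold are treated
    as noise, and all of their links are dropped (an assumption).
    """
    scored = []
    for block_text, links in blocks:
        if cosine(term_vector(block_text), topic) < noise_threshold:
            continue
        for url, anchor in links:
            score = link_priority(topic, anchor, block_text, alpha)
            scored.append((score, url))
    return sorted(scored, reverse=True)


# Hypothetical topic vector and page blocks, for illustration only.
topic = {"crawler": 1.0, "web": 0.8, "search": 0.5}
blocks = [
    ("Focused web crawler design and search engines",
     [("http://example.org/a", "more"),
      ("http://example.org/b", "web crawler tutorial")]),
    ("Copyright advertising partners login",
     [("http://example.org/ads", "sponsored")]),
]
ranked = rank_links(topic, blocks)
```

In this toy input the link whose anchor text is just "more" still receives a nonzero priority because its surrounding block is on topic, which is the role the abstract assigns to block relevance. The links in the unrelated footer block are discarded.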
XIONG Zhong-yang, SHI Yan, ZHANG Yu-fang. Wikipedia-based focused crawling with page segmentation[J]. Journal of Computer Applications, 2011, 31(12): 3264-3267.
Link to this article: https://www.joca.cn/CN/
https://www.joca.cn/CN/Y2011/V31/I12/3264