Journal of Computer Applications ›› 2011, Vol. 31 ›› Issue (12): 3264-3267.
• Database technology •
XIONG Zhong-yang,SHI Yan,ZHANG Yu-fang
Abstract: To address the shortcomings and limitations of traditional focused crawling methods, a Wikipedia-based focused crawling strategy with page segmentation is proposed. The topic vector, which describes the topic, is built from Wikipedia's category tree and topic description document. After a web page is downloaded, page segmentation is applied to filter out noise links. When the priority of candidate links is computed, block relevance is taken into account to compensate for the limited information carried by anchor text. By varying the size of the topic vector space, the study also tests whether the level of detail of the topic description affects focused crawling performance. Experimental results show that the method is effective and scalable and that, within a certain limit, the more detailed the topic description, the more relevant to the topic the collected web pages are.
Key words: focused crawling, Wikipedia, topic description, page segmentation, relevance computation
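The link-priority idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the formulas, so the bag-of-words term vectors, the cosine similarity measure, the linear weighting `alpha`, the `noise_threshold` cutoff, and all function names below are assumptions chosen for illustration. The topic vector here is built from a plain string, whereas the paper derives it from Wikipedia's category tree and topic description document.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased ASCII word tokens only)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def link_priority(anchor_text, block_text, topic_vec, alpha=0.5):
    """Assumed scoring rule: weighted mix of anchor-text and block relevance."""
    anchor_rel = cosine(term_vector(anchor_text), topic_vec)
    block_rel = cosine(term_vector(block_text), topic_vec)
    return alpha * anchor_rel + (1 - alpha) * block_rel


def rank_links(blocks, topic_vec, noise_threshold=0.05, alpha=0.5):
    """Rank candidate links; blocks is a list of (block_text, [(url, anchor), ...]).

    Links inside blocks whose relevance falls below noise_threshold are
    dropped as noise (the threshold is an assumption, not from the paper).
    """
    ranked = []
    for block_text, links in blocks:
        if cosine(term_vector(block_text), topic_vec) < noise_threshold:
            continue  # off-topic block: skip all of its links
        for url, anchor in links:
            score = link_priority(anchor, block_text, topic_vec, alpha)
            ranked.append((score, url))
    return sorted(ranked, reverse=True)
```

In this sketch a link with an uninformative anchor such as "read more" still receives a nonzero priority when its enclosing block is on topic, which is the role the abstract assigns to block relevance.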
XIONG Zhong-yang, SHI Yan, ZHANG Yu-fang. Wikipedia-based focused crawling with page segmentation[J]. Journal of Computer Applications, 2011, 31(12): 3264-3267.
URL: https://www.joca.cn/EN/
https://www.joca.cn/EN/Y2011/V31/I12/3264