Wikipedia-based focused crawling with page segmentation

Journal of Computer Applications, 2011, Vol. 31, Issue (12): 3264-3267.

• Database technology •

XIONG Zhong-yang, SHI Yan, ZHANG Yu-fang

College of Computer Science, Chongqing University, Chongqing 400044, China

Received: 2011-06-20 Revised: 2011-08-11 Online: 2011-12-12 Published: 2011-12-01
Contact: SHI Yan

Supported by:
Individual Project of the Graduate Science and Technology Innovation Fund for Central Universities

Abstract

Abstract: To address the shortcomings and limitations of traditional focused crawling methods, a Wikipedia-based focused crawling strategy with page segmentation is proposed. The strategy builds a topic vector from the Wikipedia category tree and topic description documents and uses it to describe the topic. After a web page is downloaded, page segmentation is applied to filter out noise links. When the priority of candidate links is computed, block relevance is taken into account to compensate for the limited information carried by anchor text. The size of the topic vector space is varied to test whether the level of detail in the topic description affects focused crawling performance. Experimental results show that the method is effective and scalable, and that, within a certain limit, the more detailed the topic description, the more relevant the collected web pages are to the topic.

Key words: focused crawling, Wikipedia, topic description, page segmentation, relevance computation
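
The abstract does not give the priority formula, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes plain term-frequency vectors, cosine similarity, and a linear blend of anchor-text relevance and block relevance. The names `link_priority` and `rank_links`, the weight `alpha`, the threshold `min_block_relevance`, and the example.org URLs are all hypothetical.

```python
import heapq
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased alphabetic tokens)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity of two sparse term vectors; 0.0 if either is empty."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def link_priority(anchor_text, block_text, topic, alpha=0.5):
    """Assumed blend: alpha weights the anchor text, 1 - alpha the enclosing block."""
    anchor_score = cosine(term_vector(anchor_text), topic)
    block_score = cosine(term_vector(block_text), topic)
    return alpha * anchor_score + (1 - alpha) * block_score


def rank_links(blocks, topic, min_block_relevance=0.1, alpha=0.5):
    """Skip links in low-relevance blocks, then order the rest by priority."""
    heap = []
    for block_text, links in blocks:
        if cosine(term_vector(block_text), topic) < min_block_relevance:
            continue  # treat the whole block as noise (navigation, ads, ...)
        for url, anchor_text in links:
            score = link_priority(anchor_text, block_text, topic, alpha)
            heapq.heappush(heap, (-score, url))  # negate: heapq is a min-heap
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


# Hypothetical topic vector; the paper derives it from Wikipedia instead.
topic = term_vector("machine learning neural network classification training")
blocks = [
    ("Neural network training for classification tasks in machine learning",
     [("http://example.org/nn", "read more"),
      ("http://example.org/svm", "machine learning classification")]),
    ("Login register privacy policy contact us",
     [("http://example.org/login", "login")]),
]
ranked = rank_links(blocks, topic)
```

In this toy input the "read more" link has an uninformative anchor but still receives a nonzero priority from its on-topic block, while the link in the navigation block is filtered out.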

XIONG Zhong-yang, SHI Yan, ZHANG Yu-fang. Wikipedia-based focused crawling with page segmentation[J]. Journal of Computer Applications, 2011, 31(12): 3264-3267.

