Journal of Computer Applications (计算机应用)

• Software Development and Typical Applications

Focused crawling method based on improved C4.5 exploiting anchor text

Liu Jinhong (刘金红), Lu Yuliang (陆余良)

  1. Network Department, PLA Electronic Engineering Institute (解放军电子工程学院网络系)
  • Received: 2006-06-05 Revised: 2006-07-26 Online: 2006-12-01 Published: 2006-12-01
  • Corresponding author: Liu Jinhong

Abstract: A focused crawling method based on anchor text and an improved C4.5 decision tree algorithm is proposed. A decision tree is trained on the set of terms extracted from the anchor text of URLs; the resulting model is then used to compute the topical relevance of downloaded pages and to rank the priority of the URLs waiting to be crawled. A prototype system named DTFC was implemented with this method, and focused crawling experiments on the topic "academic report" were carried out on a page data set drawn from four university websites. In a performance comparison with two standard crawlers, DTFC performed better at focused crawling, which supports the effectiveness of the method.

Keywords: focused crawler, anchor text, decision tree
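
The pipeline summarized in the abstract (train a decision tree on anchor-text terms, then use it to order the URLs waiting to be crawled) can be illustrated with a minimal sketch. It uses the plain C4.5 gain-ratio criterion on binary term-presence features; the abstract does not describe the authors' improvement to C4.5 or the internals of DTFC, so neither is reproduced here. All function names, the toy training data, and the example URLs are hypothetical.

```python
import heapq
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(samples, term):
    """Gain ratio of splitting (terms, label) samples on presence of `term`."""
    labels = [label for _, label in samples]
    with_term = [label for terms, label in samples if term in terms]
    without = [label for terms, label in samples if term not in terms]
    if not with_term or not without:
        return 0.0  # the term does not split the samples
    total = len(samples)
    remainder = sum(len(part) / total * entropy(part) for part in (with_term, without))
    split_info = entropy([term in terms for terms, _ in samples])
    return (entropy(labels) - remainder) / split_info


def build_tree(samples, vocabulary):
    """Return a leaf (fraction of on-topic samples) or a (term, yes, no) node."""
    positive = sum(label for _, label in samples) / len(samples)
    best = max(vocabulary, key=lambda t: gain_ratio(samples, t), default=None)
    if best is None or gain_ratio(samples, best) <= 0.0:
        return positive
    yes = [s for s in samples if best in s[0]]
    no = [s for s in samples if best not in s[0]]
    rest = vocabulary - {best}
    return (best, build_tree(yes, rest), build_tree(no, rest))


def score(tree, terms):
    """Walk the tree and return the leaf's on-topic fraction for `terms`."""
    while isinstance(tree, tuple):
        term, yes, no = tree
        tree = yes if term in terms else no
    return tree


def order_frontier(tree, links):
    """Return URLs sorted so the highest-scoring anchor text is crawled first."""
    heap = [(-score(tree, set(anchor.lower().split())), url) for url, anchor in links]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


# Toy training data (invented for illustration): anchor-text terms, on-topic label.
TRAINING = [
    ({"academic", "report"}, 1),
    ({"seminar", "report"}, 1),
    ({"academic", "lecture"}, 1),
    ({"campus", "news"}, 0),
    ({"contact", "us"}, 0),
    ({"annual", "report", "finance"}, 0),
]
VOCABULARY = set().union(*(terms for terms, _ in TRAINING))
TREE = build_tree(TRAINING, VOCABULARY)
```

Calling `order_frontier(TREE, links)` with a list of `(url, anchor_text)` pairs returns the URLs in crawl order. A real crawler would also need tokenization suited to the page language, pruning, and a much larger labeled set of anchor texts; this sketch leaves those out.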