Improved Web page clustering algorithm based on partial tag tree matching

Journal of Computer Applications (计算机应用), 2010, Vol. 30, Issue (3): 818-820.

Li Rui (李睿)¹, Zeng Junyu (曾俊瑀)¹, Zhou Siwang (周四望)²

1. School of Software, Hunan University
2. Hunan University

Received: 2009-09-24 Revised: 2009-11-12 Online: 2010-03-14 Published: 2010-03-01
Corresponding author: Li Rui (李睿)
Supported by:
The Natural Science Foundation of Hunan Province

Abstract

Abstract: Web information extraction requires clustering the pages of a target website in order to detect and generate the templates used for extraction. Traditional page clustering algorithms based on Document Object Model (DOM) tree edit distance are poorly suited to dynamic template pages whose DOM trees are structurally complex. This paper proposes an improved Web page clustering algorithm based on partial tag tree matching. The algorithm exploits the difference in tree level between template nodes and non-template nodes, assigns each node a matching weight according to how strongly it affects the page layout, and uses partial tree matching to compute the structural similarity between Web pages. Experimental results show that, compared with the traditional algorithm based on DOM tree edit distance, the improved algorithm clusters template-generated dynamic Web pages more accurately and with lower time complexity.

Key words: Web information extraction, Web page clustering, tree edit distance, partial tag tree matching
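
The abstract does not give the matching formula, so the following Python sketch only illustrates the general idea of weighted partial tree matching. It is not the authors' algorithm. It assumes a variant of the classic Simple Tree Matching recursion, a hypothetical weight of 1/2^depth so that shallow (layout-defining) nodes count more than deep ones, and a hypothetical cut-off `max_depth` below which nodes are ignored. The names `Node`, `weight`, `match`, and `similarity` are all illustrative.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tag tree node: only the tag name and the ordered children are kept."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def weight(depth: int) -> float:
    # Assumption: nodes nearer the root affect layout more, so weight halves per level.
    return 1.0 / (2 ** depth)


def tree_weight(node: Node, depth: int = 0, max_depth: int = 3) -> float:
    """Sum of the weights of all nodes at depth <= max_depth."""
    if depth > max_depth:
        return 0.0
    return weight(depth) + sum(
        tree_weight(child, depth + 1, max_depth) for child in node.children
    )


def match(a: Node, b: Node, depth: int = 0, max_depth: int = 3) -> float:
    """Weighted Simple Tree Matching, restricted to the top max_depth levels."""
    if depth > max_depth or a.tag != b.tag:
        return 0.0
    m, n = len(a.children), len(b.children)
    # table[i][j]: best order-preserving match between the first i children
    # of a and the first j children of b.
    table = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = table[i - 1][j - 1] + match(
                a.children[i - 1], b.children[j - 1], depth + 1, max_depth
            )
            table[i][j] = max(table[i - 1][j], table[i][j - 1], paired)
    return weight(depth) + table[m][n]


def similarity(a: Node, b: Node, max_depth: int = 3) -> float:
    """Normalized structural similarity in [0, 1]."""
    total = tree_weight(a, 0, max_depth) + tree_weight(b, 0, max_depth)
    return 2.0 * match(a, b, 0, max_depth) / total


# Two pages sharing a template skeleton; they differ only deep in the content.
page_a = Node("html", [Node("head"), Node("body", [
    Node("div", [Node("ul", [Node("li", [Node("a")])])]), Node("div")])])
page_b = Node("html", [Node("head"), Node("body", [
    Node("div", [Node("ul", [Node("li", [Node("span")])])]), Node("div")])])
# A page built from a different skeleton.
page_c = Node("html", [Node("head"), Node("body", [Node("table")])])

print(similarity(page_a, page_b), similarity(page_a, page_c))
```

Because the two template pages differ only below the cut-off depth, the sketch scores them as structurally identical, while the page with a different skeleton scores lower. A clustering step could then group pages whose pairwise similarity exceeds a chosen threshold.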

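Li Rui, Zeng Junyu, Zhou Siwang. Improved Web page clustering algorithm based on partial tag tree matching[J]. Journal of Computer Applications, 2010, 30(3): 818-820.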
