Detection and elimination of similar Web pages based on text structure and string of feature code

doi:10.3724/SP.J.1087.2013.00554

Journal of Computer Applications (计算机应用), 2013, Vol. 33, Issue (02): 554-557. DOI: 10.3724/SP.J.1087.2013.00554

XIONG Zhongyang (熊忠阳), YA Man (牙漫), ZHANG Yufang (张玉芳)

College of Computer Science, Chongqing University, Chongqing 400044, China

Received: 2012-08-20  Revised: 2012-09-14  Online: 2013-02-25  Published: 2013-02-01
Corresponding author: YA Man
About the authors: XIONG Zhongyang (born 1962), male, from Chongqing, professor, Ph.D.; main research interests: data mining and parallel processing.
YA Man (born 1986), female, from Baoding, Hebei, master's student; main research interests: data mining and search engines.
ZHANG Yufang (born 1965), female, from Shanghai, professor; main research interest: data mining.

Abstract

Abstract: To reduce the interference of duplicate Web pages with users and to improve deduplication efficiency, a new algorithm for large-scale Web page deduplication is proposed. First, predefined Web page tag values are used to build a structure tree of the page's body text, so that fingerprint similarity can be computed layer by layer. Second, the first and last Chinese characters of the sentences that contain high-frequency punctuation marks are extracted as feature codes. Finally, the Bloom filter algorithm is applied to the resulting feature fingerprints to judge the similarity of Web pages. Experimental results show that the algorithm raises the recall rate to more than 90% and reduces the time complexity to O(n).

Key words: detection and elimination of similar Web pages, Web page tag value, high-frequency punctuation, feature code, fingerprint similarity of Web pages
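
The feature-code and Bloom filter steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the punctuation set, the number of high-frequency marks used, the hash functions, or the filter size, so `PUNCTUATION`, `top_k`, `size_bits`, `num_hashes`, and all function names below are assumptions made for illustration. A "sentence" is assumed here to be the run of Chinese characters that ends at a punctuation mark, and the structure-tree step is omitted because the predefined tag values are not stated.

```python
import hashlib
from collections import Counter

# Assumed delimiter set; the abstract does not list the marks actually used.
PUNCTUATION = "，。！？；：、"


def is_hanzi(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def feature_codes(text, top_k=1):
    """Head+tail Chinese characters of segments ending at a frequent mark."""
    counts = Counter(ch for ch in text if ch in PUNCTUATION)
    frequent = {mark for mark, _ in counts.most_common(top_k)}
    codes, segment = [], []
    for ch in text:
        if ch in PUNCTUATION:
            if ch in frequent and segment:
                codes.append(segment[0] + segment[-1])
            segment = []  # start a new segment after any punctuation mark
        elif is_hanzi(ch):
            segment.append(ch)
    return codes


class BloomFilter:
    """Small Bloom filter; one byte per bit position for readability."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)

    def _positions(self, item):
        # Derive several hash positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def similarity(codes_a, codes_b):
    """Fraction of codes_b that the filter built from codes_a reports as seen."""
    if not codes_b:
        return 0.0
    seen = BloomFilter()
    for code in codes_a:
        seen.add(code)
    return sum(code in seen for code in codes_b) / len(codes_b)


page = "今天天气很好，我们去公园，玩得开心。"
codes = feature_codes(page)
print(codes, similarity(codes, codes))
```

Each feature code is inserted or looked up in constant time, so comparing a page against the filter is linear in the number of codes. A Bloom filter can report false positives but never false negatives, so a score computed this way can overestimate similarity slightly.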

XIONG Zhongyang, YA Man, ZHANG Yufang. Detection and elimination of similar Web pages based on text structure and string of feature code[J]. Journal of Computer Applications, 2013, 33(02): 554-557.

