Detection and elimination of similar Web pages based on text structure and string of feature code

doi:10.3724/SP.J.1087.2013.00554

Journal of Computer Applications, 2013, Vol. 33, Issue (02): 554-557. DOI: 10.3724/SP.J.1087.2013.00554

• Database technology •

XIONG Zhongyang, YA Man, ZHANG Yufang

College of Computer Science, Chongqing University, Chongqing 400044, China

Received: 2012-08-20  Revised: 2012-09-14  Online: 2013-02-25  Published: 2013-02-01
Contact: YA Man

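Original Chinese title: 基于网页正文结构和特征串的相似网页去重算法

Author biographies: XIONG Zhongyang (熊忠阳, born 1962), male, from Chongqing, professor, Ph.D.; research interests: data mining and parallel processing.
YA Man (牙漫, born 1986), female, from Baoding, Hebei, master's student; research interests: data mining and search engines.
ZHANG Yufang (张玉芳, born 1965), female, from Shanghai, professor; research interests: data mining.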

Abstract: To reduce the interference that duplicate Web pages cause for users and to improve the efficiency of detecting and eliminating similar Web pages, a new algorithm for large-scale Web page deduplication is proposed. First, the algorithm uses predefined Web label (tag) values to build a tree of each page's main-text structure, so that fingerprint similarity can be computed layer by layer. Second, for each sentence that contains a high-frequency punctuation mark, it extracts the first and last Chinese characters of the sentence as the feature code. Finally, it applies a Bloom filter to the resulting feature fingerprints to judge whether two pages are similar. Experimental results show that the algorithm raises the recall rate to above 90% and reduces the time complexity to O(n).

Key words: detection and elimination of similar Web pages, Web label value, high frequency punctuation, feature code, fingerprint similarity of Web page
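The second and third steps of the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the punctuation set `PUNCTUATION`, the `top_k` cutoff, the treatment of a "sentence" as the text run ending at a punctuation mark, the Bloom filter parameters, and the `similarity` ratio are all assumptions made for illustration, and the text-structure tree of the first step is omitted.

```python
import hashlib
from collections import Counter

# Assumption: candidate punctuation marks; the paper's actual set is not given here.
PUNCTUATION = "，。！？；、,.!?;"


def feature_codes(text, top_k=1):
    """Return first+last character pairs of text runs that end at a
    high-frequency punctuation mark (the top_k most common marks)."""
    counts = Counter(ch for ch in text if ch in PUNCTUATION)
    frequent = {mark for mark, _ in counts.most_common(top_k)}
    codes, sentence = [], []
    for ch in text:
        if ch in PUNCTUATION:
            if ch in frequent and sentence:
                codes.append(sentence[0] + sentence[-1])
            sentence = []
        elif not ch.isspace():
            sentence.append(ch)
    return codes


class BloomFilter:
    """A small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def similarity(page_a, page_b):
    """Fraction of page_b's feature codes that are (probably) in page_a."""
    seen = BloomFilter()
    for code in feature_codes(page_a):
        seen.add(code)
    codes_b = feature_codes(page_b)
    if not codes_b:
        return 0.0
    return sum(code in seen for code in codes_b) / len(codes_b)
```

Each page is scanned once and every feature code costs a fixed number of hash lookups, which is consistent with the linear-time behaviour claimed in the abstract. A real system would still need a similarity threshold and the layer-by-layer comparison over the text-structure tree, neither of which is specified here.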

XIONG Zhongyang, YA Man, ZHANG Yufang. Detection and elimination of similar Web pages based on text structure and string of feature code[J]. Journal of Computer Applications, 2013, 33(02): 554-557.
