Journal of Computer Applications ›› 2013, Vol. 33 ›› Issue (02): 554-557. DOI: 10.3724/SP.J.1087.2013.00554

• Database Technology •

Detection and elimination of similar Web pages based on text structure and string of feature code

XIONG Zhongyang, YA Man, ZHANG Yufang

  1. College of Computer Science, Chongqing University, Chongqing 400044, China
  • Received: 2012-08-20 Revised: 2012-09-14 Online: 2013-02-01 Published: 2013-02-25
  • Contact: YA Man
  • About the authors: XIONG Zhongyang (born 1962), male, from Chongqing, professor, Ph.D.; research interests: data mining, parallel processing;
    YA Man (born 1986), female, from Baoding, Hebei, master's student; research interests: data mining, search engines;
    ZHANG Yufang (born 1965), female, from Shanghai, professor; research interests: data mining.

Abstract: To reduce the interference that duplicate Web pages cause for users and to improve deduplication efficiency, a new algorithm for large-scale Web page deduplication is proposed. First, predefined Web page tag values are used to build a structure tree of each page's main text, so that fingerprint similarity can be computed level by level. Second, the first and last Chinese characters of the sentences that contain high-frequency punctuation marks are extracted as feature codes. Finally, the Bloom filter algorithm is applied to the resulting feature fingerprints to judge the similarity of Web pages. Experimental results show that the algorithm raises the recall rate to above 90% and reduces the time complexity to O(n).

Key words: detection and elimination of similar Web pages, Web page tag value, high-frequency punctuation, feature code, fingerprint similarity of Web pages
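The second and third steps summarized in the abstract (feature codes taken from sentence boundaries, then a Bloom filter lookup) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `feature_codes`, `BloomFilter`, and `similarity`, the punctuation set, the `top_k` parameter, the filter size, the hash count, and the hit-ratio score are all assumptions made for illustration, and the structure-tree step is omitted.

```python
import hashlib
from collections import Counter

# Assumed set of candidate sentence delimiters (not specified in the abstract).
PUNCTUATION = ",。;!?、:"


def feature_codes(text, top_k=1):
    """Return first+last Chinese characters of each segment that ends
    at one of the top_k most frequent punctuation marks in the text."""
    counts = Counter(ch for ch in text if ch in PUNCTUATION)
    frequent = {mark for mark, _ in counts.most_common(top_k)}
    codes, segment = [], []
    for ch in text:
        if ch in PUNCTUATION:
            if ch in frequent and segment:
                codes.append(segment[0] + segment[-1])
            segment = []  # any punctuation mark closes the current segment
        elif "\u4e00" <= ch <= "\u9fff":  # keep only CJK unified ideographs
            segment.append(ch)
    return codes


class BloomFilter:
    """Small Bloom filter; one byte per bit slot for readability."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def similarity(codes_a, codes_b):
    """Assumed score: fraction of codes_b found in a filter built from codes_a."""
    if not codes_b:
        return 0.0
    bloom = BloomFilter()
    for code in codes_a:
        bloom.add(code)
    hits = sum(1 for code in codes_b if code in bloom)
    return hits / len(codes_b)


page = "今天天气很好,我们去公园玩,大家都很开心。"
codes = feature_codes(page)
score = similarity(codes, codes)
```

In this sketch each page is scanned a constant number of times and each feature code is hashed a fixed number of times, so the work grows linearly with the page length. A Bloom filter can report false positives but never false negatives, so the hit ratio may slightly overestimate the true overlap between two pages.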

CLC Number: