Journal of Computer Applications (计算机应用)

• Database

Analysis of two algorithms for extracting repeated phrases

Yin Bo, Jiang Hua

  1. Guilin University of Electronic Technology
  • Received: 2008-09-02 Revised: 2008-10-22 Online: 2009-04-22 Published: 2009-02-01
  • Corresponding author: Yin Bo

New algorithm based on repeat sequence deletion

Bo Yin, Hua Jiang

  • Received:2008-09-02 Revised:2008-10-22 Online:2009-04-22 Published:2009-02-01
  • Contact: Bo Yin

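Abstract (translated from the Chinese): To address the de-duplication of duplicate web pages, two algorithms for extracting repeated phrases are systematically analyzed and compared. The STC algorithm performs very well in time cost, while the inverted-index method over repeated sequences is superior in space complexity. The repeated-sequence method is improved by combining it with the STC algorithm: for duplicate pages that arise from topic-oriented reposting, repeated strings are extracted first, and these strings are then used as the index on which STC performs its repeat extraction. Experimental results show that the improved algorithm greatly increases time efficiency while retaining the space characteristics of the original method.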

Keywords (translated from the Chinese): repeated phrases, repeated sequences, suffix tree

Abstract: To address the limitations of current de-duplication algorithms, two repeated sequence (RS) extraction algorithms were compared and analyzed. Because STC performs well in time cost and the inverted index method is superior in space complexity, STC was used to improve the RS algorithm. Experimental results show that this method finds similar Web pages efficiently. The algorithm reaches high precision when removing duplicated Web pages in a single language, and it also reaches its maximum precision when applied to the removal of duplicated Web pages.

Key words: repeated sequences, repeated segments, suffix tree
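
The two-stage idea in the abstract (first extract the strings that repeat across pages, then use those strings as index keys for grouping pages) can be illustrated with a small sketch. This is not the authors' implementation: it replaces the suffix tree with fixed-length word n-grams for brevity, and the function names `repeated_ngrams` and `candidate_duplicates`, as well as the parameters `n` and `min_shared`, are hypothetical choices made only for this example.

```python
from collections import defaultdict
from itertools import combinations


def repeated_ngrams(docs, n=3):
    """Build an inverted index from word n-grams to the documents containing them.

    Only n-grams that occur in at least two documents are kept, so the
    index holds repeated strings only (a simplified stand-in for the
    repeated-sequence extraction step; no suffix tree is built here).
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.split()
        for i in range(len(words) - n + 1):
            index[tuple(words[i:i + n])].add(doc_id)
    return {gram: ids for gram, ids in index.items() if len(ids) >= 2}


def candidate_duplicates(docs, n=3, min_shared=2):
    """Return pairs of document ids that share at least `min_shared` repeated n-grams."""
    shared = defaultdict(int)
    for ids in repeated_ngrams(docs, n).values():
        for a, b in combinations(sorted(ids), 2):
            shared[(a, b)] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}


if __name__ == "__main__":
    pages = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "news: the quick brown fox jumps over a fence",
        "c": "completely different text about suffix trees here",
    }
    print(candidate_duplicates(pages))
```

Because the index stores only strings that already repeat across pages, it stays much smaller than an index over every n-gram, which loosely mirrors the space argument made for the inverted-index method; the time behavior of a real suffix-tree-based STC step is not reproduced by this sketch.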