Analysis of Two Algorithms for Extracting Repeated Words and Sentences

Journal of Computer Applications

Yin Bo, Jiang Hua

Guilin University of Electronic Technology (both authors)

Received: 2008-09-02 Revised: 2008-10-22 Online: 2009-04-22 Published: 2009-02-01
Corresponding author: Yin Bo

Published English title: New algorithm based on repeat sequence deletion

Abstract

Abstract: To address the de-duplication of duplicated Web pages, two algorithms for extracting repeated words and sentences were systematically analyzed and compared. The STC algorithm performs well in terms of time cost, while the inverted-index method based on repeated sequences (RS) is superior in space complexity. The RS method was therefore improved by combining it with STC: for duplicated Web pages produced by topic-oriented reposting, repeated strings are extracted first and then used as the index for STC-based duplicate extraction. Experimental results show that the improved algorithm greatly increases time efficiency while retaining the space characteristics of the original method. According to the published English abstract, the method also finds similar Web pages efficiently and reaches high precision when removing duplicated Web pages, including in the mono-language setting.

Key words: repeated words and sentences (repeated segments), repeated sequences, suffix tree
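
The two-stage idea in the abstract (extract repeated strings first, then use them as the index for grouping documents) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `repeated_ngrams`, `base_clusters`, and `duplicate_pairs`, the word trigram window, and the `min_shared` threshold are all assumptions made for illustration, and a plain dictionary stands in for the suffix tree that STC would build.

```python
from collections import defaultdict
from itertools import combinations


def repeated_ngrams(docs, n=3):
    """Inverted index: word n-gram -> set of ids of documents containing it.

    Only n-grams that occur in at least two documents are kept.
    """
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.split()
        for i in range(len(words) - n + 1):
            index[tuple(words[i:i + n])].add(doc_id)
    return {gram: ids for gram, ids in index.items() if len(ids) >= 2}


def base_clusters(repeated):
    """Group repeated n-grams by the exact document set that shares them.

    This imitates an STC-style base cluster (documents sharing phrases)
    without building a suffix tree; it is a simplification, not STC itself.
    """
    clusters = defaultdict(list)
    for gram, ids in repeated.items():
        clusters[frozenset(ids)].append(" ".join(gram))
    return clusters


def duplicate_pairs(clusters, min_shared=2):
    """Return document pairs sharing at least `min_shared` repeated n-grams."""
    shared = defaultdict(int)
    for ids, grams in clusters.items():
        for pair in combinations(sorted(ids), 2):
            shared[pair] += len(grams)
    return sorted(pair for pair, count in shared.items() if count >= min_shared)


docs = [
    "the quick brown fox jumps over the lazy dog",
    "breaking news the quick brown fox jumps over the lazy dog today",
    "an unrelated page about suffix trees and clustering",
]
pairs = duplicate_pairs(base_clusters(repeated_ngrams(docs)))
print(pairs)  # [(0, 1)]
```

Restricting the second stage to strings that already repeat across documents keeps the index small, which is the space-saving intuition the abstract attributes to the repeated-sequence method.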

Yin Bo, Jiang Hua. Analysis of Two Algorithms for Extracting Repeated Words and Sentences (published English title: New algorithm based on repeat sequence deletion) [J]. Journal of Computer Applications.
