Journal of Computer Applications, 2013, Vol. 33, Issue (02): 554-557. DOI: 10.3724/SP.J.1087.2013.00554
• Database technology •
XIONG Zhongyang, YA Man, ZHANG Yufang
熊忠阳,牙漫,张玉芳
Abstract: To reduce the interference that duplicate Web pages cause for users and to make de-duplication more efficient, a new algorithm for detecting and eliminating similar Web pages at large scale is proposed. First, the algorithm uses predefined Web tag values to build a text structure tree for each page, so that fingerprint similarity can be computed layer by layer. Second, it extracts the first and last Chinese characters of each sentence that contains a high-frequency punctuation mark and uses them as the feature code. Finally, it applies a Bloom filter to the resulting feature fingerprints to judge whether pages are similar. Experimental results show that the algorithm raises the recall rate to above 90% and reduces the time complexity to O(n).
Key words: detection and elimination of similar Web pages, Web tag value, high-frequency punctuation, feature code, fingerprint similarity of Web pages
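The feature-code and Bloom filter steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the punctuation set, the `top_k` cutoff, the filter size, the number of hash functions, and the similarity ratio are all assumptions chosen for illustration, and the layer-by-layer comparison over the text structure tree is not modeled.

```python
import hashlib
import re
from collections import Counter

# Assumed candidate punctuation marks; the paper's actual set is not given here.
PUNCTUATION = "，。！？；：,.!?;:"


def feature_codes(text, top_k=1):
    """Return first+last characters of each segment that is terminated by
    one of the text's top_k most frequent punctuation marks."""
    counts = Counter(ch for ch in text if ch in PUNCTUATION)
    marks = [mark for mark, _ in counts.most_common(top_k)]
    if not marks:
        return []
    pattern = "[" + re.escape("".join(marks)) + "]"
    # Drop the last piece: it follows the final mark and is not terminated by one.
    segments = re.split(pattern, text)[:-1]
    codes = []
    for segment in segments:
        segment = segment.strip()
        if len(segment) >= 2:
            codes.append(segment[0] + segment[-1])
    return codes


class BloomFilter:
    """Small Bloom filter; size and hash count are illustrative defaults."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7))
            for pos in self._positions(item)
        )


def similarity(text_a, text_b):
    """Fraction of text_b's feature codes that (probably) occur in text_a."""
    codes_a, codes_b = feature_codes(text_a), feature_codes(text_b)
    if not codes_a or not codes_b:
        return 0.0
    seen = BloomFilter()
    for code in codes_a:
        seen.add(code)
    hits = sum(1 for code in codes_b if code in seen)
    return hits / len(codes_b)
```

Each page is scanned once and every membership test costs a fixed number of hash evaluations, which is in line with the linear time complexity the abstract reports. A Bloom filter can return false positives but never false negatives, so the ratio above may slightly overestimate similarity.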
CLC Number:
TP391.1
TP393.092
XIONG Zhongyang, YA Man, ZHANG Yufang. Detection and elimination of similar Web pages based on text structure and string of feature code[J]. Journal of Computer Applications, 2013, 33(02): 554-557.
熊忠阳, 牙漫, 张玉芳. 基于网页正文结构和特征串的相似网页去重算法[J]. 计算机应用, 2013, 33(02): 554-557.
URL: https://www.joca.cn/EN/10.3724/SP.J.1087.2013.00554
https://www.joca.cn/EN/Y2013/V33/I02/554