Search Result

Journals

Publication Years

Keywords

Please wait a minute...

For Selected:

Download Citations
EndNote Ris BibTeX

Toggle Thumbnails

Select

Detection and elimination of similar Web pages based on text structure and string of feature code

XIONG Zhongyang YA Man ZHANG Yufang

Journal of Computer Applications 2013, 33 (02): 554-557. DOI: 10.3724/SP.J.1087.2013.00554

Abstract （924）

PDF （661KB）（529）

Save

In order to reduce the interference of the duplicated Web pages, and improve the efficiency of detection and elimination of similar Web pages, a new kind of large-scale Web page detection algorithm was proposed. Firstly, adopting the Web label values, the algorithm created the text structure trees to realize the fingerprint similarity calculation layer by layer. Secondly, the head and tail words of a certain sentence, in which high frequency punctuations occur, were extracted out as the feature code. Lastly, the fingerprint similarity of Web page features was discriminated with Bloom filter algorithm. The experimental results show that the algorithm can improve the recall rate up to more than 90%, and reduce the time complexity to O(n).