Detection and elimination of similar Web pages based on text structure and string of feature code
XIONG Zhongyang, YA Man, ZHANG Yufang
Journal of Computer Applications
2013, 33(02): 554-557.
DOI: 10.3724/SP.J.1087.2013.00554
To reduce interference from duplicated Web pages and to improve the efficiency of detecting and eliminating similar pages, a new detection algorithm for large-scale Web page collections is proposed. First, the algorithm uses Web page tag values to build text structure trees, so that fingerprint similarity is computed layer by layer. Second, the head and tail words of sentences that contain high-frequency punctuation marks are extracted as the feature code. Finally, the fingerprint similarity of the Web page features is determined with a Bloom filter. Experimental results show that the algorithm raises the recall rate to more than 90% and reduces the time complexity to O(n).
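The feature-code and Bloom filter steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `BloomFilter`, `feature_code`, and `is_duplicate`, the delimiter set, the filter size, and the use of salted SHA-256 hashes are all assumptions made for illustration. The sketch also assumes whitespace-separated words, omits the text structure tree and its layer-by-layer comparison, and treats two pages as similar only when their feature codes match exactly. With a fixed number of hash functions, each page costs a constant number of bit operations, which is consistent with a linear pass over n pages.

```python
import hashlib
import re


class BloomFilter:
    """Minimal Bloom filter backed by a bytearray (illustrative only)."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        # Sizes are arbitrary assumptions, not values from the paper.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive k bit positions by salting a SHA-256 hash with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def feature_code(text, delimiters=".,;!?"):
    """Join the head and tail words of each punctuation-delimited segment."""
    segments = re.split("[" + re.escape(delimiters) + "]", text)
    parts = []
    for segment in segments:
        words = segment.split()
        if words:
            parts.append(words[0] + words[-1])
    return "|".join(parts)


def is_duplicate(text, seen):
    """Return True if this feature code was probably seen; otherwise record it."""
    code = feature_code(text)
    if code in seen:
        return True
    seen.add(code)
    return False
```

In this sketch, two pages whose sentences begin and end with the same words produce the same feature code even if the middle of each sentence differs, so the second page is flagged as a probable duplicate. Because a Bloom filter can return false positives, a production system would likely confirm flagged pages with a second, exact comparison.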