%0 Journal Article
%A QIU Tongxiang
%A WANG Jingzhong
%T Focused topic Web crawler based on improved TF-IDF algorithm
%D 2015
%R 10.11772/j.issn.1001-9081.2015.10.2901
%J Journal of Computer Applications
%P 2901-2904
%V 35
%N 10
%X Web search results contain a large amount of irrelevant data, and semantic retrieval with the traditional TF-IDF algorithm, the *K*-means algorithm and the adaptive genetic algorithm has low accuracy. To address this, an improvement of the TF-IDF algorithm and its application to semantic retrieval were studied. The TF-IDF algorithm was improved by applying regular expressions to semantic analysis. The search topic was described by a semantic database. The similarity of the regular atoms in a document was obtained by a weighted calculation based on the semantic importance of each regular atom and its position in the Web page. The final result was obtained by a cosine operation on the document similarity and the topic pattern in the vector space model. Finally, the improved TF-IDF algorithm, the traditional TF-IDF algorithm, the *K*-means algorithm and the adaptive genetic algorithm were each applied to the focused topic Web crawler and their results were compared. The results show that, in the vertical search of the focused topic Web crawler, the accuracy of the improved TF-IDF algorithm rose by 17.1 percentage points and its omission rate fell by 7.76 percentage points. Compared with the *K*-means algorithm and the adaptive genetic algorithm, the accuracy of the improved TF-IDF algorithm rose by 6 percentage points and 8.1 percentage points, respectively. In summary, the improved TF-IDF algorithm effectively improves the accuracy of document similarity detection and largely remedies the weakness of the focused topic Web crawler in semantic analysis.
%U http://www.joca.cn/EN/10.11772/j.issn.1001-9081.2015.10.2901