Journal of Computer Applications ›› 2015, Vol. 35 ›› Issue (10): 2901-2904.

### Focused topic Web crawler based on improved TF-IDF algorithm

1. School of Computer, North China University of Technology, Beijing 100144, China
• Received: 2015-05-11 Revised: 2015-07-11 Online: 2015-10-10 Published: 2015-10-14

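• Corresponding author: Wang Jingzhong (born 1962), male, from Tongliao, Inner Mongolia; professor; research interests: information security, image processing, data mining. 1023623558@qq.com
• About the author: Qiu Tongxiang (born 1989), male, from Xinfeng, Jiangxi; master's student; research interests: information security, data mining.
• Funding: National Natural Science Foundation of China (61371142); Beijing Municipal Innovation Team Building and Promotion Plan Project (IDHT20130502).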

Abstract: Web search results contain a large amount of irrelevant data, and semantic retrieval based on the traditional TF-IDF algorithm, the K-means algorithm, or the adaptive genetic algorithm has low accuracy. To address these problems, an improvement of the TF-IDF algorithm and its application to semantic retrieval were studied. The TF-IDF algorithm was improved by applying regular expressions to semantic analysis. The search topic was described by a semantic database. The similarity of the regular-expression atoms in a document was obtained by a weighted calculation based on the semantic importance of each atom and its position in the Web page. The final result was obtained in the vector space model as the cosine similarity between the document and the topic pattern. Finally, the improved TF-IDF algorithm, the traditional TF-IDF algorithm, the K-means algorithm and the adaptive genetic algorithm were each applied to a focused topic Web crawler, and their results were compared. The results show that, in the vertical search of the focused topic Web crawler, the accuracy of the improved TF-IDF algorithm rose by 17.1 percentage points and its omission rate fell by 7.76 percentage points. Compared with the K-means algorithm and the adaptive genetic algorithm, the accuracy of the improved TF-IDF algorithm was higher by 6 percentage points and 8.1 percentage points, respectively. In summary, the improved TF-IDF algorithm effectively raises the accuracy of document similarity detection and substantially remedies the weakness of focused topic Web crawlers in semantic analysis.
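
The abstract gives no formulas, so the following Python sketch illustrates only the general idea it describes: counting regular-expression matches, weighting them by their position in the page, and comparing the resulting TF-IDF vector with a topic vector by cosine similarity. The position weights, the smoothed IDF formula, the field names (`title`, `heading`, `body`) and the function names are all assumptions made for illustration, not the authors' method.

```python
import math
import re
from collections import Counter

# Hypothetical position weights; the abstract says position matters
# but does not give any values.
POSITION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def weighted_tf(page, patterns):
    """Count regex matches per pattern, scaled by the weight of the page field."""
    counts = Counter()
    for field, text in page.items():
        weight = POSITION_WEIGHTS.get(field, 1.0)
        for name, pattern in patterns.items():
            hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
            counts[name] += weight * hits
    return counts


def smoothed_idf(page_tfs, patterns):
    """Smoothed inverse document frequency (an assumed, commonly used variant)."""
    n = len(page_tfs)
    idf = {}
    for name in patterns:
        df = sum(1 for tf in page_tfs if tf[name] > 0)
        idf[name] = math.log((1 + n) / (1 + df)) + 1.0
    return idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def score_pages(pages, patterns, topic_weights):
    """Return one topic-relevance score per page.

    topic_weights stands in for the semantic importance of each pattern
    (assumed to be supplied by hand here).
    """
    page_tfs = [weighted_tf(page, patterns) for page in pages]
    idf = smoothed_idf(page_tfs, patterns)
    scores = []
    for tf in page_tfs:
        vector = {name: tf[name] * idf[name] for name in patterns}
        scores.append(cosine(vector, topic_weights))
    return scores


# Toy data, invented for this example.
pages = [
    {"title": "Web crawler design", "body": "A focused crawler ranks pages."},
    {"title": "Cooking pasta", "body": "Boil water."},
]
patterns = {"crawler": r"\bcrawlers?\b", "search": r"\bsearch(?:ing|es)?\b"}
topic_weights = {"crawler": 1.0, "search": 0.5}
scores = score_pages(pages, patterns, topic_weights)
```

In this toy run the first page matches the `crawler` pattern in both title and body and so scores above the second page, which matches nothing and scores zero. A crawler could use such a score to decide which links to follow, though the thresholds and weights would need tuning on real data.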

CLC Number: