Journal of Computer Applications ›› 2015, Vol. 35 ›› Issue (10): 2901-2904. DOI: 10.11772/j.issn.1001-9081.2015.10.2901


Focused topic Web crawler based on improved TF-IDF algorithm

WANG Jingzhong, QIU Tongxiang   

  1. School of Computer, North China University of Technology, Beijing 100144, China
  • Received: 2015-05-11 Revised: 2015-07-11 Online: 2015-10-10 Published: 2015-10-14

基于TF-IDF改进算法的聚焦主题网络爬虫

王景中, 邱铜相   

  1. 北方工业大学 计算机学院, 北京 100144
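  • Corresponding author: WANG Jingzhong (born 1962), male, from Tongliao, Inner Mongolia, professor. Research interests: information security, image processing, data mining. 1023623558@qq.com
  • About the author: QIU Tongxiang (born 1989), male, from Xinfeng, Jiangxi, master's student. Research interests: information security, data mining.
  • Supported by:
    National Natural Science Foundation of China (61371142); Beijing Municipal Innovation Team Building and Promotion Plan (IDHT20130502).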

Abstract: Web search results obtained with the traditional TF-IDF algorithm, the K-means algorithm and the adaptive genetic algorithm contain a large amount of irrelevant data, and their semantic retrieval accuracy is low. To address this, an improvement of the TF-IDF algorithm and its application in semantic retrieval were studied. The TF-IDF algorithm was improved by combining regular expressions with semantic analysis. The search topic was described by a semantic database. The similarity of each regular-expression atom in a document was obtained by a weighted calculation based on the semantic importance of the atom and its position within the Web page tags. The final search results were obtained by a cosine operation between the document similarity and the topic model in the vector space model. Finally, the improved TF-IDF algorithm, the traditional TF-IDF algorithm, the K-means algorithm and the adaptive genetic algorithm were each applied to a focused topic Web crawler, and their retrieval results were compared. The results show that, in the vertical search of the focused topic Web crawler, the accuracy of the improved TF-IDF algorithm was 17.1 percentage points higher than that of the traditional TF-IDF algorithm, and its omission rate was 7.76 percentage points lower. Compared with the K-means algorithm and the adaptive genetic algorithm, its accuracy was higher by 6 percentage points and 8.1 percentage points, respectively. In summary, the improved TF-IDF algorithm effectively raises the accuracy of document similarity detection and remedies the weakness of focused topic Web crawlers in semantic analysis.

Key words: Web crawler, semantic analysis, search engine, Term Frequency-Inverse Document Frequency (TF-IDF), topic crawler, document similarity
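
The weighting and cosine steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: plain word tokens stand in for the paper's regular-expression atoms, the tag weights in `TAG_WEIGHTS` and the topic weights in `topic` are hypothetical values chosen for illustration, and the smoothed IDF formula is an assumption rather than something stated in the abstract.

```python
import math
import re
from collections import Counter

# Hypothetical position weights: a term in <title> counts more than one in the body.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "body": 1.0}


def tokenize(text):
    """Lowercase word tokens; a stand-in for the paper's regular-expression atoms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weighted_tf(page):
    """page maps a tag name to its text; returns position-weighted, normalized TF."""
    tf = Counter()
    for tag, text in page.items():
        weight = TAG_WEIGHTS.get(tag, 1.0)
        for term in tokenize(text):
            tf[term] += weight
    total = sum(tf.values())
    return {t: c / total for t, c in tf.items()} if total else {}


def idf_table(pages):
    """Smoothed inverse document frequency over a list of pages (assumed formula)."""
    n = len(pages)
    df = Counter()
    for page in pages:
        terms = set()
        for text in page.values():
            terms.update(tokenize(text))
        df.update(terms)
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def tfidf_vector(page, idf):
    """Sparse TF-IDF vector of one page, keyed by term."""
    return {t: w * idf.get(t, 0.0) for t, w in weighted_tf(page).items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


pages = [
    {"title": "Focused web crawler", "body": "A crawler downloads pages for a search engine."},
    {"title": "Cooking pasta", "body": "Boil water and add salt."},
]
# Hypothetical topic description: term -> semantic importance.
topic = {"crawler": 1.0, "search": 0.5, "web": 0.5}

idf = idf_table(pages)
scores = [cosine(tfidf_vector(p, idf), topic) for p in pages]
print(scores)
```

In this toy run the first page shares terms with the topic and scores above zero, while the second shares none and scores zero. A focused crawler could use such a score to decide which pages to keep or which links to follow, although the threshold and the crawl policy are outside what the abstract specifies.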

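Abstract (translated from Chinese): To address the large amount of irrelevant data in Web retrieval results and the low accuracy of semantic retrieval with the traditional TF-IDF algorithm, the K-means algorithm and the adaptive genetic algorithm, an improvement of the TF-IDF algorithm and its application in semantic retrieval were studied. Regular expressions were combined with semantic analysis techniques to improve the TF-IDF algorithm. A semantic database was used to describe the search topic, and a weighted calculation based on the semantic importance of each regular-expression atom and its position within the Web page tags gave the similarity of the atom in the document. A cosine operation between the document similarity and the topic model in the vector space model then produced the final search results. Finally, the improved TF-IDF algorithm, the traditional TF-IDF algorithm, the K-means algorithm and the adaptive genetic algorithm were applied to a focused topic Web crawler, and their retrieval results were compared. The results show that, in the vertical search of focused topic Web crawler semantic analysis, the similarity accuracy of the improved TF-IDF algorithm was 17.1 percentage points higher than the retrieval accuracy of the traditional TF-IDF algorithm, and the omission rate was 7.76 percentage points lower. Its retrieval accuracy was 6 percentage points higher than that of the K-means algorithm and 8.1 percentage points higher than that of the adaptive genetic algorithm. In short, the improved TF-IDF algorithm can effectively raise the accuracy of document similarity detection and considerably remedy the weakness of focused topic Web crawlers in semantic analysis.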

Key words (translated from Chinese): Web crawler, semantic analysis, search engine, TF-IDF, topic crawler, document similarity
