Journal of Computer Applications, 2020, Vol. 40, Issue (8): 2255-2261. DOI: 10.11772/j.issn.1001-9081.2019122238

• Data Science and Technology •

融合本体和改进禁忌搜索策略的气象灾害主题爬虫方法

刘景发1,2, 顾瑶平1, 刘文杰1   

  1. 南京信息工程大学 计算机与软件学院, 南京 210044;
    2. 广东外语外贸大学 信息科学与技术学院, 广州 510006
  • 收稿日期:2020-01-07 修回日期:2020-03-10 出版日期:2020-08-10 发布日期:2020-05-14
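  • Corresponding author: LIU Jingfa (born 1972), male, from Hengyang, Hunan; Ph.D., professor, doctoral supervisor, CCF senior member. Research interests: Web crawlers, intelligent computing, ontology. E-mail: jfliu@gdufs.edu.cn
  • About the authors: GU Yaoping (born 1995), female, from Yancheng, Jiangsu; M.S. candidate. Research interests: intelligent computing, Web crawlers. LIU Wenjie (born 1979), male, from Jingzhou, Hubei; Ph.D., associate professor. Research interests: quantum machine learning, quantum algorithms and software.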
  • 基金资助:
    国家社会科学基金重大招标项目(16ZDA047);江苏省自然科学基金资助项目(BK20181409);广州市基础与应用基础研究项目。

Focused crawler method combining ontology and improved Tabu search for meteorological disaster

LIU Jingfa1,2, GU Yaoping1, LIU Wenjie1   

  1. School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, Jiangsu 210044, China;
    2. School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, Guangdong 510006, China
  • Received: 2020-01-07  Revised: 2020-03-10  Online: 2020-08-10  Published: 2020-05-14
  • Supported by:
    This work is partially supported by the Major Program of the National Social Science Foundation of China (16ZDA047), the Natural Science Foundation of Jiangsu Province (BK20181409), and the Project of Guangzhou Basic and Applied Fundamental Research.

摘要: 针对传统主题爬虫方法容易陷入局部最优和主题描述不足的问题,提出一种融合本体和改进禁忌搜索策略(On-ITS)的主题爬虫方法。首先利用本体语义相似度计算主题语义向量,基于超级文本标记语言(HTML)网页文本特征位置加权构建网页文本特征向量,然后采用向量空间模型计算网页的主题相关度。在此基础上,计算锚文本主题相关度以及链接指向网页的PR值,综合分析链接优先度。另外,为了避免爬虫陷入局部最优,设计了基于ITS的主题爬虫,优化爬行队列。以暴雨灾害和台风灾害为主题,在相同的实验环境下,基于On-ITS的主题爬虫方法比对比算法的爬准率最多高58%,最少高8%,其他评价指标也很好。基于On-ITS的主题爬虫方法能有效提高获取领域信息的准确性,抓取更多与主题相关的网页。

关键词: 主题爬虫, 禁忌搜索, 本体, 主题相关度, 气象灾害

Abstract: To address the tendency of traditional focused crawlers to fall into local optima and their insufficient topic description, a focused crawler method combining Ontology and Improved Tabu Search (On-ITS) was proposed. First, the topic semantic vector was calculated from ontology semantic similarity, and the Web page text feature vector was constructed by weighting text features according to their position in the HyperText Markup Language (HTML) page. Then, the vector space model was used to calculate the topic relevance of Web pages. On this basis, the topic relevance of each link's anchor text and the PageRank (PR) value of the Web page the link points to were calculated and combined to determine the link's priority. In addition, to keep the crawler from falling into local optima, a focused crawler based on ITS was designed to optimize the crawling queue. Experiments on the topics of rainstorm disaster and typhoon disaster show that, under the same experimental environment, the crawling accuracy of the On-ITS method exceeds that of the comparison algorithms by at most 58% and at least 8%, and that it also performs well on the other evaluation indicators. The On-ITS focused crawler method can effectively improve the accuracy of domain information acquisition and fetch more topic-related Web pages.

Key words: focused crawler, Tabu search, ontology, topic relevance, meteorological disaster
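
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the position weights, the blending factor `alpha`, the topic weights (standing in for ontology semantic similarities), and all function names below are assumptions chosen for illustration. The selection step is a plain tabu-list-filtered best-first choice, not the paper's improved tabu search.

```python
import math
from collections import deque

# Hypothetical position weights for HTML regions (not taken from the paper).
POSITION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def page_vector(regions, topic_terms):
    """Build a position-weighted term-frequency vector over the topic terms."""
    vec = [0.0] * len(topic_terms)
    for region, text in regions.items():
        tokens = text.lower().split()
        weight = POSITION_WEIGHTS.get(region, 1.0)
        for i, term in enumerate(topic_terms):
            vec[i] += weight * tokens.count(term)
    return vec


def cosine(u, v):
    """Vector space model relevance: cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_priority(anchor_relevance, pagerank, alpha=0.7):
    """Linear blend of anchor-text relevance and PageRank (alpha is assumed)."""
    return alpha * anchor_relevance + (1 - alpha) * pagerank


def pick_next(frontier, tabu):
    """Return the highest-priority URL that is not tabu, then mark it tabu."""
    candidates = [(p, url) for url, p in frontier.items() if url not in tabu]
    if not candidates:
        return None
    _, url = max(candidates)
    del frontier[url]
    tabu.append(url)  # a deque with maxlen drops its oldest entry when full
    return url
```

For example, with `topic_terms = ["rainstorm", "flood", "rainfall"]` and assumed topic weights `[1.0, 0.8, 0.6]`, a page whose title contains "rainstorm" scores higher than one that mentions it only in the body, because the title weight is larger. Passing `deque(maxlen=k)` as `tabu` keeps the last `k` selected URLs off-limits, which is the basic mechanism a tabu list uses to steer the search away from recently visited choices.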
