Journal of Computer Applications, 2023, Vol. 43, Issue (6): 1779-1784. DOI: 10.11772/j.issn.1001-9081.2022060929

• The 37th CCF National Conference of Computer Applications (CCF NCCA 2022) •

Self-adaptive Web crawler code generation method based on webpage source code structure comprehension

Yao LIU1, Ru LIU2, Yu ZHAI2

  1. Information Technology Support Center, Institute of Scientific and Technical Information of China, Beijing 100038, China
  2. School of Software and Microelectronics, Peking University, Beijing 102600, China
  • Received: 2022-06-28 Revised: 2022-08-22 Accepted: 2022-08-25 Online: 2022-09-22 Published: 2023-06-10
  • Contact: Yao LIU
  • About author: LIU Yao, born in 1972, native of Heze, Shandong, Ph. D., research fellow, CCF distinguished member. His research interests include natural language processing, knowledge engineering. Email: liuy@istic.ac.cn
    LIU Ru, born in 1998, native of Bozhou, Anhui, M. S. Her research interests include natural language processing, Web crawler.
    ZHAI Yu, born in 1998, native of Heze, Shandong, M. S. candidate. Her research interests include natural language processing, computer-aided translation.
  • Supported by:
    National Social Science Foundation of China (21BTQ011); National Key Research and Development Program of China (2018YFB143502)

Abstract:

Frequent webpage redesigns change webpage source code, in particular the element structures or attribute identifiers of target entities such as article dates, main text, and source organizations. These changes cause Web crawler code to fail and make manual maintenance costly. To address these problems, a self-adaptive Web crawler code generation method based on webpage source code structure comprehension was proposed. Firstly, the corresponding Web crawler code was extracted by analyzing the change patterns of webpage structural features. Secondly, an Encoder-Decoder model was used to represent the changes in the webpage source code and in the code, and an adaptive code generation model was obtained by fusing the structural semantic features of the webpage source code itself, the features of webpage source code changes, and the features of webpage code changes. Finally, the perception, generation, and activation mechanisms of the adaptive system were refined to form a Web crawler system with adaptive processing capability. In experiments, the proposed adaptive code generation model reached a final accuracy of 78.5%, and compared with the TF-IDF+Seq2Seq and TriDNR+Seq2Seq generation models it showed some superiority in representing webpage source code changes and in the effectiveness of code generation. The proposed method can therefore resolve the crawler code failures caused by webpage source code changes, and it offers a new idea for the adaptive processing capability of Web resource acquisition, namely Web crawler technology.

Key words: resource acquisition, webpage redesign, Hyper Text Markup Language (HTML), webpage source code comprehension, self-adaptive Web crawler
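The failure mode described in the abstract can be illustrated with a minimal sketch. It is not the authors' model: it only shows how an extraction rule tied to an attribute identifier breaks after a redesign, and how a crawler might perceive that before any code regeneration is triggered. The rule format, the function names (`extract`, `stale_rules`), and the sample HTML are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class AttrTextExtractor(HTMLParser):
    """Collect the text of the first element whose attribute equals a value.

    Matching is an exact string comparison, so a multi-class attribute such
    as class="a b" would not match "a" (a simplification for this sketch).
    """

    # Void elements have no closing tag, so they must not change nesting depth.
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self, name, value):
        super().__init__()
        self.name, self.value = name, value
        self.depth = 0       # > 0 while inside the matched element
        self.done = False    # True once the matched element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID or self.done:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get(self.name) == self.value:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in self.VOID or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract(html, rule):
    """Apply a hypothetical (attribute, value) rule; return text or None."""
    parser = AttrTextExtractor(*rule)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip() or None


def stale_rules(html, rules):
    """Return the entity names whose rule no longer matches anything."""
    return [name for name, rule in rules.items() if extract(html, rule) is None]


# Hypothetical page before and after a redesign that renames the date element.
OLD = '<div><span class="pub-date">2022-06-28</span><div id="content"><p>Body</p></div></div>'
NEW = '<div><time class="article-time">2022-06-28</time><div id="content"><p>Body</p></div></div>'
RULES = {"date": ("class", "pub-date"), "body": ("id", "content")}

print(stale_rules(OLD, RULES))
print(stale_rules(NEW, RULES))
```

In this sketch an empty extraction is treated as the signal that a rule has gone stale. A system like the one the abstract describes would presumably pass the flagged entity and the changed source to its generation model rather than to a human maintainer, but that step is outside what this example shows.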

CLC Number: