Self-adaptive Web crawler code generation method based on webpage source code structure comprehension

doi:10.11772/j.issn.1001-9081.2022060929

Journal of Computer Applications (official website) ›› 2023, Vol. 43 ›› Issue (6): 1779-1784.

Special topic: The 37th CCF National Conference of Computer Applications (CCF NCCA 2022)

Yao LIU¹, Ru LIU², Yu ZHAI²

^1. Information Technology Support Center, Institute of Scientific and Technical Information of China, Beijing 100038, China
^2. School of Software and Microelectronics, Peking University, Beijing 102600, China

Received: 2022-06-28 Revised: 2022-08-22 Accepted: 2022-08-25 Online: 2022-09-22 Published: 2023-06-10
Contact: Yao LIU
About author: LIU Yao, born in 1972 in Heze, Shandong, Ph. D., research fellow, CCF distinguished member. His research interests include natural language processing, knowledge engineering. Email: liuy@istic.ac.cn
LIU Ru, born in 1998 in Bozhou, Anhui, M. S. Her research interests include natural language processing, Web crawler.
ZHAI Yu, born in 1998 in Heze, Shandong, M. S. candidate. Her research interests include natural language processing, computer-aided translation.
Supported by:
National Social Science Foundation of China (21BTQ011); National Key Research and Development Program of China (2018YFB143502)

Abstract

Frequent webpage redesigns change webpage source code, in particular the element structures or attribute identifiers of target entities such as article dates, main text and source organizations. These changes cause Web crawler code to fail and make manual maintenance costly. To address this problem, a self-adaptive Web crawler code generation method based on webpage source code structure comprehension was proposed. Firstly, the change patterns of webpage structural features were analyzed to extract the corresponding Web crawler code. Secondly, an Encoder-Decoder model was used to represent the changes in the webpage source code and in the code, and an adaptive code generation model was obtained by fusing the structural semantic features of the webpage source code, the features of webpage source code changes and the features of webpage code changes. Finally, the perception, generation and activation mechanisms of the adaptive system were improved to form a Web crawler system with adaptive processing capability. Experimental results show that the proposed adaptive code generation model achieves a final accuracy of 78.5%, and that it has advantages over the TF-IDF+Seq2Seq and TriDNR+Seq2Seq generation models both in representing webpage source code changes and in the effectiveness of code generation. The proposed method can therefore resolve the crawler code failures caused by webpage source code changes, and it provides a new idea for adaptive processing in Web crawler technology, that is, in Web resource acquisition.

Key words: resource acquisition, webpage redesign, Hyper Text Markup Language (HTML), webpage source code comprehension, self-adaptive Web crawler
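
The perception, generation and activation mechanisms named in the abstract can be pictured with a small Python sketch. Everything here is an illustrative assumption rather than the authors' implementation: the two sample pages, the function names `perceive`, `generate` and `adapt`, and above all the generation step, which uses a simple date-pattern heuristic as a stand-in for the Encoder-Decoder code generation model described in the paper.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pages: the date element keeps its position, but a redesign
# renames its class attribute (an "attribute identifier" change).
OLD_PAGE = "<html><body><div class='meta'><span class='date'>2022-06-28</span></div></body></html>"
NEW_PAGE = "<html><body><div class='meta'><span class='pub-time'>2022-06-28</span></div></body></html>"

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")


def extract(page, xpath):
    """Run an extraction rule and return the matched text, or None."""
    node = ET.fromstring(page).find(xpath)
    return node.text if node is not None else None


def perceive(page, xpath):
    """Perception: report True when the current rule no longer matches."""
    return extract(page, xpath) is None


def generate(page):
    """Generation (stand-in heuristic, NOT the paper's model): build a new
    rule from the first element whose text looks like a date."""
    for node in ET.fromstring(page).iter():
        if node.text and DATE_RE.fullmatch(node.text.strip()):
            cls = node.get("class")
            return f".//{node.tag}[@class='{cls}']" if cls else f".//{node.tag}"
    return None


def adapt(page, xpath):
    """Activation: swap in a generated rule only if it actually extracts data."""
    if not perceive(page, xpath):
        return xpath
    candidate = generate(page)
    if candidate is not None and extract(page, candidate) is not None:
        return candidate
    return xpath


rule = ".//span[@class='date']"
new_rule = adapt(NEW_PAGE, rule)
print(new_rule, extract(NEW_PAGE, new_rule))
```

In this sketch the old rule still works on `OLD_PAGE`, so `adapt` leaves it untouched; on `NEW_PAGE` the rule is perceived as broken, a candidate is generated, and it is activated only after it is checked against the page.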

CLC number: TP391.1

Yao LIU, Ru LIU, Yu ZHAI. Self-adaptive Web crawler code generation method based on webpage source code structure comprehension[J]. Journal of Computer Applications, 2023, 43(6): 1779-1784.

Figures/Tables (8)

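Source code change type	Subtype	Change feature	Solution
Large-scale structural change	Large-scale structural change	1) The overall layout structure changes; the original parent element moves and the target element moves with it	Re-identify the webpage layout structure, lock the parent node range, classify the child nodes
Small-scale structural change	The original parent element does not move; the target stays within the original layout structure	2) A target identified by its order moves horizontally	Lock the parent node, classify the child nodes
	The original parent element does not move; the target stays within the original layout structure	3) A target identified by its attribute moves vertically
	The target element position is unchanged	4) The element identifier changes
	The target element position is unchanged	5) Structure addition: the target data is spread over multiple tags	Replace the multiple child nodes with the parent node
No structural change	The element identifier is unchanged	6) The date format changes	Re-identify the content format
No structural change	The data content changes	7) The data content is deleted	Re-identify the content format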
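
Two of the solutions in the table above, "lock the parent node, classify the child nodes" (change feature 2) and "replace the multiple child nodes with the parent node" (change feature 5), can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the sample pages, the helper names `classify_children` and `parent_text`, the `date`/`source` labels and the regular-expression classifier are all hypothetical and stand in for whatever classifier the paper actually uses.

```python
import re
import xml.etree.ElementTree as ET

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

# Hypothetical pages: after the redesign the two <span> children swap order,
# so a rule such as "first span = date" would pick the wrong element.
BEFORE = "<div><p class='info'><span>2022-06-28</span><span>ISTIC</span></p></div>"
AFTER = "<div><p class='info'><span>ISTIC</span><span>2022-06-28</span></p></div>"

# Hypothetical page for feature 5: the body text is split across several tags.
SPLIT = "<div><p class='body'><span>Part one. </span><span>Part two.</span></p></div>"


def classify_children(page, parent_xpath):
    """Lock the parent node, then label each child by its content, not its position."""
    parent = ET.fromstring(page).find(parent_xpath)
    labels = {}
    for child in parent:
        text = "".join(child.itertext()).strip()
        if DATE_RE.fullmatch(text):
            labels["date"] = text
        elif text:
            # Assumed fallback label for the first non-date child.
            labels.setdefault("source", text)
    return labels


def parent_text(page, parent_xpath):
    """Use the parent node in place of its children: join all descendant text."""
    parent = ET.fromstring(page).find(parent_xpath)
    return "".join(parent.itertext()).strip()


print(classify_children(BEFORE, ".//p[@class='info']"))
print(classify_children(AFTER, ".//p[@class='info']"))
print(parent_text(SPLIT, ".//p[@class='body']"))
```

Because the children are labelled by content, both page versions yield the same date and source values even though their order differs, and the split body text is recovered from the parent without depending on how many child tags hold it.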

Date	Log errors (deduplicated)	Websites with errors	Erroneous entities: title	Erroneous entities: date	Erroneous entities: body	Erroneous entities: other	Accuracy/%	Recall/%
02-22	22	7	40	422	210		92.6	95.6
02-23	25	8	33	357	180		89.9	97.0
02-24	19	4	27	293	148		90.1	97.2
02-25	23	5	21	234	118		91.6	96.3
02-26	22	5	15	173	89		90.0	96.1
02-27	18	4	10	119	57	11	88.5	95.4
02-28	13	3	5	58	28	5	88.7	94.8

Date	Classification accuracy/%	XPath code conversion rate/%	Code generation accuracy/%
02-22	82.1	95.0	70.4
02-23	80.5	91.9	67.6
02-24	73.3	94.2	72.1
02-25	79.9	96.3	71.8
02-26	75.4	97.1	75.5
02-27	83.0	93.6	79.3
02-28	79.8	94.9	73.8

Model	Accuracy/%	Recall/%	Final accuracy/%
TF-IDF+Seq2Seq	75.58	68.31	61.7
TriDNR+Seq2Seq	81.32	79.49	69.9
(TriDNR+ED)+Seq2Seq	83.70	80.06	78.5