Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (6): 1779-1784. DOI: 10.11772/j.issn.1001-9081.2022060929
Special Issue: The 37th CCF National Conference of Computer Applications (CCF NCCA 2022)
Received: 2022-06-28
Revised: 2022-08-22
Accepted: 2022-08-25
Online: 2022-09-22
Published: 2023-06-10
Contact: Yao LIU (Email: liuy@istic.ac.cn)
About authors:
LIU Yao, born in 1972 in Heze, Shandong, Ph. D., research fellow, distinguished member of CCF. His research interests include natural language processing and knowledge engineering.
LIU Ru, born in 1998, M. S. Her research interests include natural language processing and Web crawlers.
Yao LIU, Ru LIU, Yu ZHAI. Self-adaptive Web crawler code generation method based on webpage source code structure comprehension[J]. Journal of Computer Applications, 2023, 43(6): 1779-1784.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2022060929
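Source code change type | Subtype | Change characteristic | Solution |
---|---|---|---|
Large-scale structural change | Large-scale structural change | 1) The overall layout structure changes; the original parent element moves and the target element moves with it | Re-identify the webpage layout structure, lock the parent node range, classify child nodes |
Small-scale structural change | The original parent element does not move; the target remains within the original layout structure | 2) A target identified by its order moves horizontally | Lock the parent node, classify child nodes |
Small-scale structural change | The original parent element does not move; the target remains within the original layout structure | 3) A target identified by its attribute moves vertically | Lock the parent node, classify child nodes |
Small-scale structural change | The target element position does not change | 4) The element identifier changes | Lock the parent node, classify child nodes |
Small-scale structural change | The target element position does not change | 5) Structure added: the target data is spread across multiple tags | Replace the multiple child nodes with the parent node |
No structural change | The element identifier does not change | 6) The date format changes | Re-identify the content format |
No structural change | The data content changes | 7) The data content is deleted | |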
Tab. 1 Seven change types of source code
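The "lock the parent node, classify child nodes" remedy in Tab. 1 can be illustrated with a minimal sketch. This is not the authors' implementation: the page fragment, the `classify_child` rules, and the `extract_entities` helper are all hypothetical, and the standard-library `xml.etree.ElementTree` parser is used only because the sample is well-formed (real HTML would normally need a tolerant HTML parser). The idea shown is that once the parent block is located, each child is labeled by its content rather than by its position or class name, so a reordered or renamed child is still found.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical page fragment: the target fields sit inside one stable parent
# block, but their order and class names may change between site versions.
PAGE = """
<html><body>
  <div id="article">
    <span class="pub-time">2022-02-22</span>
    <h1 class="headline">Sample headline</h1>
    <p>Body text of the article.</p>
  </div>
</body></html>
"""

# Assumed date pattern; a real crawler would need to cover more formats.
DATE_RE = re.compile(r"\d{4}[-/.]\d{1,2}[-/.]\d{1,2}")


def classify_child(elem):
    """Label a child node by its content and tag, not by position or class."""
    text = (elem.text or "").strip()
    if DATE_RE.fullmatch(text):
        return "date"
    if elem.tag in {"h1", "h2"}:
        return "title"
    if elem.tag == "p" and text:
        return "body"
    return "other"


def extract_entities(page, parent_xpath):
    """Lock the parent node, then classify each of its direct children."""
    root = ET.fromstring(page)
    parent = root.find(parent_xpath)
    if parent is None:
        return {}
    entities = {}
    for child in parent:
        label = classify_child(child)
        if label != "other":
            # Keep the first child found for each entity type.
            entities.setdefault(label, (child.text or "").strip())
    return entities


result = extract_entities(PAGE, ".//div[@id='article']")
print(result)
```

Because the children are classified by content, swapping the `span` and `h1` elements or renaming their `class` attributes leaves the extracted entities unchanged in this sketch; a change to the parent block itself would still require re-identifying the layout, as in change type 1.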
Date | Log errors (deduplicated) | Websites with errors | Erroneous entities: title | Erroneous entities: date | Erroneous entities: body | Erroneous entities: other | Accuracy/% | Recall/% |
---|---|---|---|---|---|---|---|---|
02-22 | 22 | 7 | 40 | 422 | 210 | | 92.6 | 95.6 |
02-23 | 25 | 8 | 33 | 357 | 180 | | 89.9 | 97.0 |
02-24 | 19 | 4 | 27 | 293 | 148 | | 90.1 | 97.2 |
02-25 | 23 | 5 | 21 | 234 | 118 | | 91.6 | 96.3 |
02-26 | 22 | 5 | 15 | 173 | 89 | | 90.0 | 96.1 |
02-27 | 18 | 4 | 10 | 119 | 57 | 11 | 88.5 | 95.4 |
02-28 | 13 | 3 | 5 | 58 | 28 | 5 | 88.7 | 94.8 |
Tab. 2 Data statistics of Web crawler log database
Date | Classification accuracy/% | XPath code conversion rate/% | Code generation accuracy/% |
---|---|---|---|
02-22 | 82.1 | 95.0 | 70.4 |
02-23 | 80.5 | 91.9 | 67.6 |
02-24 | 73.3 | 94.2 | 72.1 |
02-25 | 79.9 | 96.3 | 71.8 |
02-26 | 75.4 | 97.1 | 75.5 |
02-27 | 83.0 | 93.6 | 79.3 |
02-28 | 79.8 | 94.9 | 73.8 |
Tab. 3 Accuracy of adaptive entity extraction code generation method
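External basis | Combined internal and external basis | Internal basis |
---|---|---|
Changes in the webpage source code | Actual number of updates on the website | Statistics from the crawler run log |
Locating effect of the XPath code | Number of links in the crawler's Redis cache pool (successfully captured) | Node at which the crawler thread stalled |
 | Number of links in the crawler's MongoDB (successfully parsed) | Error message reported when the crawler thread stalled |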
Tab. 4 Internal and external basis for judging webpage source code change events
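As a rough illustration of how the bases in Tab. 4 might be combined, the sketch below compares the number of pages a site actually published with the links captured into Redis and the links parsed into MongoDB. Everything here is an assumption made for illustration: the `CrawlSnapshot` fields, the `tolerance` threshold, and the mapping from a shortfall to a diagnosis are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class CrawlSnapshot:
    """Per-site counters for one crawl cycle (all field names are hypothetical)."""
    site_updates: int  # actual number of new pages published by the site
    redis_links: int   # links successfully captured into the Redis cache pool
    mongo_links: int   # links successfully parsed and stored in MongoDB
    log_errors: int    # deduplicated errors in the crawler run log


def diagnose(s, tolerance=0.9):
    """Guess where extraction broke by comparing the counters stage by stage."""
    if s.site_updates == 0:
        return "no_update"
    # Fewer captured links than published pages: link discovery likely failed.
    if s.redis_links < tolerance * s.site_updates:
        return "list_page_changed"
    # Links were captured but not parsed, or the log reports errors:
    # the detail-page XPath locators likely no longer match.
    if s.mongo_links < tolerance * s.redis_links or s.log_errors > 0:
        return "detail_page_changed"
    return "ok"


print(diagnose(CrawlSnapshot(site_updates=10, redis_links=10, mongo_links=4, log_errors=2)))
```

The stage-by-stage comparison is one simple design choice; a production system would likely also use the stall node and error message from the internal basis column to pinpoint the failing locator.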
Model | Accuracy/% | Recall/% | Final accuracy/% |
---|---|---|---|
TF-IDF+Seq2Seq | 75.58 | 68.31 | 61.7 |
TriDNR+Seq2Seq | 81.32 | 79.49 | 69.9 |
(TriDNR+ED)+Seq2Seq | 83.70 | 80.06 | 78.5 |
Tab. 5 Results of adaptive Web crawler code generation