Web page blacklist discrimination method based on attention mechanism and ensemble learning

doi:10.11772/j.issn.1001-9081.2020081379

Journal of Computer Applications, 2021, Vol. 41, Issue (1): 133-138. DOI: 10.11772/j.issn.1001-9081.2020081379

Special topic: The 8th China Conference on Data Mining (CCDM 2020)


ZHOU Chaoran, ZHAO Jianping, MA Tai, ZHOU Xin

School of Computer Science and Technology, Changchun University of Science and Technology, Changchun Jilin 130022, China

Received: 2020-07-31 Revised: 2020-10-15 Online: 2020-11-25 Published: 2021-01-10
Corresponding author: ZHAO Jianping
About the authors: ZHOU Chaoran (born 1994), male, from Jilin, Jilin Province, Ph.D. candidate; main research interests: data mining and urban computing. ZHAO Jianping (born 1964), male, from Yushu, Jilin Province, professor, Ph.D.; main research interests: data mining and computer networks. MA Tai (born 1996), male, from Qinhuangdao, Hebei Province, M.S. candidate; main research interests: urban computing and deep learning. ZHOU Xin (born 1997), female, from Panshi, Jilin Province, M.S. candidate; main research interests: natural language processing, and pedestrian modeling and simulation.
Supported by:
This work is partially supported by the Science and Technology Development Program of Jilin Province (20190303133SF) and the "13th Five-Year Plan" Science and Technology Project of Jilin Provincial Education Department (JJKH20200796KJ).

Abstract

Abstract: As one of the main Internet applications, a search engine retrieves and returns useful information from Internet resources according to user needs. However, the returned list often contains noisy information such as advertisements and invalid Web pages, which interferes with the user's search and query. To handle the complex structural features and rich semantic information of Web pages, a Web page blacklist discrimination method based on attention mechanism and ensemble learning was proposed, and an Ensemble learning and Attention mechanism-based Convolutional Neural Network (EACNN) model was built with this method to filter useless Web pages. First, multiple attention-based Convolutional Neural Network (CNN) base learners were built, one for each category of HTML tag data on a Web page. Second, an ensemble learning method based on Web page structural features was used to assign a different weight to the output of each base learner, which completes the construction of EACNN. Finally, the output of EACNN was taken as the result of the Web page content analysis and used to decide whether the page belongs on the blacklist. The proposed method attends to the semantic information of Web pages through the attention mechanism and introduces their structural features through ensemble learning. Experimental results show that, compared with baseline models such as Support Vector Machine (SVM), K-Nearest Neighbor (KNN), CNN, Long Short-Term Memory (LSTM) network, Gated Recurrent Unit (GRU) and Attention-based CNN (ACNN), EACNN achieves the highest accuracy (0.97), recall (0.95) and F₁ score (0.96) on a discrimination dataset constructed for the geographic information field, which verifies the advantages of EACNN in the Web page blacklist discrimination task.

Key words: Web page blacklist, discrimination model, Web page structural feature, semantic information, attention mechanism, ensemble learning, deep learning
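The ensemble step described in the abstract, in which the outputs of the per-tag base learners are combined with different weights, can be illustrated with a minimal sketch. This is not the authors' implementation: the tag names, the weight values, the 0.5 threshold, and the function names `ensemble_blacklist_score` and `is_blacklisted` are all hypothetical, and the base learners are assumed to have already produced a blacklist probability for each HTML tag category.

```python
from typing import Dict


def ensemble_blacklist_score(
    tag_probs: Dict[str, float],
    tag_weights: Dict[str, float],
) -> float:
    """Weighted average of per-tag base-learner probabilities.

    tag_probs:   blacklist probability produced by the base learner of
                 each HTML tag category found on the page (assumed given).
    tag_weights: hypothetical structural weight of each tag category.

    Tag categories without a weight are skipped, and the weights of the
    remaining categories are renormalized so they sum to 1.
    """
    present = [tag for tag in tag_probs if tag in tag_weights]
    total_weight = sum(tag_weights[tag] for tag in present)
    if total_weight <= 0:
        raise ValueError("no weighted tag category is present on this page")
    weighted_sum = sum(tag_weights[tag] * tag_probs[tag] for tag in present)
    return weighted_sum / total_weight


def is_blacklisted(score: float, threshold: float = 0.5) -> bool:
    """Flag the page when the ensemble score reaches the threshold.

    The 0.5 default is an illustrative assumption, not a value from the paper.
    """
    return score >= threshold


if __name__ == "__main__":
    # Hypothetical base-learner outputs and weights for three tag categories.
    probs = {"title": 0.9, "body": 0.6, "a": 0.2}
    weights = {"title": 0.5, "body": 0.3, "a": 0.2}
    score = ensemble_blacklist_score(probs, weights)
    print(round(score, 2), is_blacklisted(score))
```

Renormalizing over the tag categories that are actually present is one possible design choice for pages that lack some tags; the abstract does not say how the method treats such pages or how its weights are obtained.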

CLC number:

TP391.1

ZHOU Chaoran, ZHAO Jianping, MA Tai, ZHOU Xin. Web page blacklist discrimination method based on attention mechanism and ensemble learning[J]. Journal of Computer Applications, 2021, 41(1): 133-138.

