Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (1): 133-138. DOI: 10.11772/j.issn.1001-9081.2020081379

Special topic: The 8th China Conference on Data Mining (CCDM 2020)


基于注意力机制和集成学习的网页黑名单判别方法

周超然, 赵建平, 马太, 周欣   

  1. 长春理工大学 计算机科学技术学院, 长春 130022
  • 收稿日期:2020-07-31 修回日期:2020-10-15 出版日期:2021-01-10 发布日期:2020-11-25
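  • Corresponding author: ZHAO Jianping
  • About the authors: ZHOU Chaoran (born 1994), male, from Jilin City, Jilin; Ph.D. candidate; research interests: data mining, urban computing. ZHAO Jianping (born 1964), male, from Yushu, Jilin; professor, Ph.D.; research interests: data mining, computer networks. MA Tai (born 1996), male, from Qinhuangdao, Hebei; M.S. candidate; research interests: urban computing, deep learning. ZHOU Xin (born 1997), female, from Panshi, Jilin; M.S. candidate; research interests: natural language processing, pedestrian modeling and simulation.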
  • 基金资助:
    吉林省科技发展计划项目(20190303133SF);吉林省教育厅“十三五”科学技术项目(JJKH20200796KJ)。

Web page blacklist discrimination method based on attention mechanism and ensemble learning

ZHOU Chaoran, ZHAO Jianping, MA Tai, ZHOU Xin   

  1. School of Computer Science and Technology, Changchun University of Science and Technology, Changchun, Jilin 130022, China
  • Received:2020-07-31 Revised:2020-10-15 Online:2021-01-10 Published:2020-11-25
  • Supported by:
    This work is partially supported by the Science and Technology Development Program of Jilin Province (20190303133SF) and the "13th Five-Year Plan" Science and Technology Project of Jilin Provincial Education Department (JJKH20200796KJ).

摘要: 搜索引擎作为互联网主要应用之一,能够根据用户需求从互联网资源中检索并返回有效信息。然而,得到的返回列表往往包含广告和失效网页等噪声信息,而这些信息会干扰用户的检索与查询。针对复杂的网页结构特征和丰富的语义信息,提出了一种基于注意力机制和集成学习的网页黑名单判别方法,并采用本方法构建了一种基于集成学习和注意力机制的卷积神经网络(EACNN)模型来过滤无用的网页。首先,根据网页上不同种类的HTML标签数据,构建多个基于注意力机制的卷积神经网络(CNN)基学习器;然后,采用基于网页结构特征的集成学习方法对不同基学习器的输出结果执行不同的权重计算,从而实现EACNN的构建;最后,将EACNN的输出结果作为网页内容分析结果,从而实现网页黑名单的判别。所提方法通过注意力机制来关注网页语义信息,并通过集成学习的方式引入网页结构特征。实验结果表明,与支持向量机(SVM)、K近邻(KNN)、CNN、长短期记忆(LSTM)网络、GRU、结合注意力机制的卷积神经网络(ACNN)等基线模型相比,所提模型在所构建的面向地理信息领域的判别数据集上具有最高的准确率(0.97)、召回率(0.95)和F1分值(0.96),验证了EACNN在网页黑名单判别工作中的优势。

关键词: 网页黑名单, 判别模型, 网页结构特征, 语义信息, 注意力机制, 集成学习, 深度学习

Abstract: As one of the main Internet applications, search engines retrieve and return useful information from Internet resources according to user needs. However, the returned list often contains noisy information such as advertisements and invalid Web pages, which interferes with the user's search and query. To address the complex structural features and rich semantic information of Web pages, a Web page blacklist discrimination method based on attention mechanism and ensemble learning was proposed, and an Ensemble learning and Attention mechanism-based Convolutional Neural Network (EACNN) model was built with this method to filter out useless Web pages. First, multiple attention-based Convolutional Neural Network (CNN) base learners were built according to the different categories of HTML tag data on Web pages. Second, an ensemble learning method based on Web page structural features was used to assign different weights to the outputs of the different base learners, thereby constructing EACNN. Finally, the output of EACNN was taken as the analysis result of the Web page content and used to discriminate the Web page blacklist. The proposed method attends to the semantic information of Web pages through the attention mechanism and introduces the structural features of Web pages through ensemble learning. Experimental results show that, compared with baseline models such as Support Vector Machine (SVM), K-Nearest Neighbor (KNN), CNN, Long Short-Term Memory (LSTM) network, Gated Recurrent Unit (GRU) and Attention-based CNN (ACNN), EACNN achieves the highest accuracy (0.97), recall (0.95) and F1 score (0.96) on a self-constructed discrimination dataset for the geographic information domain, which verifies the advantages of EACNN in the task of Web page blacklist discrimination.

Key words: Web page blacklist, discrimination model, Web structural feature, semantic information, attention mechanism, ensemble learning, deep learning
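
The ensemble step described in the abstract, in which the outputs of base learners trained on different HTML tag categories are combined with different weights, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the weighting scheme, so the tag groups (`title`, `body`, `anchor`), the weight values, the decision threshold, and the function names `ensemble_score` and `is_blacklisted` are all hypothetical, and the attention-based CNN base learners are replaced by fixed example probabilities.

```python
from typing import Dict


def ensemble_score(base_probs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-tag-group blacklist probabilities.

    base_probs maps a (hypothetical) HTML tag group to the probability,
    produced by that group's base learner, that the page is blacklisted.
    weights maps the same tag groups to assumed structural weights.
    """
    # Use only tag groups that produced a prediction and have a weight.
    present = [group for group in base_probs if group in weights]
    total = sum(weights[group] for group in present)
    if total <= 0:
        raise ValueError("no weighted base-learner output available")
    # Normalize by the weights actually used, so missing tag groups
    # do not pull the score toward zero.
    return sum(weights[group] * base_probs[group] for group in present) / total


def is_blacklisted(base_probs: Dict[str, float],
                   weights: Dict[str, float],
                   threshold: float = 0.5) -> bool:
    """Flag the page when the ensemble score reaches the (assumed) threshold."""
    return ensemble_score(base_probs, weights) >= threshold


# Hypothetical weights and base-learner outputs, for illustration only.
weights = {"title": 0.5, "body": 0.3, "anchor": 0.2}
probs = {"title": 0.9, "body": 0.6, "anchor": 0.2}
score = ensemble_score(probs, weights)
print(f"ensemble score: {score:.2f}, blacklisted: {is_blacklisted(probs, weights)}")
```

Normalizing by the weights of the tag groups that are actually present is one possible design choice for pages that lack some tag categories; the paper may handle this case differently.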

CLC number: