Journal of Computer Applications, 2018, Vol. 38, Issue (11): 3063-3068. DOI: 10.11772/j.issn.1001-9081.2018041356

• The 7th China Conference on Data Mining (CCDM 2018) •

Neural network model for spam review detection based on a hierarchical attention mechanism

LIU Yuxin1, WANG Li2, ZHANG Hao1

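  1. College of Information and Computer, Taiyuan University of Technology, Jinzhong, Shanxi 030600, China;
  2. College of Data Science, Taiyuan University of Technology, Jinzhong, Shanxi 030600, China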
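  • Received: 2018-04-30 Revised: 2018-06-26 Online: 2018-11-10 Published: 2018-11-10
  • Corresponding author: WANG Li
  • About the authors: LIU Yuxin (born 1984), female, from Taiyuan, Shanxi, Ph.D. candidate; main research interests: data mining, machine learning, deep learning. WANG Li (born 1971), female, from Taiyuan, Shanxi, professor, Ph.D.; main research interests: big data computing and analysis, knowledge graphs, data mining, artificial intelligence. ZHANG Hao (born 1988), male, from Taiyuan, Shanxi, lecturer, Ph.D.; main research interest: complex networks.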
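  • Supported by:
    National High Technology Research and Development Program (863 Program) of China (2014AA015204); National Natural Science Foundation of China (61702356); Natural Science Foundation of Shanxi Province (201703D421013); Project of the Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences (CASNDST20140X).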

Hierarchical attention-based neural network model for spam review detection

LIU Yuxin1, WANG Li2, ZHANG Hao1   

  1. College of Information and Computer, Taiyuan University of Technology, Jinzhong Shanxi 030600, China;
    2. College of Data Science, Taiyuan University of Technology, Jinzhong Shanxi 030600, China
  • Received: 2018-04-30 Revised: 2018-06-26 Online: 2018-11-10 Published: 2018-11-10
  • Supported by:
    This work is partially supported by the National High Technology Research and Development Program (2014AA015204), the National Natural Science Foundation of China (61702356), the Natural Science Foundation of Shanxi Province (201703D421013), and the Key Laboratory Project of Network Data Science and Technology in the Institute of Computing Technology, Chinese Academy of Sciences (CASNDST20140X).

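Abstract: To address the problem that existing spam review detection methods can hardly reveal the latent semantic information of user reviews, a Hierarchical Attention-based Neural Network (HANN) detection model is proposed. The model consists mainly of two parts: the Word2Sent layer, which uses a Convolutional Neural Network (CNN) to generate continuous sentence representations from word embeddings; and the Sent2Doc layer, which uses an attention-pooling neural network to generate a document representation from the sentence representations produced by the previous layer. The resulting document representation is used directly as the final feature of a review and is classified with a softmax classifier. By fully preserving the position and intensity features of a review and extracting important, comprehensive information from them (the history, future, and local context at any position in the document), the model mines the latent semantic information of user reviews and thereby improves the accuracy of spam review detection. Experimental results show that, compared with methods based on neural networks alone, the model improves accuracy by 5% on average and significantly improves classification performance.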

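Key words: spam review, representation learning, attention mechanism, Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (BLSTM)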

Abstract: Existing methods for detecting spam reviews mainly design features from linguistic and psychological clues, which hardly reveal the latent semantic information of the reviews. To mine this information, a Hierarchical Attention-based Neural Network (HANN) model was proposed. The model consists mainly of two layers: the Word2Sent layer, which uses a Convolutional Neural Network (CNN) to produce continuous sentence representations from word embeddings, and the Sent2Doc layer, which uses an attention pooling-based neural network to generate document representations from the sentence representations. The generated document representations are used directly as features to identify spam reviews. The hierarchical attention mechanism enables the model to preserve position and intensity information completely, so that comprehensive information (the history, future, and local context of any position in a document) can be extracted. Experimental results show that, compared with methods based on neural networks alone, the proposed model increases accuracy by 5% on average and significantly improves classification performance.
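
The two-layer pipeline described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The abstract does not specify the pooling inside the CNN, the attention scoring function, or any hyperparameters, so the sketch assumes max-over-time pooling with a tanh activation in Word2Sent and dot-product scoring against a context vector in Sent2Doc. It also omits the BLSTM listed in the key words. All names (`word2sent`, `sent2doc`, `classify`, `context`) and the random, untrained parameters are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def word2sent(word_vectors, filters, window=2):
    """Word2Sent (assumed form): slide each filter over windows of word
    vectors, apply tanh, and max-pool over positions.
    Returns one feature per filter as the sentence representation."""
    d = len(word_vectors[0])
    # Pad short sentences with zero vectors so at least one window exists.
    padded = word_vectors + [[0.0] * d] * max(0, window - len(word_vectors))
    windows = [
        [x for vec in padded[i:i + window] for x in vec]
        for i in range(len(padded) - window + 1)
    ]
    return [max(math.tanh(dot(f, w)) for w in windows) for f in filters]


def sent2doc(sentence_vectors, context):
    """Sent2Doc (assumed form): attention pooling. Each sentence is scored
    against a context vector, scores are normalized with softmax, and the
    document vector is the weighted sum of sentence vectors."""
    weights = softmax([dot(s, context) for s in sentence_vectors])
    dim = len(sentence_vectors[0])
    doc = [sum(w * s[k] for w, s in zip(weights, sentence_vectors))
           for k in range(dim)]
    return doc, weights


def classify(doc_vector, class_weights):
    """Softmax classifier over the document representation."""
    return softmax([dot(doc_vector, w) for w in class_weights])


# Toy data: random embeddings and parameters (hypothetical, untrained).
rng = random.Random(0)
EMBED_DIM, NUM_FILTERS, WINDOW = 4, 3, 2


def rand_vec(n):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


# One review with three sentences of 5, 1, and 3 words.
review = [[rand_vec(EMBED_DIM) for _ in range(n)] for n in (5, 1, 3)]
filters = [rand_vec(WINDOW * EMBED_DIM) for _ in range(NUM_FILTERS)]
context = rand_vec(NUM_FILTERS)
class_weights = [rand_vec(NUM_FILTERS) for _ in range(2)]  # [genuine, spam]

sentence_vectors = [word2sent(s, filters, WINDOW) for s in review]
doc_vector, attention = sent2doc(sentence_vectors, context)
probs = classify(doc_vector, class_weights)

print("attention weights:", [round(a, 3) for a in attention])
print("class probabilities:", [round(p, 3) for p in probs])
```

In this sketch the attention weights sum to 1 across sentences, so the document vector is a convex combination of the sentence vectors. That is one simple way to let informative sentences dominate the final feature while keeping every position's contribution.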

Key words: spam review, representation learning, attention mechanism, Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (BLSTM)

CLC number: