Journal of Computer Applications, 2022, Vol. 42, Issue (3): 770-777. DOI: 10.11772/j.issn.1001-9081.2021040791

• 2021 CCF Conference on Artificial Intelligence (CCFAI 2021)

Analysis of complex spam filtering algorithm based on neural network

Jian ZHANG, Ke YAN, Xiang MA

  1. College of Information Engineering, China Jiliang University, Hangzhou, Zhejiang 310018, China
  • Received: 2021-05-17 Revised: 2021-06-04 Accepted: 2021-06-09 Online: 2021-11-09 Published: 2022-03-10
  • Contact: Ke YAN
  • About author: ZHANG Jian, born in 1997 in Gao'an, Jiangxi, M. S. candidate. His research interests include text classification and emotion recognition with multi-task learning.
    MA Xiang, born in 1984 in Shijiazhuang, Hebei, Ph. D., lecturer, CCF member. His research interests include machine vision and human-computer interaction.
  • Supported by:
    Zhejiang Provincial Natural Science Foundation (LY19F020016)

Abstract:

Spam recognition is one of the main tasks in natural language processing. Traditional methods are based on text features or word frequency, so their recognition accuracy depends mainly on whether specific keywords are present. They perform poorly when keywords are recognized incorrectly or when the spam text contains no keywords at all. To address this problem, neural network-based methods were proposed. First, traditional methods were trained and tested on this kind of spam text. Then, the spam that the traditional methods found difficult to recognize was selected from spam message, advertisement and spam email datasets, and the same number of normal messages was randomly selected from the original datasets, forming three new datasets without duplicate data. Finally, three models were built on the basis of the convolutional neural network and the recurrent neural network and were trained for recognition on the new datasets. The experimental results show that the neural network-based methods learn better semantic features from the text and achieve accuracies of more than 98% on all three datasets, higher than those of traditional methods such as Naive Bayes (NB), Random Forest (RF) and Support Vector Machine (SVM). The results also show that different neural networks suit text classification at different lengths: the model composed of recurrent neural networks is good at recognizing sentence-length text, the model composed of convolutional neural networks is good at recognizing paragraph-length text, and the model composed of both is good at recognizing document-length text.

Key words: spam, recognition and filtering, text feature, word frequency, neural network
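The dataset construction step described in the abstract (keep the spam that a traditional method fails to recognize, then add the same number of randomly chosen normal messages, with no duplicates) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `keyword_baseline` and `build_hard_dataset`, the keyword list, and the toy messages are all hypothetical, and a simple keyword rule stands in for the NB, RF and SVM baselines used in the paper.

```python
import random


def keyword_baseline(text, keywords=("free", "winner", "prize")):
    """Hypothetical stand-in for a traditional filter: 1 (spam) if any keyword token appears."""
    tokens = text.lower().split()
    return int(any(k in tokens for k in keywords))


def build_hard_dataset(messages, labels, baseline_predict, seed=0):
    """Keep spam the baseline misses, plus an equal number of random normal messages.

    labels use 1 for spam and 0 for normal. Exact duplicate texts are dropped.
    """
    seen = set()
    hard_spam, normal = [], []
    for text, label in zip(messages, labels):
        if text in seen:  # skip exact duplicates
            continue
        seen.add(text)
        if label == 1:
            if baseline_predict(text) == 0:  # spam the baseline fails to flag
                hard_spam.append(text)
        else:
            normal.append(text)

    rng = random.Random(seed)
    k = min(len(hard_spam), len(normal))  # balance the two classes
    dataset = [(t, 1) for t in hard_spam[:k]]
    dataset += [(t, 0) for t in rng.sample(normal, k)]
    rng.shuffle(dataset)
    return dataset


# Toy data, invented purely for illustration.
messages = [
    "Free prize winner claim now",                       # spam, caught by keywords
    "Your parcel is held, confirm details at the link",  # spam, no keywords
    "Your parcel is held, confirm details at the link",  # exact duplicate
    "Account notice: verify your login today",           # spam, no keywords
    "Lunch at noon tomorrow?",
    "Meeting moved to 3pm",
    "Can you send me the report",
]
labels = [1, 1, 1, 1, 0, 0, 0]

dataset = build_hard_dataset(messages, labels, keyword_baseline)
```

The resulting `dataset` contains only spam that the stand-in baseline misses, balanced against normal messages, which is the kind of data on which the neural models in the paper are then trained.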
