Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (1): 158-162. DOI: 10.11772/j.issn.1001-9081.2016.01.0158

• Artificial Intelligence •

Spam filtering based on modified stack auto-encoder

SHEN Cheng'en, HE Jun, DENG Yang

  1. College of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China
  • Received: 2015-07-29 Revised: 2015-09-06 Online: 2016-01-10 Published: 2016-01-09
  • Corresponding author: HE Jun (born 1970), male, from Pingxiang, Jiangxi; associate professor, Ph.D.; main research interests: computer networks, intelligent machines.
  • About the authors: SHEN Cheng'en (born 1990), male, from Lu'an, Anhui; master's student, CCF member; main research interests: machine learning, neural networks. DENG Yang (born 1983), male, from Nanchong, Sichuan; master's student; main research interest: machine intelligence.
  • Supported by:
    This work is partially supported by the National Science and Technology Major Project (2015ZX01040101-002) and the National Natural Science Foundation of China (91338107).

Abstract: To address the problem that the Stack Auto-encoder (SA) is prone to overfitting, which reduces the accuracy of spam classification, a modified SA method based on dynamic dropout was proposed. Firstly, the particular characteristics of the spam classification problem were analyzed, and the dropout algorithm was introduced into SA to handle overfitting. Secondly, because the conventional dropout algorithm tends to leave some nodes in the inactive (dropped-out) state for a long time, an improved dynamic dropout algorithm was proposed: a dynamic function replaces the conventional static dropout rate with a dynamic dropout rate that gradually decreases as the number of training iterations grows. Finally, the dynamic dropout algorithm was used to improve the pretraining model of SA. The simulation results show that, compared with Support Vector Machine (SVM) and Back Propagation (BP) neural network, the modified SA reaches an average accuracy of 97.66%, and its Matthews correlation coefficient is higher than 89% on every dataset. Compared with the conventional SA, the Matthews correlation coefficient of the modified SA on the Error1~6 datasets is higher by 3.27%, 1.68%, 2.16%, 1.51%, 1.58% and 1.07%, respectively. The experimental results show that the modified SA based on dynamic dropout has higher classification accuracy and better stability.
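The abstract does not state the dynamic function itself. As a minimal sketch, assuming an exponential decay, a dropout rate that shrinks with the training step might look like the Python below. The names `dynamic_dropout_rate` and `apply_dropout`, the parameters `initial_rate` and `decay`, and the use of inverted dropout are all illustrative assumptions, not the authors' formulation.

```python
import numpy as np


def dynamic_dropout_rate(step, initial_rate=0.5, decay=0.01):
    """Hypothetical schedule: the dropout rate decays exponentially with the step.

    The paper only says the rate decreases with the number of iterations;
    the exponential form and both parameters are assumptions.
    """
    return initial_rate * np.exp(-decay * step)


def apply_dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if rate <= 0.0:
        return activations
    keep_prob = 1.0 - rate
    # True where a unit is kept, False where it is dropped.
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
hidden = np.ones((4, 8))  # stand-in for one layer's hidden activations

for step in (0, 100, 500):
    rate = dynamic_dropout_rate(step)
    dropped = apply_dropout(hidden, rate, rng)
    # Fewer units are zeroed on average as training proceeds.
    print(step, round(float(rate), 4), int((dropped == 0).sum()))
```

With a schedule like this, many units are dropped early in pretraining, when regularization matters most, and nearly all units participate in later iterations, which is one plausible way to keep nodes from staying inactive for long.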

Key words: deep learning, Stack Auto-encoder (SA), dropout, Support Vector Machine (SVM), spam, classification
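The Matthews correlation coefficient reported in the abstract is computed from the four cells of a binary confusion matrix. The sketch below uses the standard definition; the function name and the example counts are hypothetical and are not taken from the paper's experiments.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns a value in [-1, 1]; by convention 0.0 is returned when any
    marginal total is zero and the coefficient is undefined.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for a spam (positive) vs. ham (negative) test set.
score = matthews_corrcoef(tp=480, tn=490, fp=10, fn=20)
print(f"MCC = {score:.4f} ({score * 100:.2f}%)")
```

Unlike plain accuracy, this coefficient stays informative when spam and legitimate mail are unevenly represented, which is presumably why the abstract reports it alongside accuracy.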
