Application of stacked denoising autoencoder in spamming filtering

doi:10.11772/j.issn.1001-9081.2015.11.3256

Journal of Computer Applications, 2015, Vol. 35, Issue (11): 3256-3260.

LI Yantao, FENG Weisen

College of Computer Science, Sichuan University, Chengdu Sichuan 610065, China

Received: 2015-05-29 Revised: 2015-07-05 Online: 2015-11-13
Corresponding author: FENG Weisen (born 1962), male, from Meishan, Sichuan; associate professor, master's degree, CCF member. Research interests: data mining, artificial intelligence.
About the author: LI Yantao (born 1987), male, from Bozhou, Anhui; M.S. candidate, CCF member. Research interests: machine learning, data mining.

Abstract

To address the steadily growing volume of spam, an approach to spam filtering based on the Stacked Denoising Autoencoder (SDA) was proposed. First, the SDA was pre-trained greedily, layer by layer, on an unlabeled data set with an unsupervised algorithm that minimizes the reconstruction error, yielding a more abstract and robust feature representation of the raw data. Then a classifier was added on top of the SDA, and the pre-trained network parameters were fine-tuned on a labeled data set with a supervised algorithm that minimizes the classification error, giving the optimized model. Finally, the trained SDA was tested on six different public corpora. Precision, recall and the more balanced Matthews Correlation Coefficient (MCC), which ranges from -1 to 1 with 1 indicating perfect prediction, were used as the evaluation measures, and the SDA was compared with Support Vector Machine (SVM), the Bayes approach and Deep Belief Network (DBN). The experimental results indicate that the SDA-based spam filter has higher precision and better robustness: it attains the best average performance, with precision above 95% and MCC above 0.88 in every case.

Key words: Stacked Denoising Autoencoder (SDA), spam, classification, Support Vector Machine (SVM), Bayesian approach
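The pre-training and evaluation steps summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `DenoisingAutoencoder`, the helper `pretrain_stack`, the toy vectors in `TOY_DATA`, the masking-noise corruption, the tied weights, the cross-entropy reconstruction loss and all hyperparameters are assumptions chosen for illustration. The supervised fine-tuning stage and the top-level classifier are omitted; only the greedy layer-wise pre-training and the evaluation measures are shown.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class DenoisingAutoencoder:
    """One layer with tied weights: decoding uses the transpose of the encoder matrix."""

    def __init__(self, n_in, n_hidden, rng):
        self.W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b = [0.0] * n_hidden  # encoder bias
        self.c = [0.0] * n_in      # decoder bias

    def encode(self, x):
        return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
                for row, bj in zip(self.W, self.b)]

    def decode(self, h):
        return [sigmoid(sum(self.W[j][i] * h[j] for j in range(len(h))) + ci)
                for i, ci in enumerate(self.c)]

    def train_step(self, x, noise, lr, rng):
        """One SGD step: reconstruct the clean x from a randomly masked copy."""
        x_tilde = [0.0 if rng.random() < noise else xi for xi in x]  # masking noise
        h = self.encode(x_tilde)
        z = self.decode(h)
        # Cross-entropy reconstruction loss (an assumed choice): its gradient
        # with respect to the decoder pre-activation is simply z - x.
        dz = [zi - xi for zi, xi in zip(z, x)]
        dh = [sum(self.W[j][i] * dz[i] for i in range(len(x))) * h[j] * (1.0 - h[j])
              for j in range(len(h))]
        for j in range(len(h)):
            for i in range(len(x)):
                # Tied weights collect gradient from both the encoder and the decoder.
                self.W[j][i] -= lr * (dh[j] * x_tilde[i] + h[j] * dz[i])
            self.b[j] -= lr * dh[j]
        for i in range(len(x)):
            self.c[i] -= lr * dz[i]


def reconstruction_error(dae, data):
    """Mean squared error between clean inputs and their reconstructions."""
    total = 0.0
    for x in data:
        z = dae.decode(dae.encode(x))
        total += sum((zi - xi) ** 2 for zi, xi in zip(z, x))
    return total / (len(data) * len(data[0]))


def pretrain_stack(data, layer_sizes, noise=0.3, lr=0.5, epochs=50, seed=0):
    """Greedy layer-wise pre-training: each layer is trained on the codes of the previous one."""
    rng = random.Random(seed)
    layers, inputs = [], data
    for n_hidden in layer_sizes:
        dae = DenoisingAutoencoder(len(inputs[0]), n_hidden, rng)
        for _ in range(epochs):
            for x in inputs:
                dae.train_step(x, noise, lr, rng)
        layers.append(dae)
        inputs = [dae.encode(x) for x in inputs]  # uncorrupted codes feed the next layer
    return layers


def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical binary term-presence vectors standing in for e-mail features.
TOY_DATA = [
    [1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0], [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 0], [0, 0, 0, 1, 0, 1], [0, 0, 0, 0, 1, 1],
]

if __name__ == "__main__":
    layers = pretrain_stack(TOY_DATA, [4, 2])
    print("first-layer reconstruction error:", reconstruction_error(layers[0], TOY_DATA))
    print("MCC for tp=90, tn=95, fp=5, fn=10:", mcc(90, 95, 5, 10))
```

In a full pipeline of the kind the abstract describes, a classifier would be placed on top of the last encoder and all layers would then be fine-tuned on labeled data by minimizing the classification error. MCC uses all four confusion-matrix counts, which is why it is described as the more balanced measure next to precision and recall.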

CLC number: TP393.098

LI Yantao, FENG Weisen. Application of stacked denoising autoencoder in spamming filtering[J]. Journal of Computer Applications, 2015, 35(11): 3256-3260.

