Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (3): 770-777. DOI: 10.11772/j.issn.1001-9081.2021040791
Special Issue: Artificial Intelligence; 2021 CCF Conference on Artificial Intelligence (CCFAI 2021)
Jian ZHANG, Ke YAN, Xiang MA
Received: 2021-05-17
Revised: 2021-06-04
Accepted: 2021-06-09
Online: 2021-11-09
Published: 2022-03-10
Contact: Ke YAN
About author: ZHANG Jian, born in 1997 in Gao'an, Jiangxi, M. S. candidate. His research interests include text classification and sentiment recognition based on multi-task learning.
Jian ZHANG, Ke YAN, Xiang MA. Analysis of complex spam filtering algorithm based on neural network[J]. Journal of Computer Applications, 2022, 42(3): 770-777.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2021040791
Dataset | Text
---|---
SMS Spam | CALL 09090900040 & LISTEN TO EXTREME DIRTY LIVE CHAT GOING ON IN THE OFFICE RIGHT NOW TOTAL PRIVACY NO ONE KNOWS YOUR [sic] LISTENING 60P MIN
 | Hungry gay guys feeling hungry and up 4 it, now. Call 08718730555 just 10p/min. To stop texts call 08712460324 (10p/min)
 | (Bank of Granite issues Strong-Buy) EXPLOSIVE PICK FOR OUR MEMBERS *****UP OVER 300% *********** Nasdaq Symbol CDGT That is a $5.00 per..
Ads Spam | facial lines along with loose skin color could be enhanced by a single skin care product. Elliskin The idea is included with Supplements C as well as some various other needed nutritional requirements along with healthy antioxidants distinguished for cor
 | Albuminoidal is what ultimately conceals the age spots and collectively the discoloration of your skin. It additionally aids in adjustment the skin so on deflate wrinkles. On exploitation of times many of its users have according that they give the impre
Email Spam | Norton AD ATTENTION: This is a MUST for ALL Computer Users!!! *NEW - Special Package Deal!* ……
Tab. 2 Examples of spam that are difficult to identify
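The examples in Tab. 2 are hard for traditional filters precisely because they mix obfuscated spellings with surface cues such as premium-rate numbers and per-minute prices. As an illustration only (the patterns below are assumptions, not rules from this study), a naive rule-based check of the kind the paper's "traditional methods" generalize might look like:

```python
import re

# Hypothetical surface-cue patterns; illustrative, not from the original paper.
PREMIUM_NUMBER = re.compile(r"\b0[89]\d{8,9}\b")          # UK-style premium-rate number
RATE_HINT = re.compile(r"\b\d+p(?:/min| min)\b", re.IGNORECASE)  # e.g. "10p/min", "60P MIN"

def looks_like_sms_spam(text: str) -> bool:
    """Flag a message that contains a premium-rate number or a per-minute rate."""
    return bool(PREMIUM_NUMBER.search(text) or RATE_HINT.search(text))
```

Such hand-written rules catch the obvious SMS examples above but say nothing about the obfuscated Ads Spam texts, which is why the comparison turns to learned classifiers.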
Method | Classifier | Class | SMS Precision | SMS Recall | SMS F1 | Ads Precision | Ads Recall | Ads F1 | Email Precision | Email Recall | Email F1
---|---|---|---|---|---|---|---|---|---|---|---
Traditional methods | NB | Spam | 0.825 | 0.867 | 0.846 | 0.964 | 0.900 | 0.931 | 0.851 | 0.950 | 0.898
 | | Ham | 0.860 | 0.817 | 0.838 | 0.906 | 0.967 | 0.935 | 0.943 | 0.833 | 0.885
 | RF | Spam | 0.918 | 0.750 | 0.826 | 0.938 | 1.000 | 0.968 | 0.866 | 0.967 | 0.913
 | | Ham | 0.789 | 0.933 | 0.855 | 1.000 | 0.933 | 0.966 | 0.962 | 0.850 | 0.903
 | SVM | Spam | 0.942 | 0.817 | 0.875 | 0.935 | 0.967 | 0.951 | 0.965 | 0.917 | 0.940
 | | Ham | 0.838 | 0.950 | 0.891 | 0.966 | 0.933 | 0.949 | 0.921 | 0.967 | 0.943
 | LR | Spam | 0.940 | 0.783 | 0.855 | 0.951 | 0.967 | 0.959 | 0.921 | 0.967 | 0.943
 | | Ham | 0.814 | 0.950 | 0.877 | 0.966 | 0.950 | 0.958 | 0.965 | 0.917 | 0.940
 | DT | Spam | 0.843 | 0.717 | 0.775 | 0.966 | 0.933 | 0.949 | 0.786 | 0.917 | 0.846
 | | Ham | 0.754 | 0.867 | 0.806 | 0.935 | 0.967 | 0.951 | 0.900 | 0.750 | 0.818
Current mainstream methods | DPCNN | Spam | 0.965 | 0.917 | 0.940 | 0.952 | 1.000 | 0.976 | 0.930 | 0.883 | 0.906
 | | Ham | 0.921 | 0.967 | 0.943 | 1.000 | 0.950 | 0.974 | 0.889 | 0.933 | 0.911
 | BERT | Spam | 0.931 | 0.900 | 0.915 | 0.967 | 0.983 | 0.975 | 0.944 | 0.850 | 0.895
 | | Ham | 0.903 | 0.933 | 0.918 | 0.983 | 0.967 | 0.975 | 0.864 | 0.950 | 0.905
 | TinyBERT | Spam | 0.903 | 0.933 | 0.918 | 0.967 | 0.967 | 0.967 | 0.906 | 0.800 | 0.850
 | | Ham | 0.931 | 0.900 | 0.915 | 0.967 | 0.967 | 0.967 | 0.821 | 0.917 | 0.866
Neural network methods | TextCNN | Spam | 0.967 | 0.967 | 0.967 | 0.984 | 1.000 | 0.992 | 0.967 | 0.983 | 0.975
 | | Ham | 0.967 | 0.967 | 0.967 | 1.000 | 0.983 | 0.992 | 0.983 | 0.967 | 0.975
 | TextRNN | Spam | 0.983 | 0.983 | 0.983 | 0.967 | 0.983 | 0.975 | — | — | —
 | | Ham | 0.983 | 0.983 | 0.983 | 0.983 | 0.967 | 0.975 | — | — | —
 | TextRCNN | Spam | 0.952 | 0.983 | 0.967 | 0.968 | 1.000 | 0.984 | 0.968 | 1.000 | 0.984
 | | Ham | 0.983 | 0.950 | 0.966 | 1.000 | 0.967 | 0.983 | 1.000 | 0.967 | 0.983
Tab. 3 Classification results of traditional methods, current methods, and neural network methods on three datasets
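The per-class scores in Tab. 3 follow the standard precision/recall/F1 definitions. A minimal pure-Python sketch (not the authors' evaluation code) reproduces the Naive Bayes SMS row, with counts inferred from the rates in Tab. 4 under the assumption of a balanced 60-message-per-class test split:

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1-score for one class from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# NB on the SMS dataset, spam class: 52 spam correctly flagged,
# 11 ham misfiled as spam, 8 spam missed (assumed 60/60 split).
p, r, f = prf(tp=52, fp=11, fn=8)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.825 0.867 0.846
```

The rounded values match the NB/SMS/Spam row of Tab. 3, which is a useful sanity check on how the table was read.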
Method | Classifier | True class | SMS Spam | SMS Ham | SMS AUC | Ads Spam | Ads Ham | Ads AUC | Email Spam | Email Ham | Email AUC
---|---|---|---|---|---|---|---|---|---|---|---
Traditional methods | NB | Spam | 0.867 | 0.133 | 0.842 | 0.900 | 0.100 | 0.933 | 0.950 | 0.050 | 0.892
 | | Ham | 0.183 | 0.817 | | 0.033 | 0.967 | | 0.167 | 0.833 |
 | RF | Spam | 0.750 | 0.250 | 0.842 | 1.000 | 0.000 | 0.967 | 0.967 | 0.033 | 0.908
 | | Ham | 0.067 | 0.933 | | 0.067 | 0.933 | | 0.150 | 0.850 |
 | SVM | Spam | 0.817 | 0.183 | 0.883 | 0.967 | 0.033 | 0.950 | 0.917 | 0.083 | 0.942
 | | Ham | 0.050 | 0.950 | | 0.067 | 0.933 | | 0.033 | 0.967 |
 | LR | Spam | 0.783 | 0.217 | 0.867 | 0.967 | 0.033 | 0.958 | 0.967 | 0.033 | 0.942
 | | Ham | 0.050 | 0.950 | | 0.050 | 0.950 | | 0.083 | 0.917 |
 | DT | Spam | 0.717 | 0.283 | 0.792 | 0.933 | 0.067 | 0.950 | 0.917 | 0.083 | 0.833
 | | Ham | 0.133 | 0.867 | | 0.033 | 0.967 | | 0.250 | 0.750 |
Current mainstream methods | DPCNN | Spam | 0.917 | 0.083 | 0.942 | 1.000 | 0.000 | 0.975 | 0.883 | 0.117 | 0.908
 | | Ham | 0.033 | 0.967 | | 0.050 | 0.950 | | 0.067 | 0.933 |
 | BERT | Spam | 0.900 | 0.100 | 0.917 | 0.983 | 0.017 | 0.975 | 0.850 | 0.150 | 0.900
 | | Ham | 0.067 | 0.933 | | 0.033 | 0.967 | | 0.050 | 0.950 |
 | TinyBERT | Spam | 0.933 | 0.067 | 0.917 | 0.967 | 0.033 | 0.967 | 0.800 | 0.200 | 0.858
 | | Ham | 0.100 | 0.900 | | 0.033 | 0.967 | | 0.083 | 0.917 |
Neural network methods | TextCNN | Spam | 0.967 | 0.033 | 0.967 | 1.000 | 0.000 | 0.992 | 0.983 | 0.017 | 0.975
 | | Ham | 0.033 | 0.967 | | 0.017 | 0.983 | | 0.033 | 0.967 |
 | TextRNN | Spam | 0.983 | 0.017 | 0.983 | 0.983 | 0.017 | 0.975 | — | — | —
 | | Ham | 0.017 | 0.983 | | 0.033 | 0.967 | | — | — |
 | TextRCNN | Spam | 0.983 | 0.017 | 0.967 | 1.000 | 0.000 | 0.983 | 1.000 | 0.000 | 0.983
 | | Ham | 0.050 | 0.950 | | 0.033 | 0.967 | | 0.033 | 0.967 |
Tab. 4 Confusion matrices and AUC values of traditional methods, current methods, and neural network methods on three datasets
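Each AUC in Tab. 4 coincides with the balanced accuracy (TPR + TNR)/2 of the corresponding row-normalized confusion matrix, which is the AUC of a classifier evaluated at a single hard-label operating point. The following small check is an observation about the tabulated numbers, not the authors' evaluation code, and again assumes a balanced 60-message-per-class test split:

```python
from fractions import Fraction

def hard_label_auc(tp: int, fn: int, fp: int, tn: int) -> float:
    """AUC of one hard-threshold classifier: (TPR + TNR) / 2."""
    tpr = Fraction(tp, tp + fn)   # spam correctly flagged
    tnr = Fraction(tn, tn + fp)   # ham correctly passed
    return float((tpr + tnr) / 2)

# NB on SMS: spam row (52, 8), ham row (11, 49) under the assumed 60/60 split.
print(round(hard_label_auc(52, 8, 11, 49), 3))  # 0.842, as in Tab. 4
```

Exact fractions are used because rounding the rates first (e.g. 0.867 and 0.817) can shift the third decimal place away from the tabulated value.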
Method | Classifier | Running time (SMS) | Running time (Ads) | Running time (Email)
---|---|---|---|---
Current mainstream methods | DPCNN | 3.4 | 37.1 | 1 098.1
 | BERT | 61.6 | 358.1 | 320.0
 | TinyBERT | 17.8 | 77.7 | 78.1
Neural network methods | TextCNN | 2.2 | 6.0 | 138.4
 | TextRNN | 1.9 | 11.3 | —
 | TextRCNN | 1.9 | 12.7 | 869.0
Tab. 5 Running times of traditional methods, current methods and neural network methods on three datasets
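Running-time figures such as those in Tab. 5 are typically wall-clock measurements over a fixed batch of test messages. A minimal harness of that kind, as an assumed sketch rather than the paper's actual measurement code (the units of Tab. 5 are not stated in this excerpt):

```python
import time

def run_time(predict, samples) -> float:
    """Wall-clock time to classify a batch of samples with `predict`."""
    start = time.perf_counter()
    for sample in samples:
        predict(sample)
    return time.perf_counter() - start

# Hypothetical stand-in classifier; any model's predict function fits here.
elapsed = run_time(lambda text: "spam" in text, ["example message"] * 1000)
```

Note that such timings conflate model inference with preprocessing and hardware effects, which is one reason BERT's time does not scale monotonically with dataset size in Tab. 5.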