Journal of Computer Applications, 2011, Vol. 31, Issue (09): 2412-2416. DOI: 10.3724/SP.J.1087.2011.02412

• Database Technology •

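MapReduce-based Bayesian anti-spam filtering mechanism

TAO Yong-cai1, XUE Zheng-yuan1, SHI Lei2

  1. School of Information Engineering, Zhengzhou University, Zhengzhou 450001
  2. School of Information Engineering, Zhengzhou University
  • Received: 2011-03-21 Revised: 2011-05-30 Online: 2011-09-01 Published: 2011-09-01
  • Corresponding author: TAO Yong-cai
  • About the authors: TAO Yong-cai (1975-), male, born in Wuzhi, Henan, Ph.D.; research interests: distributed computing and high-performance computing;
    XUE Zheng-yuan (1989-), male, born in Nanyang, Henan, master's candidate; research interest: Web data mining;
    SHI Lei (1967-), male, born in Zhengzhou, Henan, professor, Ph.D.; research interests: high-performance computing and Web data mining.
  • Supported by:
    National 863 Program of China (2006AA01A115)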

Abstract: Bayesian spam filters offer strong classification ability and high accuracy, but the up-front training and learning on the mail corpus consume large amounts of system and network resources and reduce system efficiency. This paper proposes a MapReduce-based Bayesian spam filtering mechanism that, on the one hand, improves the traditional Bayesian filtering technique and, on the other, uses the MapReduce model's strength in massive data processing to optimize mail-corpus training and learning. Experiments show that, compared with the widely used traditional Bayesian, K-Nearest Neighbor (KNN), and Support Vector Machine (SVM) algorithms, the proposed mechanism maintains good recall, precision, and accuracy while reducing the cost of mail learning and classification and improving system execution efficiency.

Key words: spam e-mail, e-mail filtering, Bayesian algorithm, MapReduce, data processing
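
As an illustration of the training idea summarized in the abstract, the following is a minimal single-process sketch of how naive Bayes token counting can be phrased as a map step and a reduce step. It is not the authors' implementation: the abstract does not specify their improvements to the Bayesian filter, so the function names (`map_email`, `reduce_counts`, `train`, `classify`), the `__docs__` bookkeeping key, the tokenizer, and the Laplace smoothing are all assumptions made for this example.

```python
import math
import re
from collections import Counter
from functools import reduce

TOKEN = re.compile(r"[a-z0-9']+")
DOCS = "__docs__"  # hypothetical reserved key counting messages per label


def map_email(label, text):
    """Map step: emit one count per distinct token in a single message."""
    counts = Counter({(label, DOCS): 1})
    for token in set(TOKEN.findall(text.lower())):
        counts[(label, token)] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: sum the counts that share a (label, token) key."""
    return left + right


def train(corpus):
    """Simulate MapReduce in one process over (label, text) pairs."""
    mapped = (map_email(label, text) for label, text in corpus)
    return reduce(reduce_counts, mapped, Counter())


def log_score(model, label, tokens, total_docs):
    """Log prior plus log likelihoods of the tokens present in the message."""
    docs = model[(label, DOCS)]
    # Laplace-smoothed estimates (an assumed choice, not from the paper).
    score = math.log((docs + 1) / (total_docs + 2))
    for token in tokens:
        score += math.log((model[(label, token)] + 1) / (docs + 2))
    return score


def classify(model, text):
    """Return "spam" or "ham", whichever label has the higher log score."""
    tokens = set(TOKEN.findall(text.lower()))
    total = model[("spam", DOCS)] + model[("ham", DOCS)]
    spam = log_score(model, "spam", tokens, total)
    ham = log_score(model, "ham", tokens, total)
    return "spam" if spam > ham else "ham"
```

Because the reduce step is a plain sum over keys, the per-message map outputs can be computed independently and merged in any order, which is the property a MapReduce framework would exploit to spread training over many workers. A call such as `classify(train(corpus), "some message text")` exercises the whole pipeline on a list of `(label, text)` pairs.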

CLC Number: