Journal of Computer Applications ›› 2011, Vol. 31 ›› Issue (03): 698-701. DOI: 10.3724/SP.J.1087.2011.00698

• Database Technology •

Word sequence kernel applied to spam filtering

CHEN Xiao-li, LIU Pei-yu

  1. School of Information Science and Engineering, Shandong Normal University, Jinan, Shandong 250014, China; 2. Shandong Provincial Key Laboratory for Distributed Computer Software Novel Technology, Shandong Normal University, Jinan, Shandong 250014, China
  • Received: 2010-09-10 Revised: 2010-11-02 Online: 2011-03-03 Published: 2011-03-01
  • Contact: CHEN Xiao-li
  • About the authors: CHEN Xiao-li (born 1986), male, from Linqu, Shandong, master's student; main research interests: network information security and information filtering. LIU Pei-yu (born 1960), male, from Weifang, Shandong, professor and doctoral supervisor; main research interests: computer network information security, network system planning, and information filtering.
  • Supported by:
    National Natural Science Foundation of China (60873247); Shandong Province High-Tech Independent Innovation Special Engineering Project (2008ZZ28)

Abstract: The kernel functions commonly used in Support Vector Machines (SVM) ignore the structure of the text, so a large amount of semantic information is lost. To address this problem, a Word Sequence Kernel (WSK) based on a category dependence measure was proposed and applied to spam filtering. Firstly, the text features of each e-mail were extracted and the category dependence measure of each feature was calculated; then the SVM was trained with the word sequence kernel as its kernel function, and during training the decay factor of each word was computed from its dependence measure; finally, the trained SVM filter was used to classify e-mails. The experimental results show that the improved word sequence kernel achieves higher classification accuracy than the commonly used kernels and the string subsequence kernel, and thus improves the accuracy of spam filtering.

Key words: Support Vector Machine (SVM), Word Sequence Kernel (WSK), dependence measure, spam filtering
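To make the idea of a word sequence kernel with per-word decay factors more concrete, the following is a minimal Python sketch, not the authors' implementation. It computes the kernel through an explicit feature map over word subsequences of length `n`, which is only practical for short texts; efficient dynamic-programming formulations exist but are not shown. The function names, the `decay` dictionary, and the `default_decay` value are all assumptions made for illustration. In particular, the abstract does not state how a decay factor is derived from the dependence measure, so the sketch simply takes the per-word factors as given input.

```python
from collections import defaultdict
from itertools import combinations
from math import prod, sqrt


def wsk_features(words, n, decay, default_decay=0.5):
    """Map each word subsequence of length n (n >= 1) to its gap-weighted score.

    `decay` is a hypothetical dict word -> decay factor in (0, 1]; words
    missing from it fall back to `default_decay` (an assumed value).
    """
    lam = [decay.get(w, default_decay) for w in words]
    features = defaultdict(float)
    for idx in combinations(range(len(words)), n):
        # Every word spanned by this occurrence, matched or skipped,
        # contributes its own decay factor to the weight.
        weight = prod(lam[idx[0]: idx[-1] + 1])
        features[tuple(words[i] for i in idx)] += weight
    return features


def wsk(s, t, n, decay, default_decay=0.5):
    """Word sequence kernel: inner product of the two feature maps."""
    fs = wsk_features(s, n, decay, default_decay)
    ft = wsk_features(t, n, decay, default_decay)
    return sum(v * ft[u] for u, v in fs.items() if u in ft)


def wsk_normalized(s, t, n, decay, default_decay=0.5):
    """Length-normalized kernel, so that K(s, s) == 1 for non-empty maps."""
    denom = sqrt(wsk(s, s, n, decay, default_decay)
                 * wsk(t, t, n, decay, default_decay))
    return wsk(s, t, n, decay, default_decay) / denom if denom else 0.0


if __name__ == "__main__":
    a = ["free", "money", "now"]
    b = ["free", "now"]
    # Uniform decay: the gap over "money" is penalized by 0.5.
    print(wsk(a, b, 2, {}))
    # Giving "money" a decay of 1.0 removes the penalty for skipping it.
    print(wsk(a, b, 2, {"money": 1.0}))
```

A Gram matrix built from `wsk_normalized` over tokenized e-mails could then be handed to any SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`.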

CLC number: