Journal of Computer Applications ›› 2013, Vol. 33 ›› Issue (05): 1334-1337. DOI: 10.3724/SP.J.1087.2013.01334

• Database Technology •

Spam short message classifier model based on feature words

张永军,刘金岭   

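  1. Faculty of Computer Engineering, Huaiyin Institute of Technology, Huai'an, Jiangsu 223003, China
  • Received: 2012-11-08 Revised: 2012-12-20 Published: 2013-05-01 Online: 2013-05-08
  • Corresponding author: ZHANG Yongjun
  • About the authors: ZHANG Yongjun (1978-), male, born in Yangzhou, Jiangsu, lecturer, M.S.; main research interest: Chinese information processing. LIU Jinling (1963-), male (Hui ethnicity), born in Cangzhou, Hebei, professor; main research interest: Chinese information processing.
  • Supported by:

    National Spark Program of China (2011GA690190)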

Spam short message classifier model based on word terms

ZHANG Yongjun,LIU Jinling   

  1. Faculty of Computer Engineering, Huaiyin Institute of Technology, Huai'an, Jiangsu 223003, China
  • Received: 2012-11-08 Revised: 2012-12-20 Online: 2013-05-08 Published: 2013-05-01
  • Contact: ZHANG Yongjun

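Abstract: To address the problem of spam short message classification, a method for computing the category weight of a word is proposed, and on this basis dimension reduction is applied to obtain a set of classification feature words. The concept of short message category membership is introduced, and classification is performed by computing a message's category membership and its category membership density. To improve classification accuracy, the category weights of the feature words are further refined by iterative learning, which keeps the weight values reasonable. Experimental results show that the classification model achieves good classification performance with low time complexity.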

Keywords: spam short message, feature word, text classification, dimensionality reduction, weight learning

Abstract: A classifier model based on word terms was proposed to classify Spam Short Messages (SSM). The concept of word-category weight was introduced to represent how strongly a word indicates the category an SSM belongs to, and a method was put forward to calculate this weight. Based on the word-category weight, dimension reduction was carried out to obtain the set of word terms used as features. The Short message-Category Membership Value (SCMV) was used to express the degree to which an SSM belongs to a category, and a classification algorithm was implemented by computing the SCMV and the SCMV density. To improve classification accuracy and make the word-category weights more reasonable, an iterative word-weight learning procedure was performed. The experimental results show that the proposed model is superior to other classification methods in terms of classification performance and time complexity.

Key words: spam short message, word term, text classification, dimensionality reduction, weight learning
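
The pipeline outlined in the abstract (word-category weights, dimension reduction, membership value, membership density, iterative weight learning) can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything below is an assumption made for illustration only: the weight is taken to be the share of a word's containing messages that carry a given label, the membership value is a sum of feature-word weights, the density divides that sum by message length, and the learning step nudges weights on misclassified messages. The function names, the `threshold` and `rate` parameters, and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict


def word_category_weights(messages):
    """Assumed weight: fraction of messages containing the word that have each label."""
    doc_freq = defaultdict(Counter)
    for tokens, label in messages:
        for word in set(tokens):
            doc_freq[word][label] += 1
    weights = {}
    for word, counts in doc_freq.items():
        total = sum(counts.values())
        weights[word] = {label: n / total for label, n in counts.items()}
    return weights


def select_feature_words(weights, threshold=0.7):
    """Dimension reduction (assumed rule): keep words whose best weight reaches threshold."""
    return {w: dict(cw) for w, cw in weights.items() if max(cw.values()) >= threshold}


def membership(tokens, features, label):
    """Assumed membership value: sum of the label weights of the feature words present."""
    return sum(features[w].get(label, 0.0) for w in tokens if w in features)


def membership_density(tokens, features, label):
    """Assumed density: membership value normalised by message length."""
    return membership(tokens, features, label) / len(tokens) if tokens else 0.0


def classify(tokens, features, labels=("ham", "spam")):
    """Pick the label with the highest density; ties fall back to the first label."""
    return max(labels, key=lambda lab: membership_density(tokens, features, lab))


def learn_weights(messages, features, rate=0.1, epochs=5):
    """Assumed learning rule: on each error, raise true-label weights and lower predicted-label weights."""
    labels = sorted({label for _, label in messages})
    for _ in range(epochs):
        errors = 0
        for tokens, label in messages:
            predicted = classify(tokens, features, labels)
            if predicted == label:
                continue
            errors += 1
            for word in set(tokens):
                if word in features:
                    cw = features[word]
                    cw[label] = cw.get(label, 0.0) + rate
                    cw[predicted] = max(0.0, cw.get(predicted, 0.0) - rate)
        if errors == 0:  # stop early once the training set is classified correctly
            break
    return features


# Toy, made-up training data: (tokenised message, label).
train = [
    (["free", "prize", "call", "now"], "spam"),
    (["win", "free", "cash"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["call", "me", "at", "home"], "ham"),
]
features = learn_weights(train, select_feature_words(word_category_weights(train)))
print(classify(["free", "cash", "now"], features))
print(classify(["meeting", "at", "home"], features))
```

In this sketch the word "call" appears equally often in both classes, so it falls below the threshold and is dropped during dimension reduction. Real Chinese short messages would first need word segmentation, which is outside the scope of this example.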

CLC number: