Journal of Computer Applications ›› 2014, Vol. 34 ›› Issue (11): 3279-3282. DOI: 10.11772/j.issn.1001-9081.2014.11.3279

• Artificial Intelligence •

Improved information gain text feature selection algorithm based on word frequency information

SHI Hui1, JIA Daiping2, MIAO Pei1

  1. School of Information Science and Engineering, Shandong Normal University, Jinan, Shandong 250014, China
  2. School of Computer Science and Technology, Shandong Institute of Business and Technology, Yantai, Shandong 264005, China
  • Received: 2014-05-16 Revised: 2014-06-26 Online: 2014-11-01 Published: 2014-12-01
  • Contact: SHI Hui
  • About the authors: SHI Hui (born 1989), female, from Linyi, Shandong, master's student; main research interest: data mining. JIA Daiping (born 1966), male, from Shucheng, Anhui, professor, CCF senior member; main research interests: dynamic data and massive data. MIAO Pei (born 1991), female, from Heze, Shandong, master's student; main research interest: differential privacy protection.
  • Supported by:

    National Natural Science Foundation of China

Abstract:

To overcome the traditional Information Gain (IG) algorithm's insufficient consideration of how often a feature term occurs, an improved IG text feature selection algorithm based on word frequency information was proposed, building on a detailed analysis of the traditional algorithm and related improved algorithms. The parameters of the traditional IG algorithm were modified with respect to three aspects: the frequency of a feature within a category, its positional distribution within the category, and its distribution among different categories, so that feature frequency information is fully used. Text categorization experiments show that the classification accuracy of the proposed algorithm is clearly higher than that of both the traditional IG algorithm and a weighted improved IG algorithm.
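For orientation, the limitation the abstract refers to can be illustrated with a minimal sketch of the traditional, document-presence-based IG score. This is the standard textbook formulation, IG(t) = H(C) - P(t) H(C|t) - P(¬t) H(C|¬t), and not the authors' improved algorithm, whose corrected parameters are not given on this page. The function names `entropy` and `information_gain` and the toy corpus are assumptions made for illustration only.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Traditional IG of `term` over tokenized documents (hypothetical helper).

    Only the presence or absence of the term in each document is used;
    the number of occurrences inside a document is ignored.
    """
    classes = sorted(set(labels))
    with_t = Counter(lab for doc, lab in zip(docs, labels) if term in doc)
    without_t = Counter(lab for doc, lab in zip(docs, labels) if term not in doc)

    n = len(docs)
    n_t = sum(with_t.values())
    n_not = n - n_t

    h_c = entropy([labels.count(c) for c in classes])      # H(C)
    h_t = entropy([with_t[c] for c in classes])            # H(C | t present)
    h_not = entropy([without_t[c] for c in classes])       # H(C | t absent)
    return h_c - (n_t / n) * h_t - (n_not / n) * h_not


# Toy corpus (made up for this sketch): two classes, two documents each.
labels = ["sport", "sport", "politics", "politics"]
docs = [["goal", "match"], ["goal", "team"], ["vote", "law"], ["law", "tax"]]
# Same corpus, but "goal" is repeated several times in the first document.
docs_repeated = [["goal", "goal", "goal", "match"]] + docs[1:]

print(information_gain(docs, labels, "goal"))
print(information_gain(docs_repeated, labels, "goal"))
```

Both calls return the same score, because repeating "goal" inside a document does not change which documents contain it. This is the kind of frequency information that the proposed algorithm is described as taking into account.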

CLC Number: