Journal of Computer Applications ›› 2009, Vol. 29 ›› Issue (12): 3303-3306.

• Database and Data Mining •

Text categorization method based on probability standard deviation regulated by distribution weights

焦庆争 (Jiao Qingzheng)1, 蔚承建2

  1. Information Management Center, Anhui Normal University
  2. Nanjing University of Technology
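  • Received: 2009-06-24 Revised: 2009-08-06 Online: 2009-12-10 Published: 2009-12-01
  • Corresponding author: Jiao Qingzheng
  • Supported by:
    National Natural Science Foundation of China; Key Project of Provincial Natural Science Research of Anhui Higher Education Institutions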

Text categorization approach based on probability standard deviation with evaluation of distribution information

  • Received: 2009-06-24 Revised: 2009-08-06 Online: 2009-12-10 Published: 2009-12-01
  • Contact: Jiao Qingzheng

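Abstract: For text categorization, an efficient linear text classifier that requires no feature selection is designed, in which the probability standard deviation of each feature is regulated by a weight estimated from the feature's distribution. The basic idea is to use the probability standard deviation of a feature to quantify its dispersion across document classes and to take this value as the feature's base weight. In addition, starting from the Beta distribution of the posterior probability, a probability certainty density function is used to evaluate the feature's distribution information within each class, which yields a distribution weight. The distribution weight regulates the base weight to give the final feature weight, from which the linear text classifier is built. Comparative experiments on three corpora (20Newsgroup, the Fudan Chinese categorization corpus, and Reuters-21578) show that the new algorithm has a significant advantage in classification performance over traditional algorithms and is stable, efficient, and practical, making it suitable for large-scale text categorization tasks.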
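
The weighting scheme summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions, since the abstract gives no formulas: the per-class feature probability is taken as the posterior mean of a Beta distribution under a uniform prior, the base weight is the population standard deviation of that probability across classes, and the distribution weight is simply the same posterior mean, used here as a stand-in for the paper's probability certainty density function. The names `train` and `classify` are hypothetical.

```python
from collections import Counter, defaultdict
from statistics import pstdev


def train(docs):
    """docs: list of (tokens, label) pairs. Returns {label: {feature: weight}}."""
    n_docs = Counter(label for _, label in docs)   # number of documents per class
    df = defaultdict(Counter)                      # df[label][feature] = document frequency
    for tokens, label in docs:
        df[label].update(set(tokens))
    labels = sorted(n_docs)
    vocab = {t for tokens, _ in docs for t in tokens}

    weights = {label: {} for label in labels}
    for t in vocab:
        # Assumption: posterior mean of Beta(df + 1, n - df + 1), i.e. a uniform prior.
        post = {c: (df[c][t] + 1) / (n_docs[c] + 2) for c in labels}
        # Base weight: dispersion of the feature probability across classes.
        base = pstdev(list(post.values()))
        for c in labels:
            # Assumption: the posterior mean stands in for the distribution weight.
            weights[c][t] = base * post[c]
    return weights


def classify(weights, tokens):
    """Linear scoring: sum the class-specific weights of the document's tokens."""
    scores = {c: sum(w.get(t, 0.0) for t in tokens) for c, w in weights.items()}
    return max(scores, key=scores.get)


docs = [
    (["the", "goal", "match", "team"], "sport"),
    (["the", "team", "score", "match"], "sport"),
    (["the", "cpu", "code", "bug"], "tech"),
    (["the", "code", "compile", "cpu"], "tech"),
]
weights = train(docs)
print(classify(weights, ["the", "match", "goal"]))
print(classify(weights, ["the", "code", "bug"]))
```

In this sketch a feature that is spread evenly over all classes, such as "the", receives a base weight of zero and so drops out of the score without an explicit feature selection step, which is one plausible reading of the "no feature selection" claim.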

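Key words: text categorization, feature probability standard deviation, feature dispersion, feature distribution, Beta probability density function, natural language processing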

Abstract: For text categorization, an approach is introduced for constructing a simple linear classifier in which each feature weight is computed from the probability standard deviation of the feature, used as a base weight, and regulated by parameters describing the feature's distribution. In the weighting process, the probability standard deviation serves as the base weight and quantifies the dispersion of a feature, while the distribution parameters are estimated with Beta probability density functions to measure the feature's distribution information. The approach was evaluated on 20Newsgroup, the Fudan Chinese evaluation data collection, and Reuters-21578. The experimental results show that the method significantly improves text categorization performance and is simple, stable, and suitable for large-scale text categorization.

Key words: text categorization, probability standard deviation of feature, dispersion degree of feature, feature distribution, Beta probability density function, natural language processing