Journal of Computer Applications ›› 2009, Vol. 29 ›› Issue (12): 3303-3306.
• Database and data mining •
焦庆争1,蔚承建2
Abstract: For text categorization, this paper introduces a simple linear classifier in which each feature weight is the probability standard deviation of the feature, used as a base weight, regulated by parameters describing the feature's distribution. The probability standard deviation serves as the base weight because it quantifies the degree of dispersion of a feature, while the distribution parameters are estimated with Beta probability density functions to capture how the feature is distributed. The method was evaluated on three corpora: 20Newsgroup, the Fudan Chinese evaluation data collection, and Reuters-21578. The experimental results show that the method significantly improves text categorization performance and that it is simple, stable, and suitable for large-scale text categorization.
Key words: text categorization, probability standard deviation of feature, dispersion degree of feature, feature distribution, Beta probability density function, natural language processing
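Abstract (translated from the Chinese): For text categorization, an efficient linear text classifier that requires no feature selection was designed by regulating the probability standard deviation of features with weights estimated from the feature distribution. The basic idea is to use the probability standard deviation of a feature to quantify its dispersion across document classes and to take this value as the feature's base weight. In addition, starting from the Beta distribution function of the posterior probability and applying a probability certainty density function, the distribution of the feature within each category is evaluated to obtain a feature distribution weight. Regulating the base weight with this distribution weight gives the final feature weight, which yields a linear text classifier. Comparative experiments on three corpora, 20Newsgroup, the Fudan Chinese classification corpus, and Reuters-21578, show that the new algorithm has a significant performance advantage over traditional algorithms and is stable, efficient, and practical, making it suitable for large-scale text categorization tasks.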
关键词: 文本分类, 特征概率标准差, 特征离散度, 特征分布, Beta概率密度函数, 自然语言处理
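The weighting idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything specific here is an assumption made for illustration: P(t|c) is taken as the fraction of class-c documents containing term t, the base weight is the population standard deviation of P(t|c) across classes, and the distribution weight is the mean of a Beta(k+1, n-k+1) posterior (uniform prior) over the share of t-bearing documents that belong to class c. The function names `train` and `classify` are hypothetical, and this is not the authors' implementation.

```python
from statistics import pstdev


def train(docs, labels):
    """docs: list of token lists; labels: parallel list of class names.
    Returns {class: {term: weight}} for a linear classifier."""
    classes = sorted(set(labels))
    n_docs = {c: labels.count(c) for c in classes}

    # df[c][t] = number of class-c documents that contain term t
    df = {c: {} for c in classes}
    for tokens, c in zip(docs, labels):
        for t in set(tokens):
            df[c][t] = df[c].get(t, 0) + 1

    vocab = {t for c in classes for t in df[c]}
    weights = {c: {} for c in classes}
    for t in vocab:
        # Assumed base weight: dispersion of P(t | c) across the classes.
        p = [df[c].get(t, 0) / n_docs[c] for c in classes]
        base = pstdev(p)
        n_t = sum(df[c].get(t, 0) for c in classes)
        for c in classes:
            k = df[c].get(t, 0)
            # Assumed distribution weight: mean of the Beta(k+1, n_t-k+1)
            # posterior over the share of t-bearing documents in class c.
            dist = (k + 1) / (n_t + 2)
            weights[c][t] = base * dist
    return weights


def classify(weights, tokens):
    """Score each class as the sum of weights of the terms present."""
    present = set(tokens)
    scores = {c: sum(w.get(t, 0.0) for t in present)
              for c, w in weights.items()}
    return max(scores, key=scores.get)


# Toy data, invented for illustration only.
docs = [
    ["the", "goal", "match", "team"],
    ["the", "team", "score", "goal"],
    ["the", "stock", "market", "price"],
    ["the", "market", "trade", "price"],
]
labels = ["sport", "sport", "finance", "finance"]

weights = train(docs, labels)
print(classify(weights, ["the", "goal", "team"]))
print(classify(weights, ["the", "price", "market"]))
```

In this sketch a term that occurs equally often in every class (here "the") has zero dispersion and therefore zero weight, which is one way a classifier of this kind could do without a separate feature selection step.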
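焦庆争, 蔚承建. 分布权值调节概率标准差的文本分类方法 [Text categorization method using probability standard deviation regulated by distribution weights][J]. 计算机应用 [Journal of Computer Applications], 2009, 29(12): 3303-3306.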
URL: http://www.joca.cn/EN/
http://www.joca.cn/EN/Y2009/V29/I12/3303