Text categorization approach based on probability standard deviation with evaluation of distribution information

Journal of Computer Applications, 2009, Vol. 29, Issue 12: 3303-3306.

Database and data mining

Received: 2009-06-24 Revised: 2009-08-06 Online: 2009-12-10 Published: 2009-12-01
Contact: Jiao Qingzheng

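Original Chinese title, translated: Text categorization method in which distribution weights regulate the probability standard deviation

Jiao Qingzheng (焦庆争)¹, 蔚承建²

1. Information Management Center, Anhui Normal University
2. Nanjing University of Technology

Supported by:
National Natural Science Foundation of China; Key Project of Provincial Natural Science Research of Anhui Higher Education Institutions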

Abstract

Abstract: For text categorization, an approach is introduced for constructing a simple linear classifier in which each feature weight is the probability standard deviation of the feature, used as a baseline weight, regulated by a parameter describing the feature's distribution. The probability standard deviation serves as the base weight and quantifies the dispersion degree of the feature, while the distribution parameter is evaluated with Beta probability density functions to measure the feature's distribution information. The 20Newsgroup, Fudan Chinese evaluation data collection, and Reuters-21578 corpora were used to evaluate the effectiveness of the proposed technique. The experimental results show that the method significantly improves text categorization performance and is simple, stable, and suitable for large-scale text categorization.

Key words: text categorization, probability standard deviation of feature, dispersion degree of feature, feature distribution, Beta probability density function, natural language processing

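Abstract (translated from the Chinese): For the text categorization problem, an efficient linear text classifier that requires no feature selection is designed, in which a weight obtained by evaluating the feature distribution regulates the probability standard deviation of the feature. The basic idea is to use the probability standard deviation of a feature to quantify its dispersion degree across document classes and to take this as the feature's base weight. In addition, building on the Beta distribution function of the posterior probability, a probability certainty density function is used to evaluate the feature's distribution information within each class, which yields a feature distribution weight. This distribution weight regulates the base weight to give the final feature weight, from which the linear text classifier is built. Comparative experiments on three corpora (20Newsgroup, the Fudan Chinese classification corpus, and Reuters-21578) show that the new algorithm has a significant classification performance advantage over traditional algorithms and is stable, efficient, and practical, making it suitable for large-scale text categorization tasks.

Key words (translated from the Chinese): text categorization, probability standard deviation of feature, dispersion degree of feature, feature distribution, Beta probability density function, natural language processing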
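
The abstract gives the shape of the method but not its formulas, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that the probability standard deviation of a feature is the standard deviation, across classes, of the fraction of documents in each class that contain the feature, and that the distribution weight is the mean of a Beta(r + 1, s + 1) posterior, where r counts documents containing the feature inside the class and s counts them outside it. The function names `train` and `classify` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels):
    """Return {class: {term: weight}} for a linear classifier (illustrative only)."""
    classes = sorted(set(labels))
    n_docs = Counter(labels)       # number of documents per class
    df = defaultdict(Counter)      # df[c][t]: documents of class c containing term t
    for tokens, c in zip(docs, labels):
        df[c].update(set(tokens))
    vocab = set().union(*(df[c].keys() for c in classes))

    weights = {c: {} for c in classes}
    for t in vocab:
        # Assumed base weight: standard deviation over classes of p(t | c).
        p = [df[c][t] / n_docs[c] for c in classes]
        mean_p = sum(p) / len(p)
        sigma = math.sqrt(sum((x - mean_p) ** 2 for x in p) / len(p))

        total = sum(df[c][t] for c in classes)
        for c in classes:
            r = df[c][t]           # documents with t inside class c
            s = total - r          # documents with t outside class c
            # Assumed distribution weight: mean of a Beta(r + 1, s + 1) posterior.
            dist = (r + 1) / (r + s + 2)
            weights[c][t] = sigma * dist
    return weights


def classify(weights, tokens):
    """Score each class as a term-frequency-weighted sum and return the best class."""
    tf = Counter(tokens)
    scores = {
        c: sum(n * w.get(t, 0.0) for t, n in tf.items())
        for c, w in weights.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy corpus, for illustration only.
docs = [
    ["ball", "goal", "team"],
    ["goal", "match"],
    ["stock", "market"],
    ["market", "bank", "team"],
]
labels = ["sport", "sport", "finance", "finance"]

weights = train(docs, labels)
print(classify(weights, ["goal", "team"]))
```

In this sketch a term that occurs equally often in every class, such as "team" in the toy corpus, gets a base weight of zero and so drops out of the score without an explicit feature selection step, which is in the spirit of the abstract's claim that no feature selection is required.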

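Cite this article: Jiao Qingzheng (焦庆争), 蔚承建. Text categorization approach based on probability standard deviation with evaluation of distribution information [J]. Journal of Computer Applications, 2009, 29(12): 3303-3306.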
