Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (11): 2963-2968. DOI: 10.11772/j.issn.1001-9081.2016.11.2963

• Paper from the 16th China Joint Academic Conference on Rough Sets and Soft Computing (CRSSC 2016) •

Text semantic classification algorithm based on risk decision

CHENG Yusheng (程玉胜), LIANG Hui (梁辉), WANG Yibin (王一宾), LI Kang (黎康)

  1. School of Computer and Information, Anqing Normal University, Anqing, Anhui 246011, China
  • Received: 2016-06-03 Revised: 2016-06-06 Online: 2016-11-10 Published: 2016-11-12
  • Corresponding author: CHENG Yusheng
  • About the authors: CHENG Yusheng (born 1969), male, from Tongcheng, Anhui, professor, Ph.D.; main research interests: rough set theory and algorithms, data mining. LIANG Hui (born 1989), male, from Hefei, Anhui, M.S. candidate; main research interests: data mining, Web intelligence. WANG Yibin (born 1970), male, from Anqing, Anhui, associate professor, M.S.; main research interest: data mining. LI Kang (born 1990), male, from Hefei, Anhui, M.S. candidate; main research interests: Web intelligence, data mining.
  • Supported by:
    This work is partially supported by the Key University Science Research Project of Anhui Province (KJ2013A177) and the Natural Science Foundation of Anhui Province (10040606Q42).

Abstract: Traditional text classification is mostly built on the vector space model and uses a hierarchical classification tree model for statistical analysis. Such models rarely incorporate the semantic information of feature items, so they may generate a large number of frequent semantic patterns and increase the number of classification paths. Drawing on the strong discriminative power of the essential Emerging Pattern (eEP) in classification and on the decision-theoretic rough set model based on minimum expected risk cost, a Text Semantic Classification algorithm with Threshold Optimization (TSCTO) is proposed. After the frequency distribution table of document feature items is obtained, the minimum threshold is first computed from the rough-set joint decision distribution density matrix, and the high-frequency words that satisfy this threshold are extracted. Semantic analysis is then combined with the inverse document frequency method to obtain high-frequency words based on semantic intra-class document frequency, and the eEP classification method is applied to obtain the simplest patterns. Finally, a text similarity score is calculated from a similarity formula and the semantic relevance degree provided by HowNet, and the thresholds are selected using three-way decision theory. The experimental results show that TSCTO yields some improvement in text classification performance.

Key words: decision-theoretic rough set model, text classification, semantics, feature item, essential Emerging Pattern (eEP)
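The abstract does not give the formulas TSCTO uses to select its thresholds. As a rough illustration only, the sketch below computes the acceptance and rejection thresholds of the standard decision-theoretic rough set model from a table of misclassification losses and applies them as a three-way decision. The loss values, the names `Losses`, `thresholds` and `three_way_decide`, and the assumption that a text similarity score normalized to [0, 1] can stand in for the conditional probability are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Losses:
    """Hypothetical loss table: cost of an action given the true state.

    First letter = action (p: accept, b: defer, n: reject);
    second letter = true state (p: document is in the class, n: it is not).
    """
    pp: float
    bp: float
    np_: float
    pn: float
    bn: float
    nn: float


def thresholds(l: Losses) -> tuple[float, float]:
    """Minimum-expected-risk thresholds (alpha, beta) of the standard model."""
    alpha = (l.pn - l.bn) / ((l.pn - l.bn) + (l.bp - l.pp))
    beta = (l.bn - l.nn) / ((l.bn - l.nn) + (l.np_ - l.bp))
    return alpha, beta


def three_way_decide(p: float, alpha: float, beta: float) -> str:
    """Accept if p >= alpha, reject if p <= beta, otherwise defer."""
    if p >= alpha:
        return "accept"
    if p <= beta:
        return "reject"
    return "defer"


# Illustrative losses only: correct decisions cost nothing, deferral is cheap.
losses = Losses(pp=0.0, bp=2.0, np_=6.0, pn=8.0, bn=2.0, nn=0.0)
alpha, beta = thresholds(losses)
for score in (0.8, 0.5, 0.2):
    print(score, three_way_decide(score, alpha, beta))
```

With these illustrative losses, scores at or above alpha are assigned to the class, scores at or below beta are rejected, and the band in between is deferred for further evidence, which is the role the three-way decision plays in choosing thresholds.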
