Text semantic classification algorithm based on risk decision

Journal of Computer Applications, 2016, Vol. 36, Issue (11): 2963-2968. DOI: 10.11772/j.issn.1001-9081.2016.11.2963

Paper from the 16th China Joint Conference on Rough Sets and Soft Computing (CRSSC 2016)

CHENG Yusheng, LIANG Hui, WANG Yibin, LI Kang

School of Computer and Information, Anqing Normal University, Anqing Anhui 246011, China

Received: 2016-06-03 Revised: 2016-06-06 Online: 2016-11-12 Published: 2016-11-10
Corresponding author: CHENG Yusheng
About the authors: CHENG Yusheng (1969-), male, from Tongcheng, Anhui, professor, Ph.D.; research interests: rough set theory and algorithms, data mining. LIANG Hui (1989-), male, from Hefei, Anhui, M.S. candidate; research interests: data mining, Web intelligence. WANG Yibin (1970-), male, from Anqing, Anhui, associate professor, M.S.; research interests: data mining. LI Kang (1990-), male, from Hefei, Anhui, M.S. candidate; research interests: Web intelligence, data mining.
Supported by:
This work is partially supported by the Key University Science Research Project of Anhui Province (KJ2013A177) and the Natural Science Foundation of Anhui Province (10040606Q42).

Abstract

Abstract: Most traditional text classification algorithms are built on the vector space model and use a hierarchical classification tree for statistical analysis. Such models mostly ignore the semantic information of feature items, so they may generate a large number of frequent semantic patterns and lengthen the classification paths. Combining the strong discriminative power of the essential Emerging Pattern (eEP) in classification with the decision-theoretic rough set model based on minimum expected risk cost, a Text Semantic Classification algorithm with Threshold Optimization (TSCTO) is proposed. First, after the feature-item frequency distribution table of the documents is obtained, the minimum threshold is calculated from the rough set joint decision distribution density matrix, and the high-frequency words that satisfy this threshold are extracted. Then, semantic analysis is combined with the inverse document frequency method to obtain high-frequency words based on the semantic intra-class document frequency. Next, the eEP classification method is used to obtain the simplest patterns. Finally, the text similarity score is calculated with a similarity formula and the semantic relevance degree provided by HowNet, and the thresholds are selected by three-way decision theory. The experimental results show that TSCTO improves text classification performance to some extent.

Key words: decision-theoretic rough set model, text classification, semantics, feature item, essential Emerging Pattern (eEP)
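
The abstract relies on thresholds derived from minimum expected risk and on three-way decisions. As a rough illustration only, the Python sketch below uses the standard decision-theoretic rough set formulation, in which a pair of thresholds (alpha, beta) follows from six loss values and a score is then accepted, rejected, or deferred. The class `Losses`, the functions `thresholds` and `decide`, and all numeric loss values are hypothetical and are not taken from the paper; TSCTO's actual threshold computation may differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Losses:
    """Hypothetical loss values (assumption, not from the paper).

    First letter: action taken (p = accept, b = defer, n = reject).
    Second letter: true state (p = document is in the class, n = it is not).
    """
    pp: float
    bp: float
    np_: float
    pn: float
    bn: float
    nn: float


def thresholds(l: Losses) -> Tuple[float, float]:
    """Return (alpha, beta) from the standard minimum-expected-risk derivation."""
    alpha = (l.pn - l.bn) / ((l.pn - l.bn) + (l.bp - l.pp))
    beta = (l.bn - l.nn) / ((l.bn - l.nn) + (l.np_ - l.bp))
    if not 0.0 <= beta < alpha <= 1.0:
        raise ValueError("loss values do not yield 0 <= beta < alpha <= 1")
    return alpha, beta


def decide(score: float, alpha: float, beta: float) -> str:
    """Three-way decision on a membership score in [0, 1]."""
    if score >= alpha:
        return "accept"   # positive region
    if score <= beta:
        return "reject"   # negative region
    return "defer"        # boundary region: postpone the decision


if __name__ == "__main__":
    alpha, beta = thresholds(Losses(pp=0, bp=2, np_=8, pn=6, bn=1, nn=0))
    for s in (0.9, 0.5, 0.1):
        print(s, decide(s, alpha, beta))
```

With these illustrative losses, alpha is 5/7 and beta is 1/7, so a similarity score of 0.5 falls in the boundary region and is deferred rather than forced into a class.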

CLC Number: TP391.4

CHENG Yusheng, LIANG Hui, WANG Yibin, LI Kang. Text semantic classification algorithm based on risk decision[J]. Journal of Computer Applications, 2016, 36(11): 2963-2968.


