Candidate category search algorithm in deep level classification

doi:10.11772/j.issn.1001-9081.2017.03.635

Journal of Computer Applications, 2017, Vol. 37, Issue 3: 635-639.

The 4th Academic Conference on Big Data (CCF BIGDATA2016)

ZHANG Zhonglin, LIU Shuchang, JIANG Fentao

School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou Gansu 730070, China

Received: 2016-09-18 Revised: 2016-10-18 Online: 2017-03-22 Published: 2017-03-10
Corresponding author: LIU Shuchang
About the authors: ZHANG Zhonglin (born 1965), male, from Hengshui, Hebei, professor, Ph.D., CCF member; research interests: intelligent information processing, software engineering. LIU Shuchang (born 1989), male, from Jinchang, Gansu, master's student; research interest: data mining. JIANG Fentao (born 1991), female, from Dingxi, Gansu, master's student; research interest: Web data mining.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61662043).

Abstract: To address the low classification accuracy and slow processing speed of deep hierarchical classification, a candidate category search algorithm for the text to be classified is proposed. First, a two-stage approach of search followed by classification is introduced. Drawing on implicit domain knowledge, such as the structure of the category hierarchy tree and the correlations between categories, the algorithm analyzes category-level weights and dynamically updates feature terms, so that each node of the hierarchy tree receives a more discriminative feature set. Next, depth-first search combined with a threshold-based pruning strategy narrows the search range and yields the best candidate categories for the text to be classified. Finally, the classical K Nearest Neighbor (KNN) and Support Vector Machine (SVM) classification algorithms are applied to the candidate categories for classification tests and comparative analysis. The experimental results show that the overall classification performance of the proposed algorithm is better than that of traditional classification algorithms, and that its average F₁ value is about 6% higher than that of a heuristic search algorithm based on a greedy strategy. The algorithm significantly improves the classification accuracy of deep text classification.
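The search stage described in the abstract can be illustrated with a small sketch. The abstract does not specify the tree representation, the relevance measure, or the threshold value, so everything below is an assumption made for illustration: the `CategoryNode` class, the weighted-overlap `score` function, the `search_candidates` function with its `threshold` and `top_k` parameters, and the toy category tree are all hypothetical and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CategoryNode:
    """One node of a category hierarchy tree (hypothetical structure)."""
    name: str
    features: dict[str, float]  # feature term -> weight for this node
    children: list[CategoryNode] = field(default_factory=list)


def score(doc_terms: dict[str, float], node: CategoryNode) -> float:
    """Assumed relevance measure: weighted overlap of document and node terms."""
    return sum(w * node.features.get(t, 0.0) for t, w in doc_terms.items())


def search_candidates(root: CategoryNode, doc_terms: dict[str, float],
                      threshold: float, top_k: int = 3) -> list[str]:
    """Depth-first search with threshold pruning over the category tree.

    A child whose score falls below `threshold` is skipped together with
    its whole subtree. Surviving leaves are the candidate categories.
    """
    candidates: list[tuple[float, str]] = []
    stack = [root]  # a LIFO stack gives depth-first order
    while stack:
        node = stack.pop()
        for child in node.children:
            s = score(doc_terms, child)
            if s < threshold:
                continue  # prune this child and everything below it
            if child.children:
                stack.append(child)  # internal node: keep descending
            else:
                candidates.append((s, child.name))  # leaf: candidate
    # Highest score first; ties broken by name for a stable result.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in candidates[:top_k]]


# Toy hierarchy and document, invented purely for this example.
root = CategoryNode("root", {}, [
    CategoryNode("Science", {"physics": 1.0, "biology": 1.0, "research": 0.5}, [
        CategoryNode("Physics", {"quantum": 1.0, "physics": 1.0}),
        CategoryNode("Biology", {"cell": 1.0, "biology": 1.0}),
    ]),
    CategoryNode("Sports", {"football": 1.0, "match": 1.0}, [
        CategoryNode("Football", {"football": 1.0, "goal": 1.0}),
    ]),
])
doc = {"quantum": 2.0, "physics": 1.0, "research": 1.0}

print(search_candidates(root, doc, threshold=1.0))  # ['Physics']
```

With a threshold of 1.0, the `Sports` subtree is pruned at the first level and only `Physics` survives as a candidate. In the two-stage scheme the abstract describes, a classifier such as KNN or SVM would then be trained or applied only on the returned candidate categories rather than on the full hierarchy.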

Key words: deep text classification, class hierarchy, tree-structured class hierarchy, depth first search, candidate category

CLC Number: TP301.6

ZHANG Zhonglin, LIU Shuchang, JIANG Fentao. Candidate category search algorithm in deep level classification[J]. Journal of Computer Applications, 2017, 37(3): 635-639.
