Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (3): 635-639. DOI: 10.11772/j.issn.1001-9081.2017.03.635

• The 4th Big Data Academic Conference (CCF BIGDATA2016)

Candidate category search algorithm in deep level classification

ZHANG Zhonglin, LIU Shuchang, JIANG Fentao

  1. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou Gansu 730070, China
  • Received: 2016-09-18 Revised: 2016-10-18 Online: 2017-03-10 Published: 2017-03-22
  • Corresponding author: LIU Shuchang
  • About the authors: ZHANG Zhonglin (1965-), male, born in Hengshui, Hebei; professor, Ph.D., CCF member; research interests: intelligent information processing, software engineering. LIU Shuchang (1989-), male, born in Jinchang, Gansu; master's student; research interest: data mining. JIANG Fentao (1991-), female, born in Dingxi, Gansu; master's student; research interest: Web data mining.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61662043).

Abstract: To address the low classification accuracy and slow processing speed of deep-level classification, a candidate category search algorithm for the text to be classified is proposed. First, a two-stage approach of search followed by classification is introduced. Drawing on implicit domain knowledge such as the structure of the category hierarchy tree and the relations between categories, the algorithm analyzes category-level weights and dynamically updates feature terms, building a more discriminative feature set for each node of the tree. Next, depth-first search combined with a threshold-based pruning strategy narrows the search range and yields the best candidate categories for the text to be classified. Finally, the classical K Nearest Neighbor (KNN) and Support Vector Machine (SVM) classification algorithms are applied to the candidate categories for classification tests and comparative analysis. The experimental results show that the overall classification performance of the proposed algorithm is better than that of the traditional classification algorithms, and that its average F1 value is about 6% higher than that of the heuristic search algorithm based on a greedy strategy. The algorithm significantly improves the classification accuracy of deep-level text classification.

Key words: deep text classification, class hierarchy, tree-structured class hierarchy, depth-first search, candidate category
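
The search stage described in the abstract (depth-first traversal of the category hierarchy tree with threshold pruning) can be illustrated with a minimal Python sketch. The abstract does not specify the scoring function, the threshold values, or the feature weighting, so everything here is an assumption made for illustration: `CategoryNode`, `cosine`, `search_candidates`, the `threshold` and `top_k` parameters, the use of cosine similarity, and the toy tree. The paper's category-level weighting and dynamic feature updating are not reproduced.

```python
from dataclasses import dataclass, field
from math import sqrt


@dataclass
class CategoryNode:
    """One node of the category hierarchy tree (hypothetical structure)."""
    name: str
    features: dict[str, float]  # term -> weight; assumed to be precomputed
    children: list["CategoryNode"] = field(default_factory=list)


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search_candidates(root: CategoryNode, doc: dict[str, float],
                      threshold: float, top_k: int = 3) -> list[str]:
    """Depth-first search with threshold pruning.

    A node scoring below `threshold` is discarded together with its whole
    subtree. Leaf categories that survive are ranked by score and the
    best `top_k` are returned as candidate categories.
    """
    candidates: list[tuple[float, str]] = []
    stack = list(root.children)  # the root is a virtual node and is not scored
    while stack:
        node = stack.pop()
        score = cosine(doc, node.features)
        if score < threshold:
            continue  # prune: none of this node's descendants are visited
        if node.children:
            stack.extend(node.children)
        else:
            candidates.append((score, node.name))
    candidates.sort(reverse=True)
    return [name for _, name in candidates[:top_k]]


# Toy hierarchy (invented for illustration only).
root = CategoryNode("root", {}, [
    CategoryNode("Sports", {"ball": 1.0, "team": 1.0, "match": 1.0}, [
        CategoryNode("Football", {"ball": 1.0, "goal": 1.0, "team": 1.0}),
        CategoryNode("Tennis", {"ball": 1.0, "racket": 1.0, "serve": 1.0}),
    ]),
    CategoryNode("Science", {"experiment": 1.0, "theory": 1.0}, [
        CategoryNode("Physics", {"quantum": 1.0, "theory": 1.0}),
    ]),
])
doc = {"ball": 1.0, "goal": 2.0, "team": 1.0}

loose = search_candidates(root, doc, threshold=0.2)
strict = search_candidates(root, doc, threshold=0.3)
```

In this toy run the `Science` subtree is pruned at the first level, so `Physics` is never scored. Raising the threshold from 0.2 to 0.3 also drops `Tennis`, leaving `Football` as the only candidate. In a two-stage setup, a classifier such as KNN or SVM would then be trained or queried only over the returned candidate categories rather than over every leaf of the hierarchy.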

CLC Number: