Abstract: To address the low classification accuracy and slow processing speed of deep classification, a candidate category search algorithm for text classification is proposed. First, a two-stage scheme of search followed by classification is introduced. The weighting of the category hierarchy is analyzed, and features are updated dynamically by combining the structural characteristics of the category hierarchy tree, the links between related categories, and other implicit domain knowledge. At the same time, a more discriminative feature set is built for each node of the category hierarchy tree. Next, a depth-first search algorithm is used to narrow the search range, and a threshold-based pruning strategy is applied to find the best candidate categories for the text to be classified. Finally, the classical K Nearest Neighbor (KNN) and Support Vector Machine (SVM) classification algorithms are applied to the candidate categories for classification tests and comparative analysis. The experimental results show that the overall classification performance of the proposed algorithm is superior to that of the traditional classification algorithms, and that its average F1 value is about 6% higher than that of the heuristic search algorithm based on a greedy strategy. The algorithm significantly improves the accuracy of deep text classification.
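The search stage described in the abstract can be illustrated with a minimal sketch of depth-first search with threshold pruning over a category hierarchy tree. The abstract does not specify the node features, the scoring function, or the threshold value, so everything named here (`CategoryNode`, `score`, `search_candidates`, the cosine-similarity score, and the toy tree) is an assumption made for illustration and not the authors' implementation. The second stage (KNN or SVM over the returned candidates) is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class CategoryNode:
    """Hypothetical node of a category hierarchy tree."""
    name: str
    features: dict[str, float]  # term -> weight for this node (assumed form)
    children: list["CategoryNode"] = field(default_factory=list)


def score(doc_terms: dict[str, float], node: CategoryNode) -> float:
    """Assumed relevance score: cosine similarity of term-weight vectors."""
    dot = sum(w * node.features.get(t, 0.0) for t, w in doc_terms.items())
    norm_d = sum(w * w for w in doc_terms.values()) ** 0.5
    norm_n = sum(w * w for w in node.features.values()) ** 0.5
    return dot / (norm_d * norm_n) if norm_d and norm_n else 0.0


def search_candidates(root, doc_terms, threshold):
    """Depth-first search that skips any subtree scoring below threshold."""
    candidates = []
    stack = list(reversed(root.children))  # the root itself is not a category
    while stack:
        node = stack.pop()
        s = score(doc_terms, node)
        if s < threshold:
            continue  # prune: none of this node's descendants are visited
        if node.children:
            stack.extend(reversed(node.children))
        else:
            candidates.append((node.name, s))  # leaf category kept as candidate
    # Best-scoring candidates first
    return sorted(candidates, key=lambda c: c[1], reverse=True)


# Toy hierarchy and document, invented for this example
root = CategoryNode("root", {}, [
    CategoryNode("Sports", {"football": 1.0, "match": 1.0}, [
        CategoryNode("Soccer", {"football": 1.0, "goal": 1.0}),
        CategoryNode("Tennis", {"racket": 1.0, "serve": 1.0}),
    ]),
    CategoryNode("Science", {"physics": 1.0, "experiment": 1.0}, [
        CategoryNode("Physics", {"physics": 1.0, "quantum": 1.0}),
    ]),
])
doc = {"football": 1.0, "goal": 1.0, "match": 1.0}

candidates = search_candidates(root, doc, threshold=0.3)
print(candidates)
```

In this toy run the "Science" subtree is pruned at its top node, so only the leaves under "Sports" are scored; a second-stage classifier would then choose among the surviving candidates instead of among every leaf in the hierarchy.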