Abstract: To address the problem of feature selection in text categorization, a feature selection algorithm that maximizes feature interaction, called Max-Interaction, was proposed. First, an information-theoretic feature selection model was established based on Joint Mutual Information (JMI). Second, the independence assumptions underlying existing feature selection algorithms were relaxed, recasting feature selection as an interaction optimization problem. Third, a max-min (maximum of the minimum) criterion was employed to avoid overestimating high-order interactions. Finally, a text categorization feature selection algorithm combining sequential forward search with high-order interaction was obtained. In the comparison experiments, Max-Interaction improved average classification accuracy by 5.5% over Interaction Weight Feature Selection (IWFS) and by 6% over Chi-square, and it outperformed the other methods on 93% of the experiments. Therefore, Max-Interaction can effectively improve the performance of feature selection in text categorization.
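The abstract outlines a greedy pipeline: seed with the single most informative feature, then repeatedly add the candidate whose minimum joint mutual information with the already-selected features is largest (the max-min device said to avoid overestimating high-order interaction). The sketch below is a hypothetical reconstruction from the abstract alone, for discrete features; the function names, the exact scoring criterion, and the seeding rule are assumptions, not the authors' published method.

```python
# Hypothetical Max-Interaction-style selector, reconstructed from the abstract:
# sequential forward search scored by a max-min joint mutual information
# criterion. Details are assumptions; the paper's exact criterion may differ.
from collections import Counter
from math import log2

def entropy(xs):
    """Shannon entropy H(X) of a sequence of discrete (hashable) symbols."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def mi(xs, ys):
    """Mutual information I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def joint_mi(xs, zs, ys):
    """Joint mutual information I(X, Z; Y) = H(X, Z) + H(Y) - H(X, Z, Y)."""
    xz = list(zip(xs, zs))
    return entropy(xz) + entropy(ys) - entropy(list(zip(xz, ys)))

def max_interaction(features, y, k):
    """Select k feature names by greedy max-min joint mutual information.

    features: dict mapping feature name -> list of discrete values;
    y: list of class labels, same length as each feature column.
    """
    remaining = set(features)
    # Seed with the single feature most informative about the class.
    first = max(remaining, key=lambda f: mi(features[f], y))
    selected = [first]
    remaining.discard(first)
    while remaining and len(selected) < k:
        # Max-min step: a candidate's score is its WORST joint MI with any
        # already-selected feature, which damps overestimated interactions.
        best = max(remaining,
                   key=lambda f: min(joint_mi(features[f], features[s], y)
                                     for s in selected))
        selected.append(best)
        remaining.discard(best)
    return selected
```

In a real text-categorization setting the feature columns would be discretized term statistics (e.g. binarized term occurrence), and the joint-MI estimates would come from document-term counts.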
TANG Xiaochuan, QIU Xiwei, LUO Liang. Interaction based algorithm for feature selection in text categorization. Journal of Computer Applications, 2018, 38(7): 1857-1861.