Journal of Computer Applications, 2018, Vol. 38, Issue (7): 1857-1861. DOI: 10.11772/j.issn.1001-9081.2018010114


Interaction based algorithm for feature selection in text categorization

TANG Xiaochuan, QIU Xiwei, LUO Liang   

  1. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu Sichuan 611731, China
  • Received: 2018-01-16 Revised: 2018-02-28 Online: 2018-07-10 Published: 2018-07-12
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61602094).

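  • Corresponding author: TANG Xiaochuan
  • About the authors: TANG Xiaochuan (born 1986), male, from Chengdu, Sichuan, Ph.D. candidate, CCF member; main research interests: feature selection, machine learning, big data analytics. QIU Xiwei (born 1980), male, from Yibin, Sichuan, Ph.D.; main research interests: cloud computing, big data, energy-efficient computing. LUO Liang (born 1980), male, from Hanzhong, Shaanxi, lecturer, Ph.D.; main research interests: cloud computing, big data, energy consumption modeling.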

Abstract: To address feature selection in text categorization, a feature selection algorithm that accounts for interactions among features, called Max-Interaction, was proposed. Firstly, an information-theoretic feature selection model was established based on Joint Mutual Information (JMI). Secondly, the assumptions made by existing feature selection algorithms were relaxed, and the feature selection problem was transformed into an interaction optimization problem. Thirdly, a max-min (maximum of the minimum) method was employed to avoid overestimating higher-order interactions. Finally, a feature selection algorithm for text categorization based on sequential forward search and higher-order interaction was proposed. In the comparison experiments, the average classification accuracy of Max-Interaction was 5.5% higher than that of Interaction Weight Feature Selection (IWFS) and 6% higher than that of Chi-square, and Max-Interaction achieved higher classification accuracy than the compared methods in 93% of the experiments. Therefore, Max-Interaction can effectively exploit feature interactions to improve the performance of feature selection in text categorization.

Key words: feature selection, text categorization, interaction, Mutual Information (MI), information measure
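
The abstract names the ingredients (mutual information, interaction, a max-min criterion, sequential forward search) but not the exact scoring function, so the following is only a minimal illustrative sketch and not the authors' Max-Interaction. It assumes discrete features, considers only pairwise feature interaction with the class (the paper works with higher-order interaction), and scores a candidate as its relevance plus its minimum interaction with the features already selected. All function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Joint Shannon entropy (in bits) of one or more discrete columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum((c / n) * log2(c / n) for c in Counter(joint).values())


def mutual_info(xs, y):
    """I(X1,...,Xm ; Y), where xs is a list of feature columns."""
    return entropy(*xs) + entropy(y) - entropy(*xs, y)


def interaction(xk, xj, y):
    """Interaction gain I(Xk,Xj;Y) - I(Xk;Y) - I(Xj;Y); positive means synergy."""
    return mutual_info([xk, xj], y) - mutual_info([xk], y) - mutual_info([xj], y)


def forward_select(features, y, k):
    """Greedy sequential forward search with an ASSUMED max-of-min score."""
    remaining = dict(features)
    # Seed with the single most relevant feature (ties: first in dict order).
    first = max(remaining, key=lambda name: mutual_info([remaining[name]], y))
    selected = [first]
    del remaining[first]

    while remaining and len(selected) < k:
        def score(name):
            col = remaining[name]
            relevance = mutual_info([col], y)
            # Take the minimum over selected features so that one strong
            # pairing cannot inflate the estimate for the whole subset.
            worst = min(interaction(col, features[s], y) for s in selected)
            return relevance + worst

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Toy data: y = a XOR b, so neither a nor b is informative on its own,
# while n is independent noise.
toy_features = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "n": [0, 0, 0, 0, 1, 1, 1, 1],
    "b": [0, 1, 0, 1, 0, 1, 0, 1],
}
toy_y = [0, 1, 1, 0, 0, 1, 1, 0]
chosen = forward_select(toy_features, toy_y, 2)
```

On the toy data, every feature has zero individual relevance, so a purely relevance-based ranking cannot separate `b` from the noise feature `n`; the interaction term is what lets the search pair `b` with `a`.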

