Journal of Computer Applications ›› 2019, Vol. 39 ›› Issue (11): 3127-3133. DOI: 10.11772/j.issn.1001-9081.2019050822

• The 2019 China Conference on Granular Computing and Knowledge Discovery (CGCKD2019)

Feature selection method for imbalanced text sentiment classification based on three-way decisions

WAN Zhichao1, HU Feng1,2, DENG Weibin2   

  1. College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China;
  2. Chongqing Key Laboratory of Computational Intelligence (Chongqing University of Posts and Telecommunications), Chongqing 400065, China
  • Received:2019-05-06 Revised:2019-05-23 Online:2019-11-10 Published:2019-09-11
  • Supported by:
    This work is partially supported by the National Key Research and Development Program of China (2018YFC0832100, 2018YFC0832102), the National Natural Science Foundation of China (61751312, 61533020, 61309014), and the Chongqing Research Program of Basic Research and Frontier Technology (cstc2017jcyjAX0408).

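  • Corresponding author: WAN Zhichao
  • About the authors: WAN Zhichao (born 1995), male, from Hanchuan, Hubei, M.S. candidate; research interests: natural language processing and machine learning. HU Feng (born 1978), male, from Tianmen, Hubei, professor, Ph.D.; research interests: rough sets and data mining. DENG Weibin (born 1978), male, from Chongqing, professor, Ph.D.; research interests: uncertain decision-making methods.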

Abstract: Traditional feature selection methods are severely limited in sentiment classification of imbalanced text. The limitations stem mainly from high feature dimensionality, feature sparsity, and imbalanced feature distribution, all of which substantially reduce classification accuracy. Based on the distribution of sentiment features in imbalanced text and on the idea of three-way decisions, a Three-Way Decisions-Feature Selection algorithm (TWD-FS) is proposed for imbalanced text sentiment classification. The method combines two supervised feature selection methods and further filters the selected feature words so that the final set has both maximum between-class scatter and minimum within-class scatter, which reduces the number of feature words and the feature dimensionality. In addition, combining positive and negative sentiment features alleviates the imbalance of sentiment features and improves the classification of the minority sentiment class. Experimental results on the COAE2013 Chinese microblog imbalanced dataset and several other datasets show that TWD-FS can effectively improve the accuracy of imbalanced text sentiment classification.

Key words: imbalanced text, feature selection, sentiment classification, supervised, three-way decisions
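
The selection idea summarized in the abstract can be illustrated with a rough sketch. It is not the authors' TWD-FS implementation: the abstract does not name the two supervised methods, so the chi-square statistic and a document-frequency rate gap are stand-ins, and the thresholds `chi_min` and `gap_min`, the parameter `k_per_class`, and the function names are all hypothetical. The three regions (accept, defer, reject) follow one plausible reading of three-way decisions, in which a term is accepted when both measures agree, rejected when neither supports it, and deferred otherwise. The scatter-based filtering step mentioned in the abstract is omitted.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 term/class table.

    a: positive docs containing the term, b: negative docs containing it,
    c: positive docs without it,          d: negative docs without it.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def rate_gap(a, b, c, d):
    """Signed gap in document-frequency rates: P(term | pos) - P(term | neg)."""
    pos, neg = a + c, b + d
    return (a / pos if pos else 0.0) - (b / neg if neg else 0.0)


def select_features(docs, labels, chi_min, gap_min, k_per_class):
    """Split terms into three regions, then balance the accepted terms.

    docs is a list of token lists; labels holds "pos" or "neg" per document.
    All thresholds are illustrative assumptions, not values from the paper.
    """
    n_pos = sum(1 for y in labels if y == "pos")
    n_neg = len(labels) - n_pos
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        (df_pos if y == "pos" else df_neg).update(set(tokens))

    regions = {"accept": [], "defer": [], "reject": []}
    for term in sorted(set(df_pos) | set(df_neg)):
        a, b = df_pos[term], df_neg[term]
        c, d = n_pos - a, n_neg - b
        chi, gap = chi_square(a, b, c, d), rate_gap(a, b, c, d)
        # Two votes -> accept, one -> defer (boundary region), none -> reject.
        votes = int(chi >= chi_min) + int(abs(gap) >= gap_min)
        regions[("reject", "defer", "accept")[votes]].append((term, chi, gap))

    # Keep equally many positive- and negative-leaning terms so that the
    # majority class does not dominate the final feature set.
    ranked = sorted(regions["accept"], key=lambda item: (-item[1], item[0]))
    pos_terms = [t for t, _, g in ranked if g > 0][:k_per_class]
    neg_terms = [t for t, _, g in ranked if g < 0][:k_per_class]
    return pos_terms + neg_terms, regions


# Toy imbalanced corpus: four positive documents, two negative ones.
docs = [["good", "film"], ["good", "plot"], ["good", "film"],
        ["good", "cast"], ["bad", "film"], ["bad", "plot"]]
labels = ["pos", "pos", "pos", "pos", "neg", "neg"]
selected, regions = select_features(docs, labels, chi_min=3.0,
                                    gap_min=0.2, k_per_class=1)
print(selected)
print({name: [t for t, _, _ in items] for name, items in regions.items()})
```

In this sketch, deferred terms are simply set aside; a fuller treatment would re-examine the boundary region with additional evidence before a final accept or reject decision.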

