Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (6): 1842-1854. DOI: 10.11772/j.issn.1001-9081.2022050691

• Data science and technology •

Feature selection for imbalanced data based on neighborhood tolerance mutual information and whale optimization algorithm

Lin SUN1,2, Jinxu HUANG1, Jiucheng XU1

  1. College of Computer and Information Engineering, Henan Normal University, Xinxiang Henan 453007, China
  2. Engineering Lab of Intelligence Business and Internet of Things of Henan Province (Henan Normal University), Xinxiang Henan 453007, China
  • Received:2022-05-12 Revised:2022-11-05 Accepted:2022-11-15 Online:2023-06-08 Published:2023-06-10
  • Contact: Lin SUN
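  • About author: SUN Lin, born in 1979, Ph. D., associate professor, CCF member. His research interests include granular computing, data mining, machine learning, bioinformatics. Email: sunlin@htu.edu.cn
    HUANG Jinxu, born in 1995, M. S. candidate. His research interests include granular computing, data mining.
    XU Jiucheng, born in 1963, Ph. D., professor. His research interests include granular computing, data mining.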
  • Supported by:
    National Natural Science Foundation of China (62076089); Key Scientific and Technological Project of Henan Province (212102210136)

基于邻域容差互信息和鲸鱼优化算法的非平衡数据特征选择

孙林1,2, 黄金旭1, 徐久成1

  1. 河南师范大学 计算机与信息工程学院, 河南 新乡 453007
  2. 智慧商务与物联网技术河南省工程实验室(河南师范大学), 河南 新乡 453007
  • 通讯作者: 孙林
  • 作者简介:孙林(1979—),男,河南南阳人,副教授,博士,CCF会员,主要研究方向:粒计算、数据挖掘、机器学习、生物信息学。Email:sunlin@htu.edu.cn
    黄金旭(1995—),男,河南周口人,硕士研究生,主要研究方向:粒计算、数据挖掘
    徐久成(1963—),男,河南洛阳人,教授,博士生导师,博士,CCF高级会员,主要研究方向:粒计算、数据挖掘。
  • 基金资助:
    国家自然科学基金资助项目(62076089);河南省科技攻关项目(212102210136)

Abstract:

To address the problems that most feature selection algorithms do not fully consider the non-uniform class distribution of data, the correlation between features, and the influence of different parameters on feature selection results, a feature selection method for imbalanced data based on neighborhood tolerance mutual information and the Whale Optimization Algorithm (WOA) was proposed. Firstly, for binary and multi-class datasets in an incomplete neighborhood decision system, two kinds of feature importance for imbalanced data were defined on the basis of the upper and lower boundary regions. Then, to fully reflect the decision-making ability of features and the correlation between features, neighborhood tolerance mutual information was constructed. Finally, by integrating the feature importance of imbalanced data with the neighborhood tolerance mutual information, a Feature Selection for Imbalanced Data based on Neighborhood tolerance mutual information (FSIDN) algorithm was designed, in which the optimal parameters of the feature selection algorithm were obtained by WOA, and a nonlinear convergence factor and an adaptive inertia weight were introduced to improve WOA and keep it from falling into local optima. Experimental results on 8 benchmark functions show that the improved WOA has good optimization performance, and feature selection experiments on 13 binary and 4 multi-class imbalanced datasets show that, compared with other related algorithms, the proposed algorithm can effectively select feature subsets with good classification performance.

Key words: imbalanced data, feature selection, incomplete neighborhood decision system, mutual information, Whale Optimization Algorithm (WOA)
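
As an illustration of the WOA improvement summarized in the abstract, the following Python sketch runs a whale optimization loop with a nonlinear convergence factor and an inertia weight on a simple benchmark function. The abstract does not state the formulas, so the cosine schedule for `a`, the linear schedule for `w`, the places where `w` is applied, and the names `improved_woa` and `sphere` are all assumptions made for illustration, not the authors' definitions.

```python
import math
import random


def sphere(x):
    """Benchmark objective: sum of squares, minimum 0 at the origin."""
    return sum(v * v for v in x)


def improved_woa(objective, dim, bounds, n_whales=20, max_iter=200, seed=0):
    """WOA sketch with an assumed nonlinear convergence factor and inertia weight."""
    rng = random.Random(seed)
    lo, hi = bounds
    whales = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_whales)]
    best = min(whales, key=objective)[:]
    best_fit = objective(best)

    for t in range(max_iter):
        frac = t / max_iter
        # Assumption: convergence factor decays from 2 towards 0 along a cosine
        # curve, replacing the linear schedule of standard WOA.
        a = 2.0 * math.cos(0.5 * math.pi * frac)
        # Assumption: inertia weight shrinks linearly from 0.9 towards 0.4.
        w = 0.9 - 0.5 * frac

        for i, x in enumerate(whales):
            A = 2.0 * a * rng.random() - a
            C = 2.0 * rng.random()
            if rng.random() < 0.5:
                # Encircle the best whale when |A| < 1, else explore around a
                # randomly chosen whale.
                guide = best if abs(A) < 1.0 else whales[rng.randrange(n_whales)]
                new = [w * g - A * abs(C * g - v) for g, v in zip(guide, x)]
            else:
                # Spiral (bubble-net) update around the best whale, with b = 1.
                l = rng.uniform(-1.0, 1.0)
                spiral = math.exp(l) * math.cos(2.0 * math.pi * l)
                new = [abs(g - v) * spiral + w * g for g, v in zip(best, x)]
            # Clamp the new position to the search bounds.
            whales[i] = [min(max(v, lo), hi) for v in new]

        # Keep the best position found so far.
        cand = min(whales, key=objective)
        cand_fit = objective(cand)
        if cand_fit < best_fit:
            best, best_fit = cand[:], cand_fit

    return best, best_fit


if __name__ == "__main__":
    position, fitness = improved_woa(sphere, dim=5, bounds=(-10.0, 10.0))
    print(position, fitness)
```

In the setting the abstract describes, the objective would instead score a candidate parameter setting of the feature selection algorithm; the sphere function here only stands in for one of the benchmark functions.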

摘要:

针对大多数特征选择算法未充分考虑数据的类不均匀分布、特征之间的相关性和不同参数对特征选择结果的影响等问题,提出一种基于邻域容差互信息和鲸鱼优化算法(WOA)的非平衡数据特征选择方法。首先,在不完备邻域决策系统中,针对二分类数据集和多分类数据集,基于上、下边界域定义两种非平衡数据的特征重要度;然后,为充分反映特征的决策能力和特征之间的相关性,构建邻域容差互信息;最后,通过将非平衡数据特征重要度和邻域容差互信息相结合,提出基于邻域容差互信息的非平衡数据特征选择(FSIDN)算法,该算法采用WOA获取特征选择算法中的最优参数,并引入非线性收敛因子和自适应惯性权重来改进WOA,以解决WOA易陷入局部最优的问题。在8个基准函数上进行实验,结果表明改进的WOA具有较好的优化性能;在13个二分类和4个多分类的非平衡数据集上进行特征选择实验,实验结果表明,与其他相关算法相比,所提算法能够有效地选择出具有良好分类性能的特征子集。

关键词: 非平衡数据, 特征选择, 不完备邻域决策系统, 互信息, 鲸鱼优化算法

CLC Number: