Journal of Computer Applications, 2022, Vol. 42, Issue 2: 475-484. DOI: 10.11772/j.issn.1001-9081.2021050957

• Data Science and Technology •

Feature selection algorithm for imbalanced data based on pseudo-label consistency

Yiheng LI, Chenxi DU, Yanyan YANG, Xiangyu LI

  1. School of Software Engineering, Beijing Jiaotong University, Beijing 100044, China
  • Received: 2021-03-25 Revised: 2021-07-21 Accepted: 2021-07-21 Online: 2022-02-21 Published: 2022-02-10
  • Contact: Yanyan YANG
  • About the authors: LI Yiheng, born in 2001, from Linfen, Shanxi. His research interests include machine learning.
    DU Chenxi, born in 2001, from Meihekou, Jilin. His research interests include machine learning.
    YANG Yanyan, born in 1986, from Zhengzhou, Henan, Ph. D., lecturer, CCF member. Her research interests include machine learning, granular computing, and rough sets.
    LI Xiangyu, born in 1990, from Zhengzhou, Henan, Ph. D., lecturer. Her research interests include bioinformatics and machine learning.
  • Supported by:
    National Natural Science Foundation of China (61806108); Fundamental Research Funds for the Central Universities (2019RC055); Beijing Training Program of Innovation for College Students (202110004107)

Abstract:

Most granular-computing feature selection algorithms do not account for class imbalance in the data. To address this, a feature selection algorithm for class-imbalanced data that integrates a pseudo-label strategy is proposed. First, to make feature selection on class-imbalanced data easier to study, the notions of sample consistency and dataset consistency are redefined, and a corresponding greedy forward search algorithm for feature selection is designed. Second, a pseudo-label strategy is introduced to balance the class distribution of the data, and the learned pseudo-label of each sample is integrated into the consistency measure, yielding a pseudo-label consistency that is used to evaluate the features of a class-imbalanced dataset. Finally, by preserving the pseudo-label consistency of the class-imbalanced dataset, a Pseudo-Label Consistency based Feature Selection (PLCFS) algorithm for class-imbalanced data is designed. Experimental results indicate that the performance of the proposed PLCFS is second only to that of the max-Relevance and Min-Redundancy (mRMR) algorithm, and that it outperforms both the Relief algorithm and the Consistency-based Feature Selection (CFS) algorithm.

Key words: granular computing, pseudo-label, class-imbalanced data, feature selection, consistency measure
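
As a rough illustration of the "consistency measure plus greedy forward search" idea mentioned in the abstract, the sketch below uses a classical majority-based consistency on discrete features. This is an assumption made for illustration only: the abstract does not give the paper's redefined sample and dataset consistency, nor how pseudo-labels are learned, so the functions `consistency` and `greedy_forward_select` are hypothetical names and not the authors' PLCFS. A pseudo-label variant would presumably pass learned pseudo-labels in place of `y`.

```python
from collections import Counter, defaultdict


def consistency(X, y, features):
    """Majority-based consistency (an assumed, classical definition).

    Samples are grouped by their values on `features`; the score is the
    fraction of samples whose label equals the majority label of their group.
    """
    groups = defaultdict(Counter)
    for row, label in zip(X, y):
        key = tuple(row[f] for f in features)
        groups[key][label] += 1
    majority_total = sum(max(counts.values()) for counts in groups.values())
    return majority_total / len(y)


def greedy_forward_select(X, y):
    """Add one feature at a time until the subset matches the consistency
    of the full feature set."""
    n_features = len(X[0])
    target = consistency(X, y, list(range(n_features)))
    selected = []
    current = consistency(X, y, selected)
    while current < target:
        remaining = [f for f in range(n_features) if f not in selected]
        # Pick the candidate whose addition yields the highest consistency.
        best = max(remaining, key=lambda f: consistency(X, y, selected + [f]))
        selected.append(best)
        current = consistency(X, y, selected)
    return selected


# Toy data: the label is determined by feature 1 alone.
X = [[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]]
y = [0, 1, 0, 1]
selected = greedy_forward_select(X, y)
```

On this toy data the search stops after adding the single feature that determines the label. The loop terminates because this consistency score never decreases when a feature is added and reaches the target once every feature has been selected.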
