Journal of Computer Applications (《计算机应用》) ›› 2023, Vol. 43 ›› Issue (12): 3779-3789. DOI: 10.11772/j.issn.1001-9081.2022121841

• Data Science and Technology •

基于Fisher score与模糊邻域熵的多标记特征选择算法

孙林1, 马天娇2, 薛占熬2,3

    1.天津科技大学 人工智能学院,天津 300457
    2.河南师范大学 计算机与信息工程学院,河南 新乡 453007
    3.智慧商务与物联网技术河南省工程实验室(河南师范大学),河南 新乡 453007
  • 收稿日期:2022-12-09 修回日期:2023-01-29 接受日期:2023-01-31 发布日期:2023-02-17 出版日期:2023-12-10
  • 通讯作者: 孙林
  • 作者简介:马天娇(1998—),女,河南信阳人,硕士研究生,主要研究方向:多标记学习
    薛占熬(1963—),男,河南三门峡人,教授,博士,CCF高级会员,主要研究方向:粒计算、三支决策。
  • 基金资助:
    国家自然科学基金资助项目(62076089)

Multilabel feature selection algorithm based on Fisher score and fuzzy neighborhood entropy

Lin SUN1, Tianjiao MA2, Zhan’ao XUE2,3

    1.College of Artificial Intelligence,Tianjin University of Science & Technology,Tianjin 300457,China
    2.College of Computer and Information Engineering,Henan Normal University,Xinxiang Henan 453007,China
    3.Engineering Lab of Intelligence Business & Internet of Things of Henan Province (Henan Normal University),Xinxiang Henan 453007,China
  • Received:2022-12-09 Revised:2023-01-29 Accepted:2023-01-31 Online:2023-02-17 Published:2023-12-10
  • Contact: Lin SUN
  • About author:MA Tianjiao, born in 1998, M. S. candidate. Her research interests include multilabel learning.
    XUE Zhan’ao, born in 1963, Ph. D., professor. His research interests include granular computing, three-way decision.
  • Supported by:
    National Natural Science Foundation of China(62076089)

摘要:

针对Fisher score未充分考虑特征与标记以及标记之间的相关性,以及一些邻域粗糙集模型容易忽略边界域中知识粒的不确定性,导致算法分类性能偏低等问题,提出一种基于Fisher score与模糊邻域熵的多标记特征选择算法(MLFSF)。首先,利用最大信息系数(MIC)衡量特征与标记之间的关联程度,构建特征与标记关系矩阵;基于修正余弦相似度定义标记关系矩阵,分析标记之间的相关性。其次,给出一种二阶策略获得多个二阶标记关系组,以此重新划分多标记论域;通过增强标记之间的强相关性和削弱标记之间的弱相关性得到每个特征的得分,进而改进Fisher score模型,对多标记数据进行预处理。再次,引入多标记分类间隔,定义自适应邻域半径和邻域类并构造了上、下近似集;在此基础上提出了多标记粗糙隶属度函数,将多标记邻域粗糙集映射到模糊集,基于多标记模糊邻域给出了上、下近似集以及多标记模糊邻域粗糙集模型,由此定义模糊邻域熵和多标记模糊邻域熵,有效度量边界域的不确定性。最后,设计基于二阶标记相关性的多标记Fisher score特征选择算法(MFSLC),从而构建MLFSF。在多标记K近邻(MLKNN)分类器下11个多标记数据集上的实验结果表明,相较于ReliefF多标记特征选择(MFSR)等6种先进算法,MLFSF的平均分类精度(AP)的均值提高了2.47~6.66个百分点;同时,在多数数据集上,MLFSF在5个评价指标上均能取得最优值。

关键词: 多标记学习, 特征选择, Fisher score, 多标记模糊邻域粗糙集, 模糊邻域熵

Abstract:

The Fisher score model does not fully consider feature-label and label-label correlations, and some neighborhood rough set models tend to neglect the uncertainty of knowledge granules in the boundary region; both issues lead to low classification performance. To address these problems, a MultiLabel feature selection algorithm based on Fisher Score and Fuzzy neighborhood entropy (MLFSF) was proposed. Firstly, the Maximum Information Coefficient (MIC) was used to measure the degree of association between features and labels, from which a feature-label relationship matrix was constructed; a label relationship matrix was defined based on the adjusted cosine similarity to analyze the correlation between labels. Secondly, a second-order strategy was given to obtain multiple second-order label relationship groups, which were used to repartition the multilabel universe. The score of each feature was obtained by enhancing the strong correlations between labels and weakening the weak ones, and the Fisher score model was thereby improved to preprocess the multilabel data. Thirdly, the multilabel classification margin was introduced to define the adaptive neighborhood radius and the neighborhood class, and the upper and lower approximation sets were constructed. On this basis, a multilabel rough membership function was presented, which maps the multilabel neighborhood rough set to a fuzzy set. Based on the multilabel fuzzy neighborhood, the upper and lower approximation sets and the multilabel fuzzy neighborhood rough set model were developed. The fuzzy neighborhood entropy and the multilabel fuzzy neighborhood entropy were then defined to effectively measure the uncertainty of the boundary region. Finally, the Multilabel Fisher Score-based feature selection algorithm with second-order Label Correlation (MFSLC) was designed, and MLFSF was constructed on top of it.
Experimental results on 11 multilabel datasets with the Multi-Label K-Nearest Neighbor (MLKNN) classifier show that, compared with six state-of-the-art algorithms including the Multilabel Feature Selection algorithm based on improved ReliefF (MFSR), MLFSF improves the mean of Average Precision (AP) by 2.47 to 6.66 percentage points; meanwhile, on most datasets, MLFSF obtains the optimal values for all five evaluation metrics.
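
As background for the Fisher score model that the abstract sets out to improve, the following is a minimal sketch of the classical single-label Fisher score. It is not the MFSLC or MLFSF procedure of the paper: it ignores label correlations, MIC, and the fuzzy neighborhood entropy entirely. The function name `fisher_score`, the variance floor `1e-12`, and the toy data are assumptions made for illustration. For each feature, the score is the class-size-weighted between-class scatter of the feature means divided by the class-size-weighted within-class variance.

```python
from collections import defaultdict


def fisher_score(X, y):
    """Classical single-label Fisher score for each feature (illustrative only).

    X: list of samples, each a list of numeric feature values.
    y: list of class labels, one per sample.
    Returns one score per feature; higher means more discriminative.
    """
    n_features = len(X[0])

    # Group the samples by class label.
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)

    scores = []
    for j in range(n_features):
        column = [row[j] for row in X]
        overall_mean = sum(column) / len(column)

        between = 0.0  # weighted scatter of class means around the overall mean
        within = 0.0   # weighted sum of per-class (population) variances
        for rows in groups.values():
            values = [r[j] for r in rows]
            n_c = len(values)
            mean_c = sum(values) / n_c
            var_c = sum((v - mean_c) ** 2 for v in values) / n_c
            between += n_c * (mean_c - overall_mean) ** 2
            within += n_c * var_c

        # Assumption: floor the denominator to avoid division by zero
        # when a feature is constant within every class.
        scores.append(between / max(within, 1e-12))
    return scores


# Toy data (hypothetical): feature 0 separates the two classes, feature 1 barely does.
X = [[0.0, 1.0], [0.2, 3.0], [1.0, 2.0], [1.2, 4.0]]
y = [0, 0, 1, 1]
scores = fisher_score(X, y)
ranking = sorted(range(len(scores)), key=lambda j: -scores[j])
```

In this toy example the first feature receives a much higher score than the second, so it is ranked first. A multilabel variant such as the one described in the abstract would additionally need a way to partition samples by label groups and to weight those groups by label correlation, which this sketch does not attempt.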

Key words: multilabel learning, feature selection, Fisher score, multilabel fuzzy neighborhood rough set, fuzzy neighborhood entropy

CLC number: