Journal of Computer Applications ›› 2020, Vol. 40 ›› Issue (3): 626-630. DOI: 10.11772/j.issn.1001-9081.2019071200

• Artificial Intelligence •

Interference entropy feature selection method for two-class distinguishing ability

ZENG Yuanpeng1,2, WANG Kaijun1, LIN Song2

  1. College of Mathematics and Informatics, Fujian Normal University, Fuzhou Fujian 350007, China;
    2. Digital Fujian Internet-of-Things Laboratory of Environmental Monitoring, Fujian Normal University, Fuzhou Fujian 350007, China
  • Received: 2019-07-10  Revised: 2019-09-23  Online: 2020-03-10  Published: 2019-10-25
  • Corresponding author: WANG Kaijun
  • About the authors: ZENG Yuanpeng (1995-), male, born in Zhangzhou, Fujian, M. S. candidate, research interests: machine learning and data mining; WANG Kaijun (1965-), male, born in Fuzhou, Fujian, associate professor, Ph. D., research interests: machine learning and data mining; LIN Song (1979-), male, born in Fuzhou, Fujian, professor, Ph. D., research interests: quantum cryptography and quantum intelligent computing.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61672157, 61772134), the Natural Science Foundation of Fujian Province (2018J01778), and the China Postdoctoral Science Foundation (2016M600494).

Abstract: Existing feature selection methods lack the ability to measure how strongly different classes of data overlap or separate. To address this, an Interference Entropy of Two-Class Distinguishing (IET-CD) method was proposed to evaluate the two-class distinguishing ability of features. For a feature whose samples come from two classes (positive and negative), firstly, the mixed conditional probability of the negative-class samples falling within the range of the positive-class data and the probability of the negative-class samples belonging to the positive class were calculated; then, the confusion probability was computed from the mixed conditional probability and the attribution probability, and the positive interference entropy was calculated from the confusion probability; the negative interference entropy was calculated in the same way; finally, the sum of the positive and negative interference entropies was taken as the two-class interference entropy of the feature. The interference entropy evaluates how well a feature distinguishes the two classes of samples: the smaller the interference entropy of a feature, the stronger its two-class distinguishing ability. On three UCI datasets and one simulated gene expression dataset, each method selected its five best features, and the two-class distinguishing abilities of these features were compared in order to compare the performance of the methods. The experimental results show that the proposed method is comparable to or better than the Neighborhood Entropy Feature Selection (NEFS) method, and outperforms the Single-indexed Neighborhood Entropy Feature Selection (SNEFS) method, feature selection based on Max-Relevance and Min-Redundancy (MRMR), Joint Mutual Information (JMI) and the Relief method in most cases. Therefore, the IET-CD method can effectively select features with better two-class distinguishing ability.
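The abstract specifies the IET-CD pipeline (per-class confusion probabilities, one interference entropy per class, and their sum as the feature score) but not the exact probability formulas, which appear only in the full paper. Below is a minimal sketch in Python of how such a per-feature score could be computed; the function name interference_entropy and the range-depth approximation of the confusion probability are assumptions for illustration, not the authors' exact definitions.

```python
import numpy as np

def interference_entropy(feature, labels, pos=1, neg=0):
    """Illustrative IET-CD-style score for a single feature (lower = better).

    NOTE: the exact definitions of the mixed conditional probability,
    attribution probability and confusion probability are given in the
    full paper; here the confusion probability of each opposite-class
    sample is approximated by its relative depth inside the other
    class's value range. This is an assumption for illustration only.
    """
    x_pos = feature[labels == pos]
    x_neg = feature[labels == neg]

    def one_side(inside, other):
        # Opposite-class samples that fall within [min, max] of this class
        lo, hi = inside.min(), inside.max()
        intruders = other[(other >= lo) & (other <= hi)]
        if intruders.size == 0:
            return 0.0  # no overlap on this side -> no interference
        # Assumed confusion probability: depth inside the class range, in (0, 1]
        center, half = (lo + hi) / 2.0, (hi - lo) / 2.0 + 1e-12
        p = np.clip(1.0 - np.abs(intruders - center) / half, 1e-12, 1.0 - 1e-12)
        # Binary-entropy contribution of each confusable sample
        h = -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))
        return float(h.sum()) / other.size

    # Two-class interference entropy: positive-side + negative-side terms
    return one_side(x_pos, x_neg) + one_side(x_neg, x_pos)

# Example ranking: X is an (n_samples, n_features) array, y holds 0/1 labels.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = np.column_stack([rng.normal(y * d, 1.0) for d in (3.0, 1.0, 0.0)])
scores = [interference_entropy(X[:, j], y) for j in range(X.shape[1])]
print(np.argsort(scores))  # well-separated features come first
```

Under these assumptions, features are ranked by ascending score and the smallest values identify the strongest two-class distinguishing features, matching the five-best-feature selection protocol described in the abstract.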

Key words: feature selection, two-class distinguishing ability, conditional probability, interference entropy
