
CCML2021+319: Hybrid K-Anonymous Feature Selection Algorithm

Yang Liu (杨柳), Li Yun (李云)

  1. Nanjing University of Posts and Telecommunications
  • Received: 2021-06-09 Revised: 2021-06-22 Online: 2021-06-22
  • Corresponding author: Yang Liu

Abstract: In the era of big data, protecting data privacy has become an issue that cannot be ignored. K-anonymity is a classic approach to privacy protection: it generalizes or suppresses data until the data set satisfies the K-anonymity condition. When the choice of which features to suppress takes both privacy and classification performance into account, the process can be viewed as a special form of feature selection, namely K-anonymous feature selection. K-anonymous feature selection combines the characteristics of K-anonymity and feature selection, using multiple evaluation criteria to select a K-anonymous feature subset. Filter-based K-anonymous feature selection methods struggle to search all candidate feature subsets that satisfy the K-anonymity condition, and therefore cannot guarantee that the selected subset has optimal classification performance, while wrapper-based feature selection methods are computationally expensive. This paper therefore combines filter-based feature ranking with wrapper-based feature selection, improves the forward search strategy used in existing methods, and designs a hybrid K-anonymous feature selection method that uses classification performance as the evaluation criterion to select the K-anonymous feature subset with the best classification performance. Experiments on multiple public data sets show that the proposed algorithm can outperform existing algorithms in classification performance while incurring less information loss.

Key words: hybrid, feature selection, privacy protection, K-anonymity, search strategy
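
To make the idea in the abstract concrete, the following is a minimal sketch of a filter-then-wrapper forward search under a K-anonymity constraint. It is not the authors' algorithm: the function names (`is_k_anonymous`, `group_majority_accuracy`, `hybrid_select`), the single-feature ranking used as the filter step, and the training-accuracy score used as a stand-in for a real classifier are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def is_k_anonymous(rows, features, k):
    """Return True if every value combination over `features` occurs at least k times."""
    counts = Counter(tuple(row[f] for f in features) for row in rows)
    return all(c >= k for c in counts.values())


def group_majority_accuracy(rows, labels, features):
    """Hypothetical wrapper score: training accuracy when each group of
    identical feature values is assigned its majority label."""
    groups = defaultdict(Counter)
    for row, y in zip(rows, labels):
        groups[tuple(row[f] for f in features)][y] += 1
    correct = sum(max(c.values()) for c in groups.values())
    return correct / len(rows)


def hybrid_select(rows, labels, k, score=group_majority_accuracy):
    """Filter step ranks features once; wrapper step adds them in that order,
    keeping a feature only if the subset stays K-anonymous and the score improves."""
    all_features = list(rows[0].keys())
    # Filter step (assumed): rank features by their individual score, best first.
    ranked = sorted(all_features,
                    key=lambda f: score(rows, labels, [f]),
                    reverse=True)
    selected = []
    best = score(rows, labels, [])  # baseline: predict the overall majority label
    for f in ranked:
        candidate = selected + [f]
        if not is_k_anonymous(rows, candidate, k):
            continue  # adding this feature would break K-anonymity
        s = score(rows, labels, candidate)
        if s > best:
            selected, best = candidate, s
    return selected, best
```

Under these assumptions, a call such as `hybrid_select(rows, labels, k=2)` on a list of dictionaries with categorical values returns the chosen feature names and their score. In practice the stand-in score would be replaced by cross-validated accuracy of an actual classifier, which is where the computational cost of the wrapper step arises.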

CLC number: