Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (12): 3521-3526. DOI: 10.11772/j.issn.1001-9081.2021060980

• The 18th China Conference on Machine Learning

Hybrid K-anonymous feature selection algorithm

Liu YANG1,2, Yun LI1,2

  1. Jiangsu Key Laboratory of Big Data Security and Intelligent Processing (Nanjing University of Posts and Telecommunications), Nanjing Jiangsu 210023, China
  2. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing Jiangsu 210023, China
  • Received:2021-05-12 Revised:2021-06-22 Accepted:2021-06-29 Online:2021-12-28 Published:2021-12-10
  • Contact: Yun LI
  • About author: YANG Liu, born in 1998 in Anqing, Anhui, M. S. candidate. His research interests include pattern recognition and machine learning.
  • Supported by:
    the General Program of the National Natural Science Foundation of China (61772284)

Abstract:

K-anonymity algorithms bring data into compliance with the K-anonymity condition by generalizing and suppressing values. Because suppressing features must account for both data privacy and classification performance, the process can be viewed as a special form of feature selection, called K-anonymous feature selection. K-anonymous feature selection methods combine the characteristics of K-anonymity and feature selection, using multiple evaluation criteria to select a K-anonymous feature subset. Filter-based K-anonymous feature selection has difficulty searching all candidate feature subsets that satisfy the K-anonymity condition, so the classification performance of the resulting subset cannot be guaranteed to be optimal, while wrapper feature selection is computationally expensive. To address both problems, a hybrid K-anonymous feature selection algorithm was designed that combines filter-based feature ranking with wrapper feature selection. It improves the forward search strategy of existing methods and uses classification performance as the evaluation criterion to select the K-anonymous feature subset with the best classification performance. Experiments on multiple public datasets show that the proposed algorithm can outperform existing algorithms in classification performance while incurring less information loss.

Key words: hybrid, filter feature ranking, wrapper feature selection, feature selection, privacy protection, K-anonymity, forward search strategy
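
The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the ranking criterion, the classifier, or the details of the improved forward search, so information gain, a majority-vote lookup classifier, and a plain greedy forward pass are assumed here as stand-ins. Only suppression (dropping whole features) is modeled, not generalization, and every function name is hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def is_k_anonymous(rows, features, k):
    """True if every value combination on `features` occurs in at least k rows."""
    groups = Counter(tuple(row[f] for f in features) for row in rows)
    return all(count >= k for count in groups.values())


def entropy(labels):
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Assumed filter score: label entropy removed by splitting on one feature."""
    by_value = defaultdict(list)
    for row, label in zip(rows, labels):
        by_value[row[feature]].append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in by_value.values())
    return entropy(labels) - remainder


def holdout_accuracy(train, valid, features):
    """Assumed wrapper score: held-out accuracy of a majority-vote lookup table."""
    (train_rows, train_labels), (valid_rows, valid_labels) = train, valid
    fallback = Counter(train_labels).most_common(1)[0][0]
    votes = defaultdict(Counter)
    for row, label in zip(train_rows, train_labels):
        votes[tuple(row[f] for f in features)][label] += 1
    hits = 0
    for row, label in zip(valid_rows, valid_labels):
        key = tuple(row[f] for f in features)
        predicted = votes[key].most_common(1)[0][0] if key in votes else fallback
        hits += predicted == label
    return hits / len(valid_labels)


def hybrid_k_anonymous_selection(train, valid, k):
    """Filter step ranks features; wrapper step adds them greedily in that order."""
    train_rows, train_labels = train
    all_rows = train_rows + valid[0]
    ranked = sorted(range(len(train_rows[0])),
                    key=lambda f: information_gain(train_rows, train_labels, f),
                    reverse=True)
    selected = []
    best = holdout_accuracy(train, valid, selected)
    for f in ranked:
        candidate = selected + [f]
        if not is_k_anonymous(all_rows, candidate, k):
            continue  # adding this feature would break K-anonymity: suppress it
        score = holdout_accuracy(train, valid, candidate)
        if score > best:  # keep the feature only if classification improves
            selected, best = candidate, score
    return selected, best


# Toy data: feature 0 is predictive, feature 1 is ID-like, feature 2 is noise.
train = ([("a", 1, "x"), ("a", 2, "y"), ("a", 3, "x"),
          ("b", 4, "y"), ("b", 5, "x"), ("b", 6, "y")],
         [1, 1, 1, 0, 0, 0])
valid = ([("a", 7, "y"), ("b", 8, "x"), ("a", 9, "x"), ("b", 10, "y")],
         [1, 0, 1, 0])
selected, best = hybrid_k_anonymous_selection(train, valid, k=2)
print(selected, best)
```

On this toy data the ID-like feature is rejected by the K-anonymity check, and the noise feature is rejected because it does not improve held-out accuracy. Ranking first means the wrapper evaluates at most one candidate subset per feature, which is the cost saving a hybrid scheme aims for compared with an exhaustive wrapper search.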

CLC Number: