Hybrid K-anonymous feature selection algorithm

Journal of Computer Applications, 2021, Vol. 41, Issue 12: 3521-3526. DOI: 10.11772/j.issn.1001-9081.2021060980

The 18th China Conference on Machine Learning

Liu YANG¹,², Yun LI¹,²

1. Jiangsu Key Laboratory of Big Data Security and Intelligent Processing (Nanjing University of Posts and Telecommunications), Nanjing Jiangsu 210023, China
2. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing Jiangsu 210023, China

Received: 2021-05-12 Revised: 2021-06-22 Accepted: 2021-06-29 Online: 2021-12-28 Published: 2021-12-10
Contact: Yun LI
About author: YANG Liu, born in 1998, native of Anqing, Anhui, M.S. candidate. His research interests include pattern recognition and machine learning.
Supported by: the General Program of the National Natural Science Foundation of China (61772284)

Abstract:

K-anonymity algorithms bring data into compliance with the K-anonymity condition by generalizing and suppressing it. Because suppressing features has to account for both data privacy and classification performance, this process can be regarded as a special kind of feature selection, namely K-anonymous feature selection. K-anonymous feature selection combines the characteristics of K-anonymity and feature selection, using multiple evaluation criteria to select a K-anonymous feature subset. Filter-based K-anonymous feature selection struggles to search all candidate feature subsets that satisfy the K-anonymity condition, so the classification performance of the resulting subset cannot be guaranteed to be optimal, while wrapper feature selection is computationally expensive. Therefore, a hybrid K-anonymous feature selection algorithm was designed that combines filter-based feature ranking with wrapper feature selection: it improves the forward search strategy of existing methods and uses classification performance as the evaluation criterion to select the K-anonymous feature subset with the best classification performance. Experiments on multiple public datasets show that the proposed algorithm can outperform existing algorithms in classification performance while incurring less information loss.

Key words: hybrid, filter feature ranking, wrapper feature selection, feature selection, privacy protection, K-anonymity, forward search strategy
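
The hybrid idea summarized in the abstract can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' procedure: the filter score, the wrapper criterion (training accuracy of a majority-vote classifier), the greedy acceptance rule, and the toy data `ROWS` and `LABELS` are all hypothetical choices made for the example.

```python
from collections import Counter


def is_k_anonymous(rows, features, k):
    """True if every value combination on `features` occurs at least k times."""
    groups = Counter(tuple(row[f] for f in features) for row in rows)
    return all(count >= k for count in groups.values())


def group_accuracy(rows, labels, features):
    """Assumed wrapper criterion: training accuracy of a majority-vote
    classifier over the groups induced by `features`."""
    votes = {}
    for row, y in zip(rows, labels):
        key = tuple(row[f] for f in features)
        votes.setdefault(key, Counter())[y] += 1
    correct = sum(c.most_common(1)[0][1] for c in votes.values())
    return correct / len(rows)


def hybrid_k_anonymous_selection(rows, labels, k):
    """Filter stage ranks features; wrapper stage runs a forward search that
    only accepts K-anonymous subsets which improve the classification score."""
    n_features = len(rows[0])
    # Filter stage (assumed score): single-feature majority-vote accuracy.
    ranking = sorted(range(n_features),
                     key=lambda f: group_accuracy(rows, labels, [f]),
                     reverse=True)
    selected = []
    best = group_accuracy(rows, labels, [])  # baseline: predict majority class
    # Wrapper stage: try features in ranked order.
    for f in ranking:
        candidate = selected + [f]
        if not is_k_anonymous(rows, candidate, k):
            continue  # adding f would violate the privacy constraint
        score = group_accuracy(rows, labels, candidate)
        if score > best:
            selected, best = candidate, score
    return selected, best


# Hypothetical toy data: feature 2 behaves like a unique identifier.
ROWS = [(0, 0, 0), (0, 0, 1), (0, 1, 2), (0, 1, 3),
        (1, 0, 4), (1, 0, 5), (1, 0, 6),
        (1, 1, 7), (1, 1, 8), (1, 1, 9)]
LABELS = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]

if __name__ == "__main__":
    print(hybrid_k_anonymous_selection(ROWS, LABELS, k=2))
```

In this toy run the identifier-like feature ranks first in the filter stage but is rejected because it breaks 2-anonymity, and the search settles on features 0 and 1. A realistic wrapper would use held-out or cross-validated accuracy rather than training accuracy.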

CLC Number: TP181

Liu YANG, Yun LI. Hybrid K-anonymous feature selection algorithm[J]. Journal of Computer Applications, 2021, 41(12): 3521-3526.

Id	Age	Disease
12138	22	Flu
12139	25	Flu
12140	27	Cancer
12209	49	Respiratory
12257	95	Cancer
12356	33	Respiratory
12306	35	Flu

Id	Age	Disease
121**	2*	Flu
121**	2*	Flu
121**	2*	Cancer
122**	≥40	Respiratory
122**	≥40	Cancer
123**	3*	Respiratory
123**	3*	Flu
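
The relationship between the raw and generalized tables can be checked with a short Python sketch. It assumes that Id and Age act as the quasi-identifiers and that the generalization rules are the ones the tables suggest (mask the last two digits of Id, bucket Age by decade, merge ages of 40 and above); the names `generalize` and `anonymity_level` are hypothetical and chosen only for this illustration.

```python
from collections import Counter

# Records copied from the raw table: (Id, Age, Disease).
RAW = [
    ("12138", 22, "Flu"),
    ("12139", 25, "Flu"),
    ("12140", 27, "Cancer"),
    ("12209", 49, "Respiratory"),
    ("12257", 95, "Cancer"),
    ("12356", 33, "Respiratory"),
    ("12306", 35, "Flu"),
]


def generalize(record):
    """Assumed generalization: keep the first three Id digits and
    replace Age by a decade band, with one band for 40 and above."""
    ident, age, disease = record
    masked_id = ident[:3] + "**"
    age_band = "≥40" if age >= 40 else f"{age // 10}*"
    return masked_id, age_band, disease


def anonymity_level(records):
    """Size of the smallest group sharing the same (Id, Age) values,
    i.e. the largest k for which the table is k-anonymous."""
    groups = Counter((ident, age) for ident, age, _ in records)
    return min(groups.values())


GENERALIZED = [generalize(r) for r in RAW]

if __name__ == "__main__":
    print(anonymity_level(RAW))          # every raw record is unique
    print(anonymity_level(GENERALIZED))  # smallest group after generalization
```

Under these assumptions the raw table is only 1-anonymous, while the generalized table forms groups of sizes 3, 2 and 2 and is therefore 2-anonymous.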

X₁	X₃	X₄	X₅	Y
0	1	0	1	+1
1	1	1	0	-1
0	1	1	0	+1
1	1	0	1	-1
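
As a rough illustration of why searching every candidate subset is costly, the Python sketch below enumerates all non-empty feature subsets of the binary table above and keeps those that satisfy K-anonymity. The role of this table in the paper is not stated on this page, so it is treated purely as toy data; the function names and the choice k = 2 are assumptions.

```python
from collections import Counter
from itertools import combinations

FEATURES = ("X1", "X3", "X4", "X5")
# Feature columns of the table above (the label Y is not a quasi-identifier here).
ROWS = [
    (0, 1, 0, 1),
    (1, 1, 1, 0),
    (0, 1, 1, 0),
    (1, 1, 0, 1),
]


def min_group_size(rows, columns):
    """Smallest number of rows sharing the same values on `columns`."""
    groups = Counter(tuple(row[c] for c in columns) for row in rows)
    return min(groups.values())


def k_anonymous_subsets(rows, names, k):
    """Exhaustively list the feature subsets that keep the table k-anonymous.
    The loop visits 2**n - 1 subsets, which is what makes full search costly."""
    result = []
    for size in range(1, len(names) + 1):
        for columns in combinations(range(len(names)), size):
            if min_group_size(rows, columns) >= k:
                result.append(tuple(names[c] for c in columns))
    return result


if __name__ == "__main__":
    for subset in k_anonymous_subsets(ROWS, FEATURES, k=2):
        print(subset)
```

With k = 2, 9 of the 15 non-empty subsets qualify on this toy data; for example, X₁ together with X₄ makes every row unique and is rejected, while X₃, X₄ and X₅ together remain 2-anonymous.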

Dataset	Number of features	Number of instances
Breast Cancer	13	699
Adult	14	48 842
Spambase	57	4 601
Heart Disease	75	303
Pima Indians Diabetes	8	769
HCV	29	1 385
