Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (12): 3813-3821. DOI: 10.11772/j.issn.1001-9081.2021101724

• Cyber security

K-Prototypes clustering method for local differential privacy

Guopeng ZHANG1,2,3, Xuebin CHEN1,2,3, Haoshi WANG1,2,3, Ran ZHAI1,2,3, Zheng MA1,2,3

    1.College of Science,North China University of Science and Technology,Tangshan Hebei 063210,China
    2.Hebei Key Laboratory of Data Science and Application (North China University of Science and Technology),Tangshan Hebei 063210,China
    3.Tangshan Key Laboratory of Data Science (North China University of Science and Technology),Tangshan Hebei 063210,China
  • Received:2021-10-08 Revised:2021-12-27 Accepted:2022-01-05 Online:2022-01-24 Published:2022-12-10
  • Contact: Xuebin CHEN
  • About author:ZHANG Guopeng, born in 1996, M. S. candidate. His research interests include data security and privacy protection.
    WANG Haoshi, born in 1996, M. S. candidate. His research interests include data security and privacy protection.
    ZHAI Ran, born in 1998, M. S. candidate. Her research interests include data security and federated learning.
    MA Zheng, born in 1997, M. S. candidate. His research interests include network security and privacy protection.
  • Supported by:
    National Natural Science Foundation of China(U20A20179)

面向本地差分隐私的K-Prototypes聚类方法

张国鹏1,2,3, 陈学斌1,2,3, 王豪石1,2,3, 翟冉1,2,3, 马征1,2,3

    1.华北理工大学 理学院, 河北 唐山 063210
    2.河北省数据科学与应用重点实验室(华北理工大学), 河北 唐山 063210
    3.唐山市数据科学重点实验室(华北理工大学), 河北 唐山 063210
  • 通讯作者: 陈学斌
  • 作者简介:张国鹏(1996—),男,甘肃武威人,硕士研究生,CCF会员,主要研究方向:数据安全、隐私保护
    王豪石(1996—),男,河北邢台人,硕士研究生,CCF会员,主要研究方向:数据安全、隐私保护
    翟冉(1998—),女,河北唐山人,硕士研究生,CCF会员,主要研究方向:数据安全、联邦学习
    马征(1997—),男,河北唐山人,硕士研究生,CCF会员,主要研究方向:网络安全、隐私保护。
  • 基金资助:
    国家自然科学基金资助项目(U20A20179)

Abstract:

To protect data privacy while preserving data utility in clustering analysis, a privacy-preserving clustering scheme based on Local Differential Privacy (LDP), called LDPK-Prototypes (LDP K-Prototypes), was proposed. Firstly, the mixed-type dataset was encoded by the users. Then, a randomized response mechanism was used to perturb the sensitive data, and after collecting the users' perturbed data, the third party recovered the original dataset to the greatest extent possible. After that, the K-Prototypes clustering algorithm was run on the recovered data. During clustering, the initial cluster centers were determined by a dissimilarity measure, and the distance formula was redefined using the entropy weight method. Theoretical analysis and experimental results show that, compared with the ODPC (Optimizing and Differentially Private Clustering) algorithm based on Centralized Differential Privacy (CDP), the proposed scheme improves the average accuracy on the Adult and Heart datasets by 2.95% and 12.41% respectively, effectively improving clustering utility. Meanwhile, LDPK-Prototypes enlarges the differences between data points, effectively avoids local optima, and improves the stability of the clustering algorithm.
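
The perturb-then-recover step described in the abstract can be illustrated with a minimal sketch. It assumes a k-ary (generalized) randomized response over a single categorical attribute; the abstract does not specify the paper's actual encoding or recovery procedure, so the function names `perturb` and `estimate_frequencies`, the privacy budget of 1.0, and the toy data are all illustrative assumptions. The sketch recovers value frequencies on the collector side, not individual records.

```python
import math
import random
from collections import Counter


def perturb(value, domain, epsilon, rng):
    """User side (assumed k-ary randomized response): keep the true value
    with probability p, otherwise report one of the other k - 1 values
    uniformly at random."""
    k = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p:
        return value
    others = [v for v in domain if v != value]
    return rng.choice(others)


def estimate_frequencies(reports, domain, epsilon):
    """Collector side: unbiased estimate of each value's true frequency.
    The observed frequency satisfies E[obs] = f * p + (1 - f) * q,
    so f is estimated as (obs - q) / (p - q)."""
    k = len(domain)
    n = len(reports)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    q = (1 - p) / (k - 1)
    counts = Counter(reports)
    return {v: (counts[v] / n - q) / (p - q) for v in domain}


# Toy data (hypothetical): 60% "A", 30% "B", 10% "C".
rng = random.Random(0)
domain = ["A", "B", "C"]
true_values = ["A"] * 6000 + ["B"] * 3000 + ["C"] * 1000
reports = [perturb(v, domain, 1.0, rng) for v in true_values]
estimates = estimate_frequencies(reports, domain, 1.0)
```

A smaller `epsilon` gives stronger privacy but noisier estimates, which is the privacy-utility trade-off the scheme has to manage before clustering.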

Key words: Local Differential Privacy (LDP), K-Prototypes, randomized response mechanism, entropy weight method, privacy protection

摘要:

为了在聚类分析中保护数据隐私的同时确保数据的可用性,提出一种基于本地化差分隐私(LDP)技术的隐私保护聚类方案——LDPK-Prototypes。首先,用户对混合型数据集进行编码;其次,采用随机响应机制对敏感数据进行扰动,而第三方在收集到用户的扰动数据后以最大限度恢复原始数据集;然后,执行K-Prototypes聚类算法,在聚类过程中,使用相异性度量方法确定初始聚类中心,并利用熵权法重新定义新的距离计算公式。理论分析和实验结果表明,所提方案与基于中心化差分隐私(CDP)技术的ODPC算法相比,在Adult和Heart数据集上的平均准确率分别提高了2.95%和12.41%,有效提高了聚类的可用性。同时,LDPK-Prototypes扩大了数据之间的差异性,有效避免了局部最优,提高了聚类算法的稳定性。

关键词: 本地化差分隐私, K-Prototypes, 随机响应机制, 熵权法, 隐私保护

CLC Number: