Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (1): 217-222.DOI: 10.11772/j.issn.1001-9081.2023010019

• Cyber security • Previous Articles    

Differential privacy clustering algorithm in horizontal federated learning

Xueran XU1(), Geng YANG1,2, Yuxian HUANG1   

  1. 1.School of Computer Science,Nanjing University of Posts and Telecommunications,Nanjing Jiangsu 210023,China
    2.Jiangsu Key Laboratory of Big Data Security and Intelligent Processing,Nanjing Jiangsu 210023,China
  • Received:2023-01-09 Revised:2023-04-13 Accepted:2023-04-14 Online:2023-06-06 Published:2024-01-10
  • Contact: Xueran XU
  • About author:YANG Geng, born in 1961, Ph. D., professor. His research interests include artificial intelligence security, data privacy protection, cloud computing security.
    HUANG Yuxian, born in 1998, Ph. D. His research interests include differential privacy, network and information security, federated learning.
  • Supported by:
    National Natural Science Foundation of China(61872197)

横向联邦学习中差分隐私聚类算法

徐雪冉1(), 杨庚1,2, 黄喻先1   

  1. 1.南京邮电大学 计算机学院、软件学院、网络空间安全学院, 南京 210023
    2.江苏省大数据安全与智能处理重点实验室, 南京 210023
  • 通讯作者: 徐雪冉
  • 作者简介:杨庚(1961—),男,江苏建湖人,教授,博士生导师,博士,主要研究方向:人工智能安全、数据隐私保护、云计算与安全;
    黄喻先(1998—),男,江苏南通人,博士,主要研究方向:差分隐私、网络与信息安全、联邦学习。
    第一联系人:徐雪冉(2000—),女,江苏徐州人,硕士研究生,主要研究方向:网络与信息安全、隐私保护;
  • 基金资助:
    国家自然科学基金资助项目(61872197)

Abstract:

Clustering analysis can uncover hidden interconnections between data and segment the data according to multiple indicators, which can facilitate personalized and refined operations. However, data fragmentation and isolation caused by data islands seriously affects the effectiveness of cluster analysis applications. To solve data island problem and protect data privacy, an Equivalent Local differential privacy Federated K-means (ELFedKmeans) algorithm was proposed. A grid-based initial cluster center selection method and a privacy budget allocation scheme were designed for the horizontal federation learning model. To generate same random noise with lower communication cost, all organizations jointly negotiated random seeds, protecting local data privacy. The ELFedKmeans algorithm was demonstrated satisfying differential privacy protection through theoretical analysis, and it was also compared with Local Differential Privacy distributed K-means (LDPKmeans) algorithm and Hybrid Privacy K-means (HPKmeans) algorithm on different datasets. Experimental results show that all three algorithms increase F-measure and decrease SSE (Sum of Squares due to Error) gradually as privacy budget increases. As a whole, the F-measure values of ELFedKmeans algorithm was 1.794 5% to 57.066 3% and 21.245 2% to 132.048 8% higher than those of LDPKmeans and HPKmeans algorithms respectively; the Log(SSE) values of ELFedKmeans algorithm were 1.204 2% to 12.894 6% and 5.617 5% to 27.575 2% less than those of LDPKmeans and HPKmeans algorithms respectively. With the same privacy budget, ELFedKmeans algorithm outperforms the comparison algorithms in terms of clustering quality and utility metric.

Key words: horizontal federated clustering, differential privacy, local disturbance, utility, K-means algorithm

摘要:

聚类分析能够挖掘出数据间隐藏的内在联系并对数据进行多指标划分,从而促进个性化和精细化运营。然而,数据孤岛造成的数据碎片化和孤立化严重影响了聚类分析的应用效果。为了解决数据孤岛问题的同时保护相关数据隐私,提出本地均分扰动联邦K-means算法(ELFedKmeans)。针对横向联邦学习模式,设计了一种基于网格的初始簇心选择方法和一种隐私预算分配方案。在ELFedKmeans算法中,各站点联合协商随机种子,以较小的通信代价生成相同的随机噪声,保护了本地数据的隐私。通过理论分析证明了该算法满足差分隐私保护,并将该算法与本地差分隐私K-means(LDPKmeans)算法和混合型隐私保护K-means (HPKmeans)算法在不同的数据集上进行了对比实验分析。实验结果表明,随着隐私预算不断增大,三个算法的F-measure值均逐渐升高;误差平方和(SSE)均逐渐减小。从整体上看,ELFedKmeans算法的F-measure值比LDPKmeans算法和HPKmeans算法分别高了1.794 5%~57.066 3%和21.245 2%~132.048 8%;ELFedKmeans算法的Log(SSE)值比LDPKmeans算法和HPKmeans算法分别减少了1.204 2%~12.894 6%和5.617 5%~27.575 2%。在相同的隐私预算下,ELFedKmeans算法在聚类质量和可用性指标上优于对比算法。

关键词: 横向联邦聚类, 差分隐私, 本地扰动, 可用性, K-means算法

CLC Number: