Journal of Computer Applications ›› 2021, Vol. 41 ›› Issue (9): 2578-2585. DOI: 10.11772/j.issn.1001-9081.2020111786

Special topic: Data Science and Technology


Differential privacy high-dimensional data publishing method via clustering analysis

CHEN Hengheng1,2, NI Zhiwei1,2, ZHU Xuhui1,2, JIN Yuanyuan1,2, CHEN Qian1,2

  1. School of Management, Hefei University of Technology, Hefei Anhui 230009, China;
  2. Key Laboratory of Process Optimization and Intelligent Decision-making, Ministry of Education (Hefei University of Technology), Hefei Anhui 230009, China
  • Received: 2020-11-16 Revised: 2021-01-11 Online: 2021-09-10 Published: 2021-05-08
  • Corresponding author: NI Zhiwei
  • About the authors: CHEN Hengheng (1998-), female, born in Shaoyang, Hunan, M.S. candidate; research interests: information security, data management. NI Zhiwei (1963-), male, born in Hefei, Anhui, professor, Ph.D.; research interests: artificial intelligence, machine learning, data management. ZHU Xuhui (1991-), male, born in Fuyang, Anhui, lecturer, Ph.D.; research interests: intelligent computing, machine learning. JIN Yuanyuan (1998-), female, born in Wuhu, Anhui, M.S. candidate; research interests: information security, data management. CHEN Qian (1991-), male, born in Hefei, Anhui, Ph.D. candidate; research interests: information security, data management.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (91546108, 61806068), the Anhui Provincial Science and Technology Major Project (201903a05020020), and the Anhui Provincial Natural Science Foundation (1908085QG298).

Abstract: Existing differentially private methods for publishing high-dimensional data struggle to account for both the complex correlations among attributes and the computational cost. To address this problem, a differentially private high-dimensional data publishing method based on clustering analysis, named PrivBC, is proposed. First, an attribute clustering method is designed based on K-means++: the maximal information coefficient is introduced to quantify the correlation between attributes, and highly correlated attributes are clustered together. Second, for each data subset produced by the clustering, a correlation matrix is calculated to reduce the candidate space of attribute pairs, and a Bayesian network satisfying differential privacy is constructed. Finally, each attribute is sampled according to the Bayesian networks, and a new private dataset is synthesized for publishing. Compared with the PrivBayes method, PrivBC reduces the misclassification rate and the running time by 12.6% and 30.2% on average, respectively. Experimental results show that the proposed method significantly improves computational efficiency while preserving data utility, and offers a new approach to the private publishing of high-dimensional data.

Key words: differential privacy, high-dimensional data, attribute clustering, Bayesian network, data publishing
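
The attribute-clustering step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PrivBC implementation: normalized mutual information on discrete columns stands in for the maximal information coefficient, the distance between two attributes is assumed to be one minus that score, and K-means++-style seeding is followed by a single nearest-center assignment instead of a full iterative clustering. The function names `normalized_mi` and `cluster_attributes` are hypothetical, and no differential privacy noise is added at this stage.

```python
import math
import random
from collections import Counter


def normalized_mi(xs, ys):
    """Normalized mutual information in [0, 1] between two discrete columns.

    Assumption: used here as a simple stand-in for the maximal
    information coefficient (MIC) named in the abstract.
    """
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = sum(c / n * math.log(c * n / (px[x] * py[y]))
             for (x, y), c in pxy.items())
    hx = -sum(c / n * math.log(c / n) for c in px.values())
    hy = -sum(c / n * math.log(c / n) for c in py.values())
    denom = min(hx, hy)
    if denom <= 0:
        return 0.0  # a constant column carries no information
    return max(0.0, min(1.0, mi / denom))


def cluster_attributes(columns, k, seed=0):
    """Group attribute names so that strongly dependent attributes share a cluster.

    columns: dict mapping attribute name -> list of discrete values.
    Returns a list of at most k lists of attribute names.
    """
    rng = random.Random(seed)
    names = list(columns)
    # Assumed distance: 1 - dependence score, and 0 from an attribute to itself.
    dist = {}
    for a in names:
        for b in names:
            dist[a, b] = 0.0 if a == b else 1.0 - normalized_mi(columns[a], columns[b])

    # K-means++-style seeding: the first center is uniform at random; each
    # later center is drawn with probability proportional to the squared
    # distance to the nearest center chosen so far.
    centers = [rng.choice(names)]
    while len(centers) < k:
        weights = [min(dist[a, c] for c in centers) ** 2 for a in names]
        if sum(weights) == 0:
            break  # every attribute coincides with an existing center
        centers.append(rng.choices(names, weights=weights)[0])

    # Single assignment pass: each attribute joins its nearest center.
    clusters = {c: [] for c in centers}
    for a in names:
        clusters[min(centers, key=lambda c: dist[a, c])].append(a)
    return list(clusters.values())


# Toy data: B is a relabeling of A, D is a relabeling of C,
# and A and C are statistically independent.
columns = {
    "A": [0, 0, 1, 1] * 25,
    "B": ["x", "x", "y", "y"] * 25,
    "C": [0, 1, 0, 1] * 25,
    "D": ["p", "q", "p", "q"] * 25,
}
print(sorted(sorted(c) for c in cluster_attributes(columns, k=2)))
```

In a pipeline like the one the abstract outlines, each resulting attribute group would then be handled separately when building the differentially private Bayesian network, which is where the reduction in candidate attribute pairs would come from.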
