PrivSPM: frequent sequential pattern mining algorithm under local differential privacy

Journal of Computer Applications, 2023, Vol. 43, Issue (7): 2057-2064. DOI: 10.11772/j.issn.1001-9081.2022091365

39th CCF National Database Conference (NDBC 2022)

Shuo HUANG, Yanhui LI, Jianqiu CAO

School of Information Science and Engineering, Chongqing Jiaotong University, Chongqing 400074, China

Received: 2022-09-12 Revised: 2022-11-15 Accepted: 2022-11-21 Online: 2023-07-20 Published: 2023-07-10
Contact: Yanhui LI
About the authors: HUANG Shuo, born in 1998 in Luohe, Henan, M.S. candidate. His research interests include data privacy and differential privacy.
LI Yanhui, born in 1989 in Qiqihar, Heilongjiang, Ph.D., lecturer. Her research interests include data privacy, differential privacy, and big data analysis.
CAO Jianqiu, born in 1967 in Yiyang, Hunan, M.S., professor. His research interests include graphics and image processing, information visualization, traffic informatization, and intelligent control.
Supported by:
National Natural Science Foundation of China (62002036); Opening Project of Shanghai Key Laboratory of Integrated Administration Technologies for Information Security (AGK2020006); Natural Science Foundation of Chongqing (cstc2021jcyj-msxmX0859); Science and Technology Research Program of Chongqing Municipal Education Commission (KJQN202000707)

Abstract

Sequential data may contain a large amount of sensitive information, so mining frequent patterns directly from such data risks disclosing users' private information. Local Differential Privacy (LDP) resists attackers with arbitrary background knowledge and can therefore protect sensitive information more comprehensively. However, the inherent sequentiality and high dimensionality of sequential data make it challenging to apply LDP to frequent sequential pattern mining. To address this problem, a top-k frequent sequential pattern mining algorithm satisfying ε-LDP, called PrivSPM, is proposed. The algorithm combines padding and sampling, an adaptive frequency estimation algorithm, and a frequent item prediction technique to construct a candidate set. Based on the resulting new domain, an exponential-mechanism-based strategy perturbs the user data, and a frequency estimation algorithm is then used to identify the final frequent sequential patterns. Theoretical analysis proves that the algorithm satisfies ε-LDP. Experimental results on three real datasets show that PrivSPM achieves clearly higher True Positive Rate (TPR) and Normalized Cumulative Rank (NCR) than the comparison algorithm, and effectively improves the accuracy of the mining results.

Key words: Local Differential Privacy (LDP), privacy protection, frequent sequential pattern mining, exponential mechanism, data mining
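The abstract names an exponential-mechanism-based perturbation step but this page gives no further detail. As a rough illustration only, the Python sketch below shows one generic way a user could report a single candidate pattern under ε-LDP with the exponential mechanism. The 0/1 subsequence score, the function names, and the toy candidate list are assumptions made for this example and are not taken from PrivSPM.

```python
import math
import random


def is_subsequence(pattern, sequence):
    """Return True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def em_probabilities(sequence, candidates, epsilon):
    """Output distribution of the exponential mechanism over `candidates`.

    Assumed score: 1 if the candidate is a subsequence of the user's
    sequence, else 0, so the score range (sensitivity) is 1.
    """
    scores = [1.0 if is_subsequence(c, sequence) else 0.0 for c in candidates]
    weights = [math.exp(epsilon * s / 2.0) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def em_perturb(sequence, candidates, epsilon, rng=random):
    """Sample the single candidate that the user reports to the collector."""
    probs = em_probabilities(sequence, candidates, epsilon)
    return rng.choices(candidates, weights=probs, k=1)[0]


if __name__ == "__main__":
    # Hypothetical candidate domain and user sequence.
    candidates = [("a", "b"), ("b", "c"), ("c", "a")]
    print(em_perturb(("a", "b", "c"), candidates, epsilon=1.0))
```

Because the assumed score lies in [0, 1], replacing the user's sequence with any other sequence changes each output probability by a factor of at most e^ε, which is the ε-LDP condition.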

CLC number: TP311.13

Shuo HUANG, Yanhui LI, Jianqiu CAO. PrivSPM: frequent sequential pattern mining algorithm under local differential privacy[J]. Journal of Computer Applications, 2023, 43(7): 2057-2064.

Figures/Tables (7)

Tab. 1 Description of three datasets

Dataset	$|DB|$	$|X|$	Avg($s$)
kosarak	990 002	41 270	8.1
MSNBC	989 818	17	4.7
retail	88 162	16 479	11.3


Fig. 1 Candidate item identification on kosarak dataset

Fig. 2 Candidate item identification on retail dataset

Fig. 3 Changes in TPR when k=30

Fig. 4 Changes in NCR when k=30

Fig. 5 Changes in Err when k=30

Fig. 6 Influence of k on utility on kosarak dataset
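The figures above report TPR and NCR, which this page does not define. As a hedged sketch, the Python below computes both under definitions commonly used in the LDP heavy hitter literature: TPR as the fraction of the true top-k patterns that were mined, and NCR as a rank-weighted version in which the i-th true pattern (counting from 1) is worth k - i + 1 points. The paper's exact definitions may differ, and the function names and toy data are assumptions.

```python
def true_positive_rate(true_topk, mined_topk):
    """Fraction of the true top-k patterns that appear in the mined result."""
    truth = set(true_topk)
    return len(truth & set(mined_topk)) / len(truth)


def normalized_cumulative_rank(true_topk, mined_topk):
    """Rank-weighted recall; `true_topk` must be ordered most frequent first."""
    k = len(true_topk)
    # The most frequent true pattern scores k, the next k - 1, and so on.
    quality = {p: k - i for i, p in enumerate(true_topk)}
    gained = sum(quality.get(p, 0) for p in set(mined_topk))
    return gained / (k * (k + 1) / 2)


if __name__ == "__main__":
    truth = ["a", "b", "c", "d"]   # hypothetical true top-4, ordered by frequency
    mined = ["a", "c", "x", "y"]   # hypothetical mined top-4
    print(true_positive_rate(truth, mined))
    print(normalized_cumulative_rank(truth, mined))
```

Under these assumed definitions, NCR rewards recovering the highest-ranked patterns more than TPR does, so the two metrics can diverge when only low-ranked patterns are missed.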

