《计算机应用》唯一官方网站 ›› 2023, Vol. 43 ›› Issue (2): 413-422.DOI: 10.11772/j.issn.1001-9081.2021122161
所属专题: 数据科学与技术
收稿日期:
2021-12-27
修回日期:
2022-02-19
接受日期:
2022-02-23
发布日期:
2022-05-16
出版日期:
2023-02-10
通讯作者:
尹春勇
作者简介:
李荧(1995—),女,江苏淮安人,硕士研究生,主要研究方向:数据挖掘、隐私保护。
基金资助:
Received:
2021-12-27
Revised:
2022-02-19
Accepted:
2022-02-23
Online:
2022-05-16
Published:
2023-02-10
Contact:
Chunyong YIN
About author:
LI Ying, born in 1995, M. S. candidate. Her research interests include data mining, privacy preserving.
Supported by:
摘要:
针对隐私保护效用挖掘(PPUM)中脱敏时间长、计算复杂度高,以及算法副作用大等问题,提出一种基于BCU-Tree和字典(BCUTD)的高效用挖掘快速脱敏算法。该算法提出了一种新的树结构BCU-Tree来存储敏感项信息,基于按位运算符编码模型降低树的构建时间并减小搜索空间。采用字典表存储树结构中的所有节点,修改敏感项时只需访问字典表,最终达到数据库脱敏目的。在4个不同的数据集上进行的实验中,BCUTD算法在脱敏时间和副作用上的表现要明显优于经典的优先隐藏高效用项(HHUIF)算法、最大敏感效用-最大项效用(MSU-MAU)算法和使用树与表结构的快速扰动(FPUTT)算法。实验结果表明,BCUTD算法能够有效减少脱敏时间,降低算法副作用以及计算复杂度。
中图分类号:
尹春勇, 李荧. 基于BCU-Tree与字典的高效用挖掘快速脱敏算法[J]. 计算机应用, 2023, 43(2): 413-422.
Chunyong YIN, Ying LI. Fast sanitization algorithm based on BCU-Tree and dictionary for high-utility mining[J]. Journal of Computer Applications, 2023, 43(2): 413-422.
TID | 事务(项,内部效用) | tu(Ti ) |
---|---|---|
T1 | (C, 7), (D, 1), (E, 1) | 15 |
T2 | (A, 1), (C, 2), (E, 2) | 11 |
T3 | (B, 6), (C, 4), (D, 3), (E, 7) | 54 |
T4 | (B, 5), (C, 3), (D, 9) | 72 |
T5 | (A, 3), (C, 10), (D, 3) | 43 |
T6 | (C, 5), (E, 9) | 23 |
T7 | (A, 6), (C, 9), (D, 2), (E, 5) | 61 |
T8 | (A, 1), (B, 6), (C, 2), (D, 5), (E, 3) | 61 |
表1 一个事务数据库
Tab. 1 One transaction database
TID | 事务(项,内部效用) | tu(Ti ) |
---|---|---|
T1 | (C, 7), (D, 1), (E, 1) | 15 |
T2 | (A, 1), (C, 2), (E, 2) | 11 |
T3 | (B, 6), (C, 4), (D, 3), (E, 7) | 54 |
T4 | (B, 5), (C, 3), (D, 9) | 72 |
T5 | (A, 3), (C, 10), (D, 3) | 43 |
T6 | (C, 5), (E, 9) | 23 |
T7 | (A, 6), (C, 9), (D, 2), (E, 5) | 61 |
T8 | (A, 1), (B, 6), (C, 2), (D, 5), (E, 3) | 61 |
项 | 外部效用 | 项 | 外部效用 |
---|---|---|---|
A | 5 | D | 6 |
B | 3 | E | 2 |
C | 1 |
表2 项外部效用表
Tab. 2 External utility table of items
项 | 外部效用 | 项 | 外部效用 |
---|---|---|---|
A | 5 | D | 6 |
B | 3 | E | 2 |
C | 1 |
项集 | 效用 | 项集 | 效用 | 项集 | 效用 |
---|---|---|---|---|---|
{ACD} | 131 | {BD} | 153 | {CDE} | 120 |
{ACDE} | 104 | {BDE} | 104 | {D} | 138 |
{AD} | 110 | {BCDE} | 110 | ||
{BCD} | 162 | {CD} | 173 |
表3 高效用项集
Tab. 3 High-utility itemsets
项集 | 效用 | 项集 | 效用 | 项集 | 效用 |
---|---|---|---|---|---|
{ACD} | 131 | {BD} | 153 | {CDE} | 120 |
{ACDE} | 104 | {BDE} | 104 | {D} | 138 |
{AD} | 110 | {BCDE} | 110 | ||
{BCD} | 162 | {CD} | 173 |
敏感项集(SI) | TIDs |
---|---|
{CD} | T1,T3,T4,T5,T7,T8 |
{BD} | T3,T4,T8 |
{ACD} | T5,T7,T8 |
表4 敏感项集表
Tab. 4 SI-table
敏感项集(SI) | TIDs |
---|---|
{CD} | T1,T3,T4,T5,T7,T8 |
{BD} | T3,T4,T8 |
{ACD} | T5,T7,T8 |
TID | 项 | 排列后的项和效用 |
---|---|---|
T1 | (C,7),(D,1) | (C,7),(D,6) |
T3 | (B,6),(C,4),(D,3) | (B,18),(C,4),(D,18), |
T4 | (B,5),(C,3),(D,9) | (B,15),(C,3),(D,54) |
T5 | (A,3),(C,10),(D,3) | (A,15),(C,10),(D,18) |
T7 | (A,6),(C,9),(D,2) | (A,30),(C,9),(D,12) |
T8 | (A,1),(B,6),(C,2),(D, 5) | (B,18),(A,5),(C,2),(D,30) |
表5 敏感事务表
Tab. 5 Sensitive transaction table
TID | 项 | 排列后的项和效用 |
---|---|---|
T1 | (C,7),(D,1) | (C,7),(D,6) |
T3 | (B,6),(C,4),(D,3) | (B,18),(C,4),(D,18), |
T4 | (B,5),(C,3),(D,9) | (B,15),(C,3),(D,54) |
T5 | (A,3),(C,10),(D,3) | (A,15),(C,10),(D,18) |
T7 | (A,6),(C,9),(D,2) | (A,30),(C,9),(D,12) |
T8 | (A,1),(B,6),(C,2),(D, 5) | (B,18),(A,5),(C,2),(D,30) |
项 | 次数 | 项 | 次数 |
---|---|---|---|
D | 6 | A | 3 |
C | 6 | B | 3 |
表6 敏感项计数表
Tab. 6 Counting table of sensitive items
项 | 次数 | 项 | 次数 |
---|---|---|---|
D | 6 | A | 3 |
C | 6 | B | 3 |
数据集 | 事务数量 | 项数量 | 平均长度 | 密度/% |
---|---|---|---|---|
foodmart | 4 141 | 1 559 | 4.42 | 0.28 |
retail | 88 162 | 16 470 | 1 030.00 | 6.25 |
t20i6d100k | 99 922 | 893 | 24.77 | 2.77 |
t25i10d10k | 9 976 | 929 | 24.77 | 2.67 |
表7 实验数据集
Tab. 7 Experimental datasets
数据集 | 事务数量 | 项数量 | 平均长度 | 密度/% |
---|---|---|---|---|
foodmart | 4 141 | 1 559 | 4.42 | 0.28 |
retail | 88 162 | 16 470 | 1 030.00 | 6.25 |
t20i6d100k | 99 922 | 893 | 24.77 | 2.77 |
t25i10d10k | 9 976 | 929 | 24.77 | 2.67 |
1 | CHEN M S, HAN J W, YU P S. Data mining: an overview from a database perspective[J]. IEEE Transactions on Knowledge and Data Engineering, 1996, 8(6): 866-883. 10.1109/69.553155 |
2 | HAN J W, PEI J, YIN Y W, et al. Mining frequent patterns without candidate generation: a frequent-pattern tree approach[J]. Data Mining and Knowledge Discovery, 2004, 8(1): 53-87. 10.1023/b:dami.0000005258.31418.83 |
3 | TELIKANI A, GANDOMI A H, SHAHBAHRAMI A. A survey of evolutionary computation for association rule mining[J]. Information Sciences, 2020, 524: 318-352. 10.1016/j.ins.2020.02.073 |
4 | QIN J W, WANG J Y, LI Q Y, et al. Differentially private frequent episode mining over event streams[J]. Engineering Applications of Artificial Intelligence, 2022, 110: No.104681. 10.1016/j.engappai.2022.104681 |
5 | WU Y X, ZHU C R, LI Y, et al. NetNCSP: nonoverlapping closed sequential pattern mining[J]. Knowledge-Based Systems, 2020, 196: No.105812. 10.1016/j.knosys.2020.105812 |
6 | SRIVASTAVA G, LIN J C W, PIROUZ M, et al. A pre-large weighted-fusion system of sensed high-utility patterns[J]. IEEE Sensors Journal, 2021, 21(14): 15626-15634. 10.1109/jsen.2020.2991045 |
7 | SONG W, HUANG C M. Mining high average-utility itemsets based on particle swarm optimization[J]. Data Science and Pattern Recognition, 2020, 4(2): 19-32. |
8 | YAO H, HAMILTON H J, GENG L Q. A unified framework for utility-based measures for mining itemsets[C]// Proceedings of the 2nd Workshop on Utility-Based Data Mining Held in Conjunction with the KDD Conference. New York: ACM, 2006:28-37. |
9 | GENG L Q, HAMILTON H J. Interestingness measures for data mining: a survey[J]. ACM Computing Surveys, 2006, 38(3): No.9-es. 10.1145/1132960.1132963 |
10 | TAN P N, KUMAR V, SRIVASTAVA J. Selecting the right objective measure for association analysis[J]. Information Systems, 2004, 29(4): 293-313. 10.1016/S0306-4379(03)00072-3 |
11 | McGARRY K. A survey of interestingness measures for knowledge discovery[J]. The Knowledge Engineering Review, 2005, 20(1): 39-61. 10.1017/s0269888905000408 |
12 | HILDERMAN R J, HAMILTON H J. Measuring the interestingness of discovered knowledge: a principled approach[J]. Intelligent Data Analysis, 2003, 7(4): 347-382. 10.3233/ida-2003-7406 |
13 | SILBERSCHATZ A, TUZHILIN A. On subjective measures of interestingness in knowledge discovery[C]// Proceedings of the First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press, 1995:275-281. |
14 | DE BIE T. Maximum entropy models and subjective interestingness: an application to tiles in binary databases[J]. Data Mining and Knowledge Discovery, 2011, 23(3): 407-446. 10.1007/s10618-010-0209-3 |
15 | YAO H, HAMILTON H J. Mining itemset utilities from transaction databases[J]. Data and Knowledge Engineering, 2006, 59(3): 603-626. 10.1016/j.datak.2005.10.004 |
16 | 冯登国,张敏,李昊. 大数据安全与隐私保护[J]. 计算机学报, 2014, 37(1): 246-258. |
FENG D G, ZHANG M, LI H. Big data security and privacy protection[J]. Chinese Journal of Computers, 2014, 37(1): 246-258. | |
17 | BERTINO E, LIN D, JIANG W. A survey of quantification of privacy preserving data mining algorithms[M]// AGGARWAL C C, YU P S. Privacy-Preserving Data Mining: Models and Algorithms. Boston: Springer, 2008: 183-205. 10.1007/978-0-387-70992-5_8 |
18 | 张啸剑,孟小峰. 面向数据发布和分析的差分隐私保护[J]. 计算机学报, 2014, 37(4): 927-949. 10.3724/SP.J.1016.2014.00927 |
ZHANG X J, MENG X F. Differential privacy in data publication and analysis[J]. Chinese Journal of Computers, 2014, 37(4): 927-949. 10.3724/SP.J.1016.2014.00927 | |
19 | SLIJEPČEVIĆ D, HENZL M, DANIEL KLAUSNER L, et al. k-Anonymity in practice: how generalisation and suppression affect machine learning classifiers[J]. Computers and Security, 2021, 111: No.102488. 10.1016/j.cose.2021.102488 |
20 | LIU J, TIAN Y, ZHOU Y, et al. Privacy preserving distributed data mining based on secure multi-party computation[J]. Computer Communications, 2020, 153: 208-216. 10.1016/j.comcom.2020.02.014 |
21 | YEH J S, HSU P C. HHUIF and MSICF: novel algorithms for privacy preserving utility mining[J]. Expert Systems with Applications, 2010, 37(7): 4779-4786. 10.1016/j.eswa.2009.12.038 |
22 | LIN J C W, GAN W S, FOURNIER-VIGER P, et al. Fast algorithms for mining high-utility itemsets with various discount strategies[J]. Advanced Engineering Informatics, 2016, 30(2): 109-126. 10.1016/j.aei.2016.02.003 |
23 | YUN U, KIM J. A fast perturbation algorithm using tree structure for privacy preserving utility mining[J]. Expert Systems with Applications, 2015, 42(3): 1149-1165. 10.1016/j.eswa.2014.08.037 |
24 | LIU X, CHEN G L, WEN S T, et al. An improved sanitization algorithm in privacy-preserving utility mining[J]. Mathematical Problems in Engineering, 2020, 2020: No.7489045. 10.1155/2020/7489045 |
25 | LIU X, WEN S T, ZUO W L. Effective sanitization approaches to protect sensitive knowledge in high-utility itemset mining[J]. Applied Intelligence, 2020, 50(1): 169-191. 10.1007/s10489-019-01524-2 |
26 | JANGRA S, TOSHNIWAL D. Efficient algorithms for victim item selection in privacy-preserving utility mining[J]. Future Generation Computer Systems, 2022, 128: 219-234. 10.1016/j.future.2021.10.008 |
27 | SHEN Y D, ZHANG Z, YANG Q. Objective-oriented utility-based association mining[C]// Proceedings of the 2002 IEEE International Conference on Data Mining. Piscataway: IEEE, 2002:426-433. 10.1109/icdm.2002.1183878 |
28 | HU J Y, MOJSILOVIC A. High-utility pattern mining: a method for discovery of high-utility item sets[J]. Pattern Recognition, 2007, 40(11): 3317-3324. 10.1016/j.patcog.2007.02.003 |
29 | AHMED C F, TANBEER S K, JEONG B S, et al. Efficient tree structures for high utility pattern mining in incremental databases[J]. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(12): 1708-1721. 10.1109/tkde.2009.46 |
30 | WU J M T, LIN J C W, TAMRAKAR A. High-utility itemset mining with effective pruning strategies[J]. ACM Transactions on Knowledge Discovery from Data, 2019, 13(6): No.58. 10.1145/3363571 |
31 | KRISHNAMOORTHY S. Pruning strategies for mining high utility itemsets[J]. Expert Systems with Applications, 2015, 42(5): 2371-2381. 10.1016/j.eswa.2014.11.001 |
32 | ZIDA S, FOURNIER-VIGER P, LIN J C W, et al. EFIM: a highly efficient algorithm for high-utility itemset mining[C]// Proceedings of the 2015 Mexican International Conference on Artificial Intelligence, LNCS 9413. Cham: Springer, 2015:530-546. |
33 | GAN W S, LIN J C W, FOURNIER-VIGER P, et al. HUOPM: high-utility occupancy pattern mining[J]. IEEE Transactions on Cybernetics, 2020, 50(3): 1195-1208. 10.1109/tcyb.2019.2896267 |
34 | WU J M T, SRIVASTAVA G, WEI M, et al. Fuzzy high-utility pattern mining in parallel and distributed Hadoop framework[J]. Information Sciences, 2021, 553: 31-48. 10.1016/j.ins.2020.12.004 |
35 | LI S X, MU N K, LE J Q, et al. A novel algorithm for privacy preserving utility mining based on integer linear programming[J]. Engineering Applications of Artificial Intelligence, 2019, 81: 300-312. 10.1016/j.engappai.2018.12.006 |
36 | DINH T, QUANG M N, LE B. A novel approach for hiding high utility sequential patterns[C]// Proceedings of Proceedings of the 6th International Symposium on Information and Communication Technology. New York: ACM, 2015:121-128. 10.1145/2833258.2833271 |
37 | QUANG M N, HUYNH U, DINH T, et al. An approach to decrease execution time and difference for hiding high utility sequential patterns[C]// Proceedings of the 2016 International Symposium on Integrated Uncertainty in Knowledge Modelling and Decision Making, LNCS 9978. Cham: Springer, 2016:435-446. |
38 | LE B, DINH D T, HUYNH V N, et al. An efficient algorithm for Hiding High Utility Sequential Patterns[J]. International Journal of Approximate Reasoning, 2018, 95: 77-92. 10.1016/j.ijar.2018.01.005 |
39 | ARYABARZAN N, MINAEI-BIDGOLI B, TESHNEHLAB M. negFIN: an efficient algorithm for fast mining frequent itemsets[J]. Expert Systems with Applications, 2018, 105: 129-143. 10.1016/j.eswa.2018.03.041 |
40 | LIN J C W, WU T Y, FOURNIER-VIGER P, et al. Fast algorithms for hiding sensitive high-utility itemsets in privacy-preserving utility mining[J]. Engineering Applications of Artificial Intelligence, 2016, 55: 269-284. 10.1016/j.engappai.2016.07.003 |
41 | DUONG Q H, FOURNIER-VIGER P, RAMAMPIARO H, et al. Efficient high utility itemset mining using buffered utility-lists[J]. Applied Intelligence, 2018, 48(7): 1859-1877. 10.1007/s10489-017-1057-2 |
42 | GE Z Q, SONG Z H, DING S X, et al. Data mining and analytics in the process industry: the role of machine learning[J]. IEEE Access, 2017, 5: 20590-20616. 10.1109/access.2017.2756872 |
43 | TASSA T. Secure mining of association rules in horizontally distributed databases[J]. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(4): 970-983. 10.1109/tkde.2013.41 |
44 | LIN J C W, LIU Q K, FOURNIER-VIGER P, et al. A sanitization approach for hiding sensitive itemsets based on particle swarm optimization[J]. Engineering Applications of Artificial Intelligence, 2016, 53: 1-18. 10.1016/j.engappai.2016.03.007 |
[1] | 陈学斌, 任志强, 张宏扬. 联邦学习中的安全威胁与防御措施综述[J]. 《计算机应用》唯一官方网站, 2024, 44(6): 1663-1672. |
[2] | 刘沛骞, 王水莲, 申自浩, 王辉. 基于轨迹扰动和路网匹配的位置隐私保护算法[J]. 《计算机应用》唯一官方网站, 2024, 44(5): 1546-1554. |
[3] | 高改梅, 张瑾, 刘春霞, 党伟超, 白尚旺. 基于区块链与CP-ABE策略隐藏的众包测试任务隐私保护方案[J]. 《计算机应用》唯一官方网站, 2024, 44(3): 811-818. |
[4] | 马海峰, 李玉霞, 薛庆水, 杨家海, 高永福. 用于实现区块链隐私保护的属性基加密方案[J]. 《计算机应用》唯一官方网站, 2024, 44(2): 485-489. |
[5] | 王一帆, 林绍福, 李云江. 基于区块链和零知识证明的高速公路自由流收费方法[J]. 《计算机应用》唯一官方网站, 2024, 44(12): 3741-3750. |
[6] | 刘晶鑫, 黄雯静, 徐亮胜, 黄冲, 吴建生. 字典学习与样本关联保持结合的无监督特征选择模型[J]. 《计算机应用》唯一官方网站, 2024, 44(12): 3766-3775. |
[7] | 王伊婷, 万武南, 张仕斌, 张金全, 秦智. 基于SM9算法的可链接环签名方案[J]. 《计算机应用》唯一官方网站, 2024, 44(12): 3709-3716. |
[8] | 梁静, 万武南, 张仕斌, 张金全, 秦智. 面向主从链的慈善系统溯源存储模型[J]. 《计算机应用》唯一官方网站, 2024, 44(12): 3751-3758. |
[9] | 方鹏, 赵凡, 王保全, 王轶, 蒋同海. 区块链3.0的发展、技术与应用[J]. 《计算机应用》唯一官方网站, 2024, 44(12): 3647-3657. |
[10] | 高瑞, 陈学斌, 张祖篡. 面向部分图更新的动态社交网络隐私发布方法[J]. 《计算机应用》唯一官方网站, 2024, 44(12): 3831-3838. |
[11] | 贾淼, 姚中原, 祝卫华, 高婷婷, 斯雪明, 邓翔. 零知识证明赋能区块链的进展与展望[J]. 《计算机应用》唯一官方网站, 2024, 44(12): 3669-3677. |
[12] | 陈学斌, 屈昌盛. 面向联邦学习的后门攻击与防御综述[J]. 《计算机应用》唯一官方网站, 2024, 44(11): 3459-3469. |
[13] | 周辉, 陈玉玲, 王学伟, 张洋文, 何建江. 基于生成对抗网络的联邦学习深度影子防御方案[J]. 《计算机应用》唯一官方网站, 2024, 44(1): 223-232. |
[14] | 崔剑阳, 蔡英, 张宇, 范艳芳. 车载自组织网络中格基签密的可认证隐私保护方案[J]. 《计算机应用》唯一官方网站, 2024, 44(1): 233-241. |
[15] | 陆佳行, 戴华, 刘源龙, 周倩, 杨庚. 面向云环境密文排序检索的字典划分向量空间模型[J]. 《计算机应用》唯一官方网站, 2023, 43(7): 1994-2000. |
阅读次数 | ||||||
全文 |
|
|||||
摘要 |
|
|||||