Parallel high utility pattern mining algorithm based on cluster partition

doi:10.11772/j.issn.1001-9081.2016.08.2202

Journal of Computer Applications, 2016, Vol. 36, Issue (8): 2202-2206.

XING Shuning^1,2, LIU Fang'ai^1,2, ZHAO Xiaohui^1,2

1. College of Information Science and Engineering, Shandong Normal University, Jinan Shandong 250014, China;
2. Shandong Provincial Key Laboratory for Novel Distributed Computer Software Technology (Shandong Normal University), Jinan Shandong 250014, China

Received: 2016-01-11 Revised: 2016-02-27 Online: 2016-08-10 Published: 2016-08-10
Corresponding author: LIU Fang'ai
About the authors: XING Shuning (born 1992), female, from Qingdao, Shandong, M.S. candidate, CCF member; main research interests: data mining and big data analysis. LIU Fang'ai (born 1962), male, from Qingdao, Shandong, professor, doctoral supervisor, Ph.D.; main research interests: parallel computing models, distributed networks, and data mining. ZHAO Xiaohui (born 1981), female, from Fanxian, Henan, lecturer, Ph.D. candidate; main research interests: complex networks and data mining.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (90612003, 61572301).

Abstract

Existing algorithms generate a large number of in-memory utility pattern trees when mining high utility patterns in large-scale databases, which consumes considerable memory and can cause some high utility itemsets to be lost. To address this problem, a parallel high utility pattern mining algorithm based on cluster partition, named PUCP, was proposed on the Hadoop distributed computing platform. Firstly, a clustering method was used to divide similar transactions in the database into several sub-datasets. Secondly, the sub-datasets were allocated to the nodes of the Hadoop platform to construct utility pattern trees. Finally, the conditional pattern bases of the same item, generated from the utility pattern trees on the different nodes, were allocated to the same node for mining, which reduces the number of cross-node operations. Theoretical analysis and experimental results show that, compared with the mainstream serial high utility pattern mining algorithm UP-Growth (Utility Pattern Growth) and the existing parallel high utility pattern mining algorithm PHUI-Growth (Parallel mining High Utility Itemsets by pattern-Growth), PUCP improves mining efficiency by 61.2% and 16.6% respectively without affecting the reliability of the mining results; in addition, using the Hadoop platform effectively relieves the memory pressure of mining large-scale data.

Key words: big data, high utility pattern mining, clustering, parallel computing, Hadoop
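
The three steps named in the abstract can be illustrated with a small single-process sketch. This is not the authors' implementation and does not use Hadoop: the toy transactions, the `PROFIT` table, the leader-style clustering with a Jaccard threshold, and all function names are assumptions made for illustration, since the abstract does not specify the clustering method. The utility pattern tree is also replaced by directly emitting each item's prefix path in a global order (descending transaction-weighted utility, as in UP-Growth), tagged with the utility of the path up to and including that item.

```python
from collections import defaultdict

# Hypothetical toy data: each transaction maps item -> purchased quantity.
TRANSACTIONS = [
    {"a": 1, "b": 2},
    {"a": 2, "b": 1, "c": 1},
    {"d": 3, "e": 1},
    {"d": 1, "e": 2, "c": 1},
]
# Hypothetical unit profits (external utilities) per item.
PROFIT = {"a": 5, "b": 2, "c": 1, "d": 3, "e": 4}


def twu_order(transactions):
    """Global item order: descending transaction-weighted utility, ties by name."""
    twu = defaultdict(int)
    for t in transactions:
        tu = sum(q * PROFIT[i] for i, q in t.items())  # transaction utility
        for i in t:
            twu[i] += tu
    return sorted(twu, key=lambda i: (-twu[i], i))


def jaccard(x, y):
    """Jaccard similarity between two non-empty item sets."""
    return len(x & y) / len(x | y)


def cluster_partition(transactions, threshold):
    """Step 1 (assumed leader clustering): a transaction joins the first
    partition whose leader is similar enough, else it opens a new partition."""
    partitions = []  # list of (leader_item_set, member_transactions)
    for t in transactions:
        items = set(t)
        for leader, members in partitions:
            if jaccard(items, leader) >= threshold:
                members.append(t)
                break
        else:
            partitions.append((items, [t]))
    return [members for _, members in partitions]


def emit_pattern_bases(partition, order):
    """Step 2 (simplified, no tree): for every item of every transaction,
    emit its prefix path and the utility of the path including the item."""
    for t in partition:
        path = sorted(t, key=order.index)
        for k, item in enumerate(path):
            utility = sum(t[x] * PROFIT[x] for x in path[: k + 1])
            yield item, (tuple(path[:k]), utility)


def shuffle_by_item(partitions, order):
    """Step 3: collect the pattern bases of the same item from all partitions
    in one place, as a shuffle keyed on the item would."""
    grouped = defaultdict(list)
    for part in partitions:
        for item, base in emit_pattern_bases(part, order):
            grouped[item].append(base)
    return dict(grouped)


order = twu_order(TRANSACTIONS)
partitions = cluster_partition(TRANSACTIONS, threshold=0.5)
grouped = shuffle_by_item(partitions, order)
```

In this toy run the two `{a, b, ...}` transactions and the two `{d, e, ...}` transactions end up in separate partitions, while the pattern bases of item `c`, which occurs in both partitions, are gathered under a single key. In a MapReduce setting that key would determine which node mines the item.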

CLC Number: TP301.6

Cite this article:

XING Shuning, LIU Fang'ai, ZHAO Xiaohui. Parallel high utility pattern mining algorithm based on cluster partition[J]. Journal of Computer Applications, 2016, 36(8): 2202-2206.


