Top-k high average utility sequential pattern mining algorithm under one-off condition

Journal of Computer Applications, 2024, Vol. 44, Issue (2): 477-484. DOI: 10.11772/j.issn.1001-9081.2023030268

• Data Science and Technology •

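Keshuai YANG¹, Youxi WU¹, Meng GENG¹, Jingyu LIU¹, Yan LI²

1. School of Artificial Intelligence, Hebei University of Technology, Tianjin 300401, China
2. School of Economics and Management, Hebei University of Technology, Tianjin 300401, China

Received: 2023-03-13 Revised: 2023-05-17 Accepted: 2023-05-29 Online: 2023-06-16 Published: 2024-02-10
Corresponding author: Youxi WU
About the authors: YANG Keshuai (born 1998), male, from Puyang, Henan, M.S. candidate, CCF member. His research interests include data mining.
GENG Meng (born 1997), female, from Shijiazhuang, Hebei, Ph.D. candidate, CCF member. Her research interests include data mining.
LIU Jingyu (born 1976), male, from Tianjin, Ph.D., associate professor. His research interests include network storage and information security.
LI Yan (born 1975), female, from Tianjin, Ph.D., associate professor. Her research interests include data mining and supply chain management.
Supported by:
National Natural Science Foundation of China (61976240)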

Top-k high average utility sequential pattern mining algorithm under one-off condition

Keshuai YANG¹, Youxi WU¹, Meng GENG¹, Jingyu LIU¹, Yan LI²

1. School of Artificial Intelligence, Hebei University of Technology, Tianjin 300401, China
2. School of Economics and Management, Hebei University of Technology, Tianjin 300401, China

Received: 2023-03-13 Revised: 2023-05-17 Accepted: 2023-05-29 Online: 2023-06-16 Published: 2024-02-10
Contact: Youxi WU
About the authors: YANG Keshuai, born in 1998, M.S. candidate. His research interests include data mining.
GENG Meng, born in 1997, M.S. candidate. Her research interests include data mining.
LIU Jingyu, born in 1976, Ph.D., associate professor. His research interests include network storage and information security.
LI Yan, born in 1975, Ph.D., associate professor. Her research interests include data mining and supply chain management.
Supported by:
National Natural Science Foundation of China (61976240)

Abstract:

To address the problem that traditional Sequential Pattern Mining (SPM) does not consider pattern repetition and ignores the effects of item utility (unit price or profit) and pattern length on user interest, a Top-k One-off high average Utility sequential Pattern mining (TOUP) algorithm was proposed. The TOUP algorithm consists of two core steps: average utility calculation and candidate pattern generation. Firstly, a CSP (Calculation Support of Pattern) algorithm based on the occurrence positions of each item and an item repetition relation array was proposed to calculate pattern support, enabling fast calculation of the average utility of patterns. Secondly, candidate patterns were generated by itemset extension and sequence extension, and a maximum average utility upper bound was proposed, on the basis of which candidate patterns were pruned effectively. Experimental results on five real datasets and one synthetic dataset show that, compared with the TOUP-dfs and HAOP-ms algorithms, TOUP reduces the number of candidate patterns by 38.5% to 99.8% and 0.9% to 77.6%, respectively, and the running time by 33.6% to 97.1% and 57.9% to 97.2%, respectively. TOUP therefore performs better and mines patterns of interest to users more efficiently.

Key words: data mining, sequential pattern mining, high average utility, one-off condition, top-k
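
The abstract states that candidate patterns are generated by itemset extension and sequence extension. The sketch below shows those two operations as they are commonly defined in sequential pattern mining. The function names (`s_extensions`, `i_extensions`, `extend`), the tuple-of-tuples pattern encoding, and the alphabetical item order are assumptions made for illustration. This is not the TOUP implementation and it omits the upper-bound pruning.

```python
from typing import Iterable

# A pattern is a sequence of itemsets; each itemset is a sorted tuple of items.
Pattern = tuple[tuple[str, ...], ...]


def s_extensions(pattern: Pattern, items: Iterable[str]) -> list[Pattern]:
    """Sequence extension: append a new itemset that holds a single item."""
    return [pattern + ((item,),) for item in sorted(items)]


def i_extensions(pattern: Pattern, items: Iterable[str]) -> list[Pattern]:
    """Itemset extension: add an item to the last itemset.

    Only items that sort after the current last item are added, so each
    itemset is generated once (an assumed, conventional ordering rule).
    """
    if not pattern:
        return []
    last = pattern[-1]
    return [
        pattern[:-1] + (last + (item,),)
        for item in sorted(items)
        if item > last[-1]
    ]


def extend(pattern: Pattern, items: Iterable[str]) -> list[Pattern]:
    """All candidates that are one item longer than `pattern`."""
    item_list = sorted(items)  # materialise once so both calls see every item
    return i_extensions(pattern, item_list) + s_extensions(pattern, item_list)


# Items a, b, c, d are the ones that appear in Tab. 2.
items = ["a", "b", "c", "d"]
seed: Pattern = (("a",),)
candidates = extend(seed, items)
for candidate in candidates:
    print(candidate)
```

Starting from the single-item pattern `(a)`, the itemset extensions are `(ab)`, `(ac)`, `(ad)` and the sequence extensions are `(a)(a)`, `(a)(b)`, `(a)(c)`, `(a)(d)`.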

CLC Number: TP311.1

Cite this article: Keshuai YANG, Youxi WU, Meng GENG, Jingyu LIU, Yan LI. Top-k high average utility sequential pattern mining algorithm under one-off condition[J]. Journal of Computer Applications, 2024, 44(2): 477-484.

Figures/Tables (10)

Tab. 1 Sequence database D

Sequence ID	Sequence
S1	(1, ab) (2, a) (3, abc) (4, ab) (5, ad) (6, cd)
S2	(1, bd) (2, ab) (3, acd) (4, abc) (5, ac) (6, ac)

Tab. 2 Utility table

Item	Utility	Item	Utility
a	10	c	8
b	5	d	3
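
To make the roles of support and utility concrete, the sketch below combines the utilities of Tab. 2 with a simple support count. Two things are assumptions rather than statements from this page: the average utility is taken to be support times the summed item utilities divided by pattern length, and the support is counted by a plain greedy leftmost matcher in which no sequence position is reused. The greedy matcher is a stand-in and is not the CSP algorithm, and the toy single-item sequence `abacab` is hypothetical (Tab. 1 holds itemset sequences, which this sketch does not handle).

```python
# Item utilities copied from Tab. 2.
UTILITY = {"a": 10, "b": 5, "c": 8, "d": 3}


def greedy_one_off_support(sequence: str, pattern: str) -> int:
    """Count leftmost occurrences of `pattern`, never reusing a position."""
    if not pattern:
        return 0
    used = [False] * len(sequence)
    count = 0
    while True:
        pos = -1
        chosen = []
        for item in pattern:
            # Next unused position after `pos` that holds `item`.
            pos = next(
                (i for i in range(pos + 1, len(sequence))
                 if not used[i] and sequence[i] == item),
                None,
            )
            if pos is None:
                return count  # no occurrence is left among unused positions
            chosen.append(pos)
        for i in chosen:
            used[i] = True
        count += 1


def average_utility(pattern: str, support: int, utility=UTILITY) -> float:
    """Assumed definition: support * sum of item utilities / pattern length."""
    return support * sum(utility[item] for item in pattern) / len(pattern)


toy_sequence = "abacab"  # hypothetical example, not taken from Tab. 1
sup = greedy_one_off_support(toy_sequence, "ab")
au = average_utility("ab", sup)
print(sup, au)
```

For the toy sequence the matcher picks positions (0, 1) and (2, 5) for pattern `ab`, so the support is 2 and the assumed average utility is 2 × (10 + 5) / 2 = 15.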

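Tab. 3 Experimental datasets

Dataset	Type	Number of sequences	Number of items	Number of itemsets	Total length
SDB1	Itemset sequence	5 026	6	55 524	140 182
SDB2	Itemset sequence	1 248	276	8 134	73 206
SDB3	Itemset sequence	1 096	5 390	6 111	54 999
SDB4	Itemset sequence	69	2 520	178	1 403
SDB5	Single-item sequence	668	231	35 100	35 100
SDB6	Single-item sequence	1 663	6 417	57 852	57 852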

Tab. 4 Comparison of number of candidate patterns generated by different algorithms on six datasets

Algorithm	SDB1	SDB2	SDB3	SDB4	SDB5	SDB6
TOUP-rf	369	4 244	8 152	4 152	7 320	10 109
TOUP-nus	1 206	114 990	2 268 510	480 672	70 438	1 263 717
TOUP-dfs	600	53 045	63 818	2 135 126	69 631	57 427
HAOP-ms	1 650	14 583	17 124	9 727	9 791	10 205
HANP-oms	1 673	14 764	17 124	9 727	10 095	10 343
PMBC-ms	369	4 244	8 152	4 152	7 320	10 109
TOUP	369	4 244	8 152	4 152	7 320	10 109

Tab. 5 Comparison of running time among different algorithms on six datasets (s)

Algorithm	SDB1	SDB2	SDB3	SDB4	SDB5	SDB6
TOUP-rf	123	224	339	14	804	1 825
TOUP-nus	84	123	1 188	82	478	757
TOUP-dfs	107	446	866	456	91	1 594
HAOP-ms	591	995	699	138	1 343	2 174
HANP-oms	628	1 022	710	14	1 383	2 213
PMBC-ms	98	190	346	14	568	1 826
TOUP	71	151	294	13	37	417
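
The percentage ranges quoted in the abstract can be recomputed from the TOUP-dfs, HAOP-ms and TOUP rows of Tab. 4 and Tab. 5. The short check below does this, assuming the reduction is measured per dataset as (baseline − TOUP) / baseline. The helper name `reduction_range` is introduced here for illustration only.

```python
# Rows copied from Tab. 4 (candidate patterns) and Tab. 5 (running time, s),
# in dataset order SDB1..SDB6.
candidates = {
    "TOUP-dfs": [600, 53045, 63818, 2135126, 69631, 57427],
    "HAOP-ms": [1650, 14583, 17124, 9727, 9791, 10205],
    "TOUP": [369, 4244, 8152, 4152, 7320, 10109],
}
runtime = {
    "TOUP-dfs": [107, 446, 866, 456, 91, 1594],
    "HAOP-ms": [591, 995, 699, 138, 1343, 2174],
    "TOUP": [71, 151, 294, 13, 37, 417],
}


def reduction_range(baseline, proposed):
    """Smallest and largest per-dataset reduction, in percent (1 decimal)."""
    reductions = [100 * (b - p) / b for b, p in zip(baseline, proposed)]
    return round(min(reductions), 1), round(max(reductions), 1)


for name, table in (("candidates", candidates), ("runtime", runtime)):
    for baseline in ("TOUP-dfs", "HAOP-ms"):
        low, high = reduction_range(table[baseline], table["TOUP"])
        print(f"{name} vs {baseline}: {low}% to {high}%")
```

Under that assumption the computed ranges agree with the abstract: 38.5% to 99.8% and 0.9% to 77.6% for candidate patterns, and 33.6% to 97.1% and 57.9% to 97.2% for running time.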

Tab. 6 Comparison of memory consumption among different algorithms on six datasets (MB)

Algorithm	SDB1	SDB2	SDB3	SDB4	SDB5	SDB6
TOUP-rf	52	47	46	40	48	55
TOUP-nus	54	50	50	41	47	54
TOUP-dfs	54	50	49	40	47	52
HAOP-ms	52	50	50	45	49	56
HANP-oms	52	50	50	47	49	56
PMBC-ms	52	47	49	44	48	55
TOUP	52	47	46	40	47	52

Tab. 7 Comparison of running time among different algorithms on SDB1_1 to SDB1_6 datasets (s)

Algorithm	SDB1_1	SDB1_2	SDB1_3	SDB1_4	SDB1_5	SDB1_6
TOUP-rf	123	267	386	508	636	761
TOUP-nus	84	217	325	410	530	620
TOUP-dfs	107	237	373	522	673	777
HAOP-ms	591	1 184	1 808	2 579	3 167	3 689
HANP-oms	628	1 189	1 811	2 587	3 171	3 723
PMBC-ms	98	199	302	409	510	611
TOUP	71	161	242	323	401	484

Tab. 8 Comparison of memory consumption among different algorithms on SDB1_1 to SDB1_6 datasets (MB)

Algorithm	SDB1_1	SDB1_2	SDB1_3	SDB1_4	SDB1_5	SDB1_6
TOUP-rf	52	64	77	89	103	116
TOUP-nus	54	68	84	102	121	136
TOUP-dfs	54	67	83	97	115	132
HAOP-ms	52	65	77	90	103	116
HANP-oms	52	65	78	90	103	116
PMBC-ms	52	64	77	90	103	116
TOUP	52	64	77	90	103	116

Tab. 9 Comparison of number of generated candidate patterns with different parameter k

Algorithm	k=10	k=12	k=14	k=16	k=18	k=20
TOUP-rf	369	391	485	758	942	956
TOUP-nus	1 206	1 806	2 464	3 738	5 160	6 606
TOUP-dfs	600	627	771	942	1 173	1 190
HAOP-ms	1 650	2 130	4 381	5 294	6 754	7 287
HANP-oms	1 673	2 186	4 512	5 515	6 892	7 621
PMBC-ms	369	391	485	758	942	956
TOUP	369	391	485	758	942	956

Tab. 10 Comparison of running time with different parameter k (s)

Algorithm	k=10	k=12	k=14	k=16	k=18	k=20
TOUP-rf	123	146	169	255	318	321
TOUP-nus	84	105	141	175	217	264
TOUP-dfs	107	163	214	310	408	514
HAOP-ms	591	716	1 465	1 794	2 211	2 343
HANP-oms	628	775	1 549	1 893	2 353	2 521
PMBC-ms	98	101	131	200	247	255
TOUP	71	79	93	141	174	176
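
In Tab. 9 the number of candidate patterns grows with k for every algorithm. One plausible reading, common to top-k miners in general and not confirmed by this page, is that the pruning threshold is the k-th best value found so far, so a larger k gives a lower threshold and weaker pruning. The sketch below illustrates that mechanism with a min-heap. The class `TopK`, its methods, and the scores are hypothetical and are not taken from TOUP.

```python
import heapq


class TopK:
    """Keep the k highest-valued patterns seen so far in a min-heap."""

    def __init__(self, k: int):
        self.k = k
        self._heap: list[tuple[float, str]] = []  # smallest value at index 0

    @property
    def threshold(self) -> float:
        """Value a new pattern must exceed once k patterns are held."""
        return self._heap[0][0] if len(self._heap) == self.k else 0.0

    def offer(self, pattern: str, value: float) -> None:
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (value, pattern))
        elif value > self._heap[0][0]:
            heapq.heapreplace(self._heap, (value, pattern))

    def can_prune(self, upper_bound: float) -> bool:
        """True if nothing bounded by `upper_bound` could enter the top k."""
        return len(self._heap) == self.k and upper_bound <= self.threshold

    def result(self) -> list[tuple[float, str]]:
        return sorted(self._heap, reverse=True)


# Hypothetical average utilities, for illustration only.
scores = [("a", 30.0), ("ab", 22.5), ("b", 10.0), ("ac", 27.0), ("c", 16.0)]

top2, top4 = TopK(2), TopK(4)
for pattern, value in scores:
    top2.offer(pattern, value)
    top4.offer(pattern, value)

print(top2.result(), top2.threshold)
print(top4.result(), top4.threshold)
```

With these scores the threshold ends at 27.0 for k=2 and at 16.0 for k=4, so a candidate whose upper bound is 20 would be pruned for k=2 but kept for k=4.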

