Statistically significant sequential patterns mining algorithm under influence degree

doi:10.11772/j.issn.1001-9081.2021071311

Journal of Computer Applications, 2022, Vol. 42, Issue (9): 2713-2721. DOI: 10.11772/j.issn.1001-9081.2021071311

• Data Science and Technology •

Jun WU, Aijia OUYANG, Lin ZHANG

School of Information Engineering, Zunyi Normal University, Zunyi Guizhou 563006, China

Received: 2021-07-19 Revised: 2021-10-22 Accepted: 2021-10-25 Online: 2021-11-10 Published: 2022-09-10
Contact: Jun WU
About the authors: OUYANG Aijia, born in 1975 in Loudi, Hunan, Ph. D., professor, CCF member. His research interests include intelligent computing and parallel computing.
ZHANG Lin, born in 1984 in Zunyi, Guizhou, M. S., associate professor. Her research interests include data mining.
Supported by:
National Natural Science Foundation of China (62066049); Joint Fund Program of Zunyi Science and Technology Bureau (ZSKHHZ (2022) 123)

Abstract

Traditional sequential pattern mining algorithms have two problems: the degree of support does not faithfully reflect how interesting a sequential pattern is, and the quality of the reported sequential patterns is not evaluated. To address them, a statistically significant sequential pattern mining algorithm based on influence degree, called ISSPM (Influence-based Significant Sequential Patterns Mining), was proposed. Firstly, all sequential patterns meeting the interestingness constraint were mined recursively. Then, the itemset permuting method was used to construct a permutation-test null distribution for these sequential patterns. Finally, the statistical measure of each evaluated sequential pattern was calculated from this null distribution, and all statistically significant sequential patterns were selected from the mined patterns. Experimental results on real-world sequential record datasets show that, compared with the PSPM (Prefix-projected Sequential Patterns Mining), SPDL (Sequential Patterns Discovering under Leverage) and PSDSP (Permutation Strategies for Discovering Sequential Patterns) algorithms, ISSPM reports fewer but more interesting sequential patterns. Experimental results on synthetic sequential record datasets show that false positive sequential patterns account for 3.39% on average of the results reported by ISSPM, and that its discovery rate of embedded patterns is never lower than 66.7%; both figures are significantly better than those of the three comparison algorithms. The statistically significant sequential patterns reported by ISSPM therefore reflect more valuable information in sequential record datasets, and further analysis and decisions based on this information are more reliable.

Key words: data mining, sequential pattern mining, interestingness measure, statistically significant pattern, permutation test
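The evaluation step described in the abstract (build a permutation-test null distribution, then score each mined pattern against it) can be illustrated with a minimal Python sketch. Several things here are assumptions rather than the authors' method: plain support stands in for the influence-degree measure, whose definition is not given on this page; "itemset permuting" is assumed to mean shuffling the order of the itemsets inside each record; and the helper names `contains`, `support`, `permute_itemsets` and `permutation_p_value` are hypothetical.

```python
import random

def contains(record, pattern):
    """Return True if `pattern` (a list of itemsets) occurs in `record`
    as a subsequence, each pattern itemset being a subset of a record itemset."""
    pos = 0
    for itemset in record:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)

def support(dataset, pattern):
    """Fraction of records that contain the pattern."""
    return sum(contains(record, pattern) for record in dataset) / len(dataset)

def permute_itemsets(dataset, rng):
    """Assumed null model: shuffle the order of itemsets within each record."""
    shuffled = []
    for record in dataset:
        copy = list(record)
        rng.shuffle(copy)
        shuffled.append(copy)
    return shuffled

def permutation_p_value(dataset, pattern, n_perm=200, seed=0):
    """Empirical p-value: share of random datasets whose statistic is at
    least as large as the observed one (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = support(dataset, pattern)
    hits = sum(
        support(permute_itemsets(dataset, rng), pattern) >= observed
        for _ in range(n_perm)
    )
    return (1 + hits) / (1 + n_perm)

# Toy data (made up for illustration): "a" always precedes "b".
toy = [[{"a"}, {"c"}, {"b"}] for _ in range(30)]
p_ordered = permutation_p_value(toy, [{"a"}, {"b"}])  # order matters here
p_single = permutation_p_value(toy, [{"c"}])          # unaffected by shuffling
```

In this toy setting the ordered pattern gets a small p-value because shuffling destroys the ordering, while the single-item pattern is present in every shuffled dataset and cannot be significant. A real implementation would also correct for multiple testing, which this sketch leaves out.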

CLC number: TP391.4

Jun WU, Aijia OUYANG, Lin ZHANG. Statistically significant sequential patterns mining algorithm under influence degree[J]. Journal of Computer Applications, 2022, 42(9): 2713-2721.

Figures/Tables (8)

Tab. 1 Top-5 sequential patterns with largest degree of support in ATS dataset

Sequential pattern	Support/%
⟨and, and⟩	13.0
⟨and, to⟩	9.8
⟨to, and⟩	9.1
⟨of, and⟩	8.6
⟨and, of⟩	8.0

Fig. 1 Random sequential record dataset generated by itemset permuting method

Tab. 2 Information of real-world sequential record datasets

Dataset	Records	Items	Average length	Repeated items
Book	788	3 844	96.5	Yes
Unix	4 015	1 103	26.4	Yes
Peptide	15 784	20	27.0	Yes
Bike	21 078	67	7.3	Yes

Fig. 2 Number of sequential patterns reported by each algorithm on real-world sequential record datasets

Tab. 3 Top-7 length-2 sequential patterns with the highest interestingness reported by each algorithm on the Book dataset

No.	PSPM	SPDL	PSDSP	ISSPM
1	⟨algorithm, algorithm⟩	⟨support, vector⟩	⟨algorithm, algorithm⟩	⟨paper, show⟩
2	⟨learn, learn⟩	⟨paper, algorithm⟩	⟨learn, learn⟩	⟨support, vector⟩
3	⟨learn, algorithm⟩	⟨paper, show⟩	⟨learn, algorithm⟩	⟨paper, algorithm⟩
4	⟨algorithm, learn⟩	⟨support, machine⟩	⟨algorithm, learn⟩	⟨base, result⟩
5	⟨data, data⟩	⟨vector, machine⟩	⟨data, data⟩	⟨vector, machine⟩
6	⟨learn, data⟩	⟨problem, algorithm⟩	⟨learn, data⟩	⟨learn, result⟩
7	⟨model, model⟩	⟨algorithm, problem⟩	⟨model, model⟩	⟨paper, method⟩

Tab. 4 Number of non-false positive patterns and false positive patterns reported by each algorithm on synthetic sequential record datasets

Algorithm	Sequential patterns	Non-false positive patterns	False positive patterns
PSPM	17 925.6	3 176.4	14 749.2
SPDL	10 864.5	2 284.6	8 579.9
PSDSP	1 986.4	1 896.8	89.6
ISSPM^FDR	1 216.2	1 174.9	41.3
ISSPM^FWER	868.7	860.6	8.1
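To make the counts in Tab. 4 easier to compare, the short Python sketch below turns them into false-positive shares. It only does arithmetic on the values printed in the table; the names `table4` and `false_positive_share` are hypothetical. Note that it computes a ratio of averaged counts, which need not coincide exactly with an average of per-dataset proportions such as the 3.39% quoted in the abstract.

```python
# Average counts copied from Tab. 4:
# (reported patterns, non-false positive patterns, false positive patterns)
table4 = {
    "PSPM": (17925.6, 3176.4, 14749.2),
    "SPDL": (10864.5, 2284.6, 8579.9),
    "PSDSP": (1986.4, 1896.8, 89.6),
    "ISSPM^FDR": (1216.2, 1174.9, 41.3),
    "ISSPM^FWER": (868.7, 860.6, 8.1),
}

def false_positive_share(total, false_positive):
    """Share of reported patterns that are false positives, in percent."""
    return 100.0 * false_positive / total

shares = {
    name: false_positive_share(total, fp)
    for name, (total, _non_fp, fp) in table4.items()
}

for name, share in shares.items():
    print(f"{name}: {share:.2f}% false positives")
```

Read this way, the two algorithms that evaluate significance (PSDSP and ISSPM) keep false positives to a few percent of their output, while most patterns reported by PSPM and SPDL are false positives.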

Tab. 5 Discovery rate of embedded patterns of each algorithm on synthetic sequential record datasets (%)

Dataset No.	PSPM	SPDL	PSDSP	ISSPM
1	0.0	66.7	0.0	100.0
2	11.1	44.4	11.1	88.9
3	22.2	44.4	22.2	77.8
4	0.0	55.6	0.0	100.0
5	11.1	44.4	11.1	77.8
6	22.2	33.3	22.2	66.7
7	22.2	33.3	22.2	77.8
8	0.0	66.7	0.0	100.0
9	22.2	33.3	22.2	66.7
10	11.1	44.4	11.1	88.9
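The per-dataset discovery rates in Tab. 5 can be summarized with a few lines of Python. The sketch below only aggregates the printed values (minimum, mean, maximum per algorithm); the names `table5` and `summary` are hypothetical, and no information beyond the table is used.

```python
from statistics import mean

# Discovery rates (%) copied from Tab. 5, one value per synthetic dataset 1..10.
table5 = {
    "PSPM": [0.0, 11.1, 22.2, 0.0, 11.1, 22.2, 22.2, 0.0, 22.2, 11.1],
    "SPDL": [66.7, 44.4, 44.4, 55.6, 44.4, 33.3, 33.3, 66.7, 33.3, 44.4],
    "PSDSP": [0.0, 11.1, 22.2, 0.0, 11.1, 22.2, 22.2, 0.0, 22.2, 11.1],
    "ISSPM": [100.0, 88.9, 77.8, 100.0, 77.8, 66.7, 77.8, 100.0, 66.7, 88.9],
}

# (minimum, mean, maximum) discovery rate for each algorithm
summary = {name: (min(rates), mean(rates), max(rates)) for name, rates in table5.items()}

for name, (lowest, average, highest) in summary.items():
    print(f"{name}: min {lowest:.1f}%, mean {average:.2f}%, max {highest:.1f}%")
```

The minimum for ISSPM is the 66.7% lower bound quoted in the abstract, and on every dataset its rate exceeds that of each comparison algorithm.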

Tab. 6 Average running time of each algorithm on synthetic sequential record datasets (s)

Algorithm	Mining phase	Evaluation phase	Total
PSPM	236.2	-	236.2
SPDL	272.5	-	272.5
PSDSP	72.7	7 163.2	7 235.9
ISSPM	215.4	581.3	796.7
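As a rough reading aid for Tab. 6, the Python sketch below compares the two algorithms that have an evaluation phase. It is plain arithmetic on the printed averages; the names `table6`, `total_time`, `eval_speedup` and `total_speedup` are hypothetical.

```python
# Average running times (s) copied from Tab. 6: (mining phase, evaluation phase).
# PSPM and SPDL list no evaluation phase, recorded here as None.
table6 = {
    "PSPM": (236.2, None),
    "SPDL": (272.5, None),
    "PSDSP": (72.7, 7163.2),
    "ISSPM": (215.4, 581.3),
}

def total_time(mining, evaluation):
    """Total running time; a missing evaluation phase counts as zero."""
    return mining + (evaluation or 0.0)

# How many times faster ISSPM is than PSDSP, per phase and overall.
eval_speedup = table6["PSDSP"][1] / table6["ISSPM"][1]
total_speedup = total_time(*table6["PSDSP"]) / total_time(*table6["ISSPM"])

print(f"evaluation phase: {eval_speedup:.2f}x, total: {total_speedup:.2f}x")
```

On these averages ISSPM spends longer mining than PSDSP but far less time evaluating, so its total time is lower; PSPM and SPDL remain faster overall because they skip evaluation entirely.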
