《计算机应用》唯一官方网站 ›› 2024, Vol. 44 ›› Issue (2): 477-484.DOI: 10.11772/j.issn.1001-9081.2023030268

• 数据科学与技术 • 上一篇    

一次性条件下top-k高平均效用序列模式挖掘算法

杨克帅1, 武优西1(), 耿萌1, 刘靖宇1, 李艳2   

  1. 1.河北工业大学 人工智能与数据科学学院,天津 300401
    2.河北工业大学 经济管理学院,天津 300401
  • 收稿日期:2023-03-13 修回日期:2023-05-17 接受日期:2023-05-29 发布日期:2023-06-16 出版日期:2024-02-10
  • 通讯作者: 武优西
  • 作者简介:杨克帅(1998—),男,河南濮阳人,硕士研究生,CCF会员,主要研究方向:数据挖掘
    耿萌(1997—),女,河北石家庄人,博士研究生,CCF会员,主要研究方向:数据挖掘
    刘靖宇(1976—),男,天津人,副教授,博士,主要研究方向:网络存储、信息安全
    李艳(1975—),女,天津人,副教授,博士,主要研究方向:数据挖掘、供应链管理。
  • 基金资助:
    国家自然科学基金资助项目(61976240)

Top-k high average utility sequential pattern mining algorithm under one-off condition

Keshuai YANG1, Youxi WU1(), Meng GENG1, Jingyu LIU1, Yan LI2   

  1. 1.School of Artificial Intelligence,Hebei University of Technology,Tianjin 300401,China
    2.School of Economics and Management,Hebei University of Technology,Tianjin 300401,China
  • Received:2023-03-13 Revised:2023-05-17 Accepted:2023-05-29 Online:2023-06-16 Published:2024-02-10
  • Contact: Youxi WU
  • About author:YANG Keshuai, born in 1998, M. S. candidate. His research interests include data mining.
    GENG Meng, born in 1997, M. S. candidate. Her research interests include data mining.
    LIU Jingyu, born in 1976, Ph. D., associate professor. His research interests include network storage, information security.
    LI Yan, born in 1975, Ph. D., associate professor. Her research interests include data mining, supply chain management.
  • Supported by:
    National Natural Science Foundation of China(61976240)

摘要:

针对传统序列模式挖掘(SPM)不考虑模式重复性且忽略各项的效用(单价或利润)与模式长度对用户兴趣度影响的问题,提出一次性条件下top-k高平均效用序列模式挖掘(TOUP)算法。TOUP算法主要包括两个核心步骤:平均效用计算和候选模式生成。首先,提出基于各项出现位置与项重复关系数组的CSP(Calculation Support of Pattern)算法计算模式支持度,从而实现模式平均效用的快速计算;其次,采用项集扩展和序列扩展生成候选模式,并提出了最大平均效用上界,基于该上界实现对候选模式的有效剪枝。在5个真实数据集和1个合成数据集上的实验结果表明,相较于TOUP-dfs和HAOP-ms算法,TOUP算法的候选模式数分别降低了38.5%~99.8%和0.9%~77.6%;运行时间分别降低了33.6%~97.1%和57.9%~97.2%。TOUP的算法性能更优,能更高效地挖掘用户感兴趣的模式。

关键词: 数据挖掘, 序列模式挖掘, 高平均效用, 一次性条件, top-k

Abstract:

To address the issue that traditional Sequential Pattern Mining (SPM) does not consider pattern repetition and ignores the effects of utility (unit price or profit) and pattern length on user interest, a Top-k One-off high average Utility sequential Pattern mining (TOUP) algorithm was proposed. The TOUP algorithm mainly includes two core steps: average utility calculation and candidate pattern generation. Firstly, a CSP (Calculation Support of Pattern) algorithm based on the occurrence position of each item and the item repetition relation array was proposed to calculate pattern support, thereby achieving rapid calculation of the average utility of patterns. Secondly, candidate patterns were generated by itemset extension and sequence extension, and a maximum average utility upper bound was proposed. Based on this upper bound, effective pruning of candidate patterns was achieved. Experimental results on five real datasets and one synthetic dataset show that compared to the TOUP-dfs and HAOP-ms algorithms, TOUP algorithm reduces the number of candidate patterns by 38.5% to 99.8% and 0.9% to 77.6%, respectively, and decreases the running time by 33.6% to 97.1% and 57.9% to 97.2%, respectively. Therefore, the algorithm performance of TOUP is better, and it can mine patterns of interests to users more efficiently.

Key words: data mining, sequential pattern mining, high average utility, one-off condition, top-k

中图分类号: