《计算机应用》唯一官方网站 ›› 2023, Vol. 43 ›› Issue (7): 2200-2208.DOI: 10.11772/j.issn.1001-9081.2022060907

• 数据科学与技术 • 上一篇    

基于数据索引结构的跨级高效用项集挖掘算法

蒋华, 李星(), 王慧娇, 韦静海   

  1. 广西可信软件重点实验室(桂林电子科技大学),广西 桂林 541004
  • 收稿日期:2022-06-22 修回日期:2022-09-03 接受日期:2022-09-09 发布日期:2022-10-08 出版日期:2023-07-10
  • 通讯作者: 李星
  • 作者简介:蒋华(1963—),男,河南信阳人,教授,博士,主要研究方向:数据库系统、信息安全;
    李星(1997—),女,河南南阳人,硕士研究生,主要研究方向:模式挖掘;
    王慧娇(1976—),女,辽宁铁岭人,副教授,硕士,CCF会员,主要研究方向:网络安全、智能计算;
    韦静海(1997—),女,广西贵港人,硕士研究生,主要研究方向:数据挖掘。
  • 基金资助:
    广西科技重大专项(AA22068072);广西可信软件重点实验室项目(KX202056)

Cross-level high utility itemsets mining algorithm based on data index structure

Hua JIANG, Xing LI(), Huijiao WANG, Jinghai WEI   

  1. Guangxi Key Laboratory of Trusted Software (Guilin University of Electronic Technology),Guilin Guangxi 541004,China
  • Received:2022-06-22 Revised:2022-09-03 Accepted:2022-09-09 Online:2022-10-08 Published:2023-07-10
  • Contact: Xing LI
  • About author:JIANG Hua, born in 1963, Ph. D., professor. His research interests include database system, information security.
    LI Xing, born in 1997, M. S. candidate. Her research interests include pattern mining.
    WANG Huijiao, born in 1976, M. S., associate professor. Her research interests include network security, intelligent computing.
    WEI Jinghai, born in 1997, M. S. candidate. Her research interests include data mining.
  • Supported by:
    Science and Technology Major Project of Guangxi(GuiKe AA22068072);Project of Guangxi Key Laboratory of Trusted Software(KX202056)

摘要:

针对现有的跨级高效用项集挖掘(HUIM)算法非常耗时且占用大量内存的问题,提出一种基于数据索引结构的跨级高效用项集挖掘算法(DISCH)。首先,为了高效存储和快速检索到搜索空间中的所有项集,拓展带有分类信息和索引信息的效用链表为数据索引结构(DIS);然后,为了提高内存利用率,对不满足条件的效用链表所占的内存进行回收再分配;最后,在构建效用链表时使用提前结束策略,以减少效用链表的产生。基于真实零售数据集和合成数据集进行的实验结果表明,与CLH-Miner (Cross-Level High utility itemsets Miner)算法相比,DISCH在运行时间上平均降低了77.6%,同时在内存消耗上平均降低了73.3%,可见该算法能高效完成跨级高效用项集的搜索,并且降低算法的内存消耗。

关键词: 数据挖掘, 高效用项集挖掘, 分类关系, 索引链表, 重用内存

Abstract:

The existing cross-level High Utility Itemsets Mining (HUIM) algorithms consume a lot of time and occupy large amounts of memory. To address these problems, a Data Index Structure Cross-level High utility itemsets mining (DISCH) algorithm was proposed. Firstly, the utility list with taxonomic information and index information was expanded into Data Index Structure (DIS) to efficiently store and quickly retrieve all itemsets in the search space. Then, in order to improve the memory utilization, the memory occupied by the utility lists that do not meet the conditions was reclaimed and reallocated. Finally, during the construction of utility list, early termination strategy was used to reduce the generation of utility list. Experimental results based on real retail datasets and synthetic datasets show that compared with the CLH-Miner (Cross-Level High utility itemsets Miner) algorithm, DISCH reduces the running time by 77.6% on average and the memory consumption by 73.3% on average. Therefore, the proposed algorithm can search the cross-level high utility itemsets efficiently and reduce the memory consumption of algorithm.

Key words: data mining, High Utility Itemsets Mining (HUIM), taxonomy, index list, reuse memory

中图分类号: