计算机应用 ›› 2014, Vol. 34 ›› Issue (9): 2612-2616.DOI: 10.11772/j.issn.1001-9081.2014.09.2612

• 数据技术 • 上一篇    下一篇

带有间隔约束的多序列模式挖掘

王华东,杨杰,李亚娟   

  1. 郑州轻工业学院 现代教育技术中心,郑州 450002
  • 收稿日期:2014-03-24 修回日期:2014-05-26 出版日期:2014-09-01 发布日期:2014-09-30
  • 通讯作者: 王华东
  • 作者简介: 
    王华东(1980-),男,河南泌阳人,讲师,硕士, 主要研究方向:数据挖掘;
    杨杰(1979-),男,河北石家庄人,工程师,硕士,主要研究方向:模式匹配与挖掘;
    李亚娟(1987-),女,河南济源人,硕士,主要研究方向:人工智能。
  • 基金资助:

    国家自然科学基金资助项目;河南省教育厅科学技术重点研究项目

Mining multiple sequential patterns with gap constraints

WANG Huadong,YANG Jie,LI Yajuan   

  1. Modern Educational Technology Center, Zhengzhou University of Light Industry, Zhengzhou Henan 450002, China
  • Received:2014-03-24 Revised:2014-05-26 Online:2014-09-01 Published:2014-09-30
  • Contact: WANG Huadong

摘要:

研究这样一个问题:给定多序列、支持度阈值和间隔约束,从多序列中挖掘所有出现次数不小于支持度阈值的频繁序列模式,这里要求模式中任意两个相邻元素在序列中的出现都要满足用户自定义的间隔约束,并且模式在序列中的出现要满足one-off条件。在解决该问题上,已有算法M-OneOffMine在计算模式的支持度时,只考虑模式的每个字符在序列中的首次出现,导致计算的模式支持度远小于其真实支持度,以致许多频繁的模式没有被挖掘出来。为此,设计了一个有效的带有间隔约束的多序列模式挖掘算法——MMSP算法:首先,通过采用二维表保存模式的候选位置;然后,根据候选位置采用最左最优的思想选择匹配位置。通过生物DNA序列进行实验,多序列中元素序列数目不变而序列长度变化时,MMSP挖掘出的频繁模式总数是同类算法M-OneOffMine的3.23倍;在元素序列个数变化时,MMSP挖掘出的频繁模式个数平均是M-OneOffMine的4.11倍;这两种情况下MMSP都有更好的时间性能。在模式长度变化时,MMSP挖掘出的频繁模式个数分别平均是M-OneOffMine的2.21倍和MPP的5.24倍。同时还验证了M-OneOffMine挖掘到的模式是MMSP挖掘到的频繁的子集。实验结果表明,MMSP算法不仅可以挖掘到更多的频繁模式,而且时间花费更少,更适合于实际的应用。

Abstract:

For the given multiple sequences, a certain threshold and the gap constraints, the study objective is to discover frequent patterns whose supports in multiple sequences are no less than the given threshold value, where any two successive elements of pattern fulfill the user-specified gap constraints, and any two occurrences of a pattern in a given sequence meet the one-off condition. To solve this problem, the existing algorithms only consider the first occurrence of each character of a pattern when they compute the support of a pattern in a given sequence, so that many frequent patterns are not mined. An efficient mining algorithm of multiple sequential patterns with gap constraints, named MMSP, was proposed. Firstly, it stored the candidate positions of a pattern using two-dimensional table, then it selected the position from the candidate positions according to the left-most strategy. The experiments were conducted on DNA sequences. The number of frequent patterns mined by MMSP was 3.23 times of that mined by the related algorithm named M-OneOffMine when the number of multiple sequence elements is constant and the sequence length changes, and the average number of mining patterns by MMSP was 4.11 times of that mined by M-OneOffMine when the number of multiple sequence elements changes. The average number of mined patterns by MMSP was 2.21 and 5.24 times of that mined by M-OneOffMine and MPP respectively when the number of multiple sequence elements changes, and the frequent patterns mined by M-OneOffMine was a subset of MMSP. The experimental results show that MMSP can mine more frequent patterns with shorter time, and it is more suitable for practical applications.

中图分类号: