• •    

(CRSSC-CWI-CGrC-3WD 2017+54)基于覆盖模式的频繁子树挖掘方法

夏英1,李洪旭2   

  1. 1. 重庆邮电大学 空间信息系统研究中心,重庆400065
    2. 重庆邮电大学计算机科学与技术学院空间信息系统研究中心
  • 收稿日期:2017-03-27 修回日期:2017-04-14 发布日期:2017-04-14
  • 通讯作者: 李洪旭

Frequent Subtree Mining Method Based on Coverage pattern

XIA Ying1,LI HongXu   

  • Received:2017-03-27 Revised:2017-04-14 Online:2017-04-14
  • Contact: LI HongXu

摘要: 无序树常用于半结构化数据建模,对其进行频繁子树挖掘有利于发现隐藏的知识。传统的频繁子树挖掘方法常常输出大规模且带有冗余信息的频繁子树,这样的输出结果会降低后续操作的效率。针对传统方法的不足,本文提出了MCRP(Mining Coverage Pattern)算法用于挖掘覆盖模式,该算法能有效减小输出频繁子树的规模及冗余信息。MCRP算法首先采用宽度孩子数编码对树进行编码,然后通过基于最大前缀编码序列的边扩展方式生成所有的候选子树,最后在频繁子树集和δ^'-覆盖概念的基础上输出覆盖模式集。与传统的挖掘频繁闭树模式和极大频繁树模式的算法相比,该算法能够在保留所有频繁子树信息的情况下输出更少的频繁子树,并且在处理效率上具有一定的优越性。

关键词: 无序树, 频繁子树, 最大前缀编码, 边扩展, 覆盖模式

Abstract: Unordered tree is widely used for semi-structured data modeling, frequent subtree mining on it has benefit for finding hidden knowledge. The traditional methods of mining frequent subtrees often output large-scale frequent subtrees with redundant information, such an output will reduce the efficiency of subsequent operations. In view of the shortcomings of traditional methods, this paper proposed a MCRP(Mining Coverage Pattern) algorithm for mining coverage pattern, the algorithm can effectively reduce the size and redundant information of the output frequent subtree. it firstly presents a tree coding rule according to the tree width and numbers of children, and then all candidate subtrees are generated by edge extension based on the maximum prefix coding sequence, finally a set of coverage patterns will be output on the basis of frequent subtrees and δ^'- coverage concept. Compared with the traditional algorithms for mining frequent closed tree patterns and maximal frequent tree patterns, the algorithm can output fewer frequent subtrees in the case of preserving all the frequent subtrees, and has a certain superiority in processing efficiency.

Key words: unordered tree, frequent subtree, maximum prefix coding, edge extension, coverage pattern

中图分类号: