计算机应用 ›› 2013, Vol. 33 ›› Issue (03): 800-805.DOI: 10.3724/SP.J.1087.2013.00800

• 人工智能 • 上一篇    下一篇

基于通配符和长度约束的近似模式匹配算法

黄国林*,郭丹,胡学钢   

  1. 合肥工业大学 计算机与信息学院, 合肥 230009
  • 收稿日期:2012-09-04 修回日期:2012-11-02 出版日期:2013-03-01 发布日期:2013-03-01
  • 通讯作者: 黄国林
  • 作者简介:黄国林(1988-),男,甘肃白银人,硕士研究生,主要研究方向:数据挖掘; 郭丹(1983-),女,湖北潜江人,讲师,博士,主要研究方向:机器学习、数据挖掘; 胡学钢(1961-),男,安徽当涂人,教授,博士生导师,主要研究方向:知识工程、数据挖掘。
  • 基金资助:

    国家863计划项目(2012AA011005); 国家自然科学基金资助项目(61229301); 国家博士后科学基金资助项目(2012M511403); 中央高校基本科研基金资助项目(2010HGXJ0714)。

Algorithms for approximate pattern matching with wildcards and length constraints

HUANG Guolin*, GUO Dan, HU Xuegang   

  1. School of Computer and Information, Hefei University of Technology, Hefei Anhui 230009, China
  • Received:2012-09-04 Revised:2012-11-02 Online:2013-03-01 Published:2013-03-01
  • Contact: Guo-Lin HUANG

摘要: 针对近似模式匹配算法在处理带有灵活通配符和长度约束近似模式匹配(APMWL)问题时只能解决替换操作, 提出一种基于动态规划的编辑距离矩阵(EDM)构造方法,设计了基于EDM的近似模式匹配算法APM, 可以处理近似匹配中的三种编辑操作,即插入、替换和删除操作。此外,根据文本中字符是否允许被重复使用的约束条件,设计APM-OF算法。实验结果表明,APM和APM-OF与同类算法相比具备显著的优势:与Sail_Approx匹配算法实验对比, 获取解的平均增长率分别达到8.34%和12.37%; 将APM-OF算法应用至模式挖掘中, 挖掘出的频繁近似模式个数为OneoffMining算法的2.07倍。

关键词: 近似匹配, 通配符, 长度约束, 编辑距离矩阵, one-off条件

Abstract: Current works on the Approximate Pattern Matching with Wildcards and Length constraints (APMWL) problem can only cope with replacement operation. This paper proposed an Edit Distance Matrix (EDM) method based on dynamic programming and the Approximate Pattern Matching with EDM (APM) algorithm. APM can handle all approximate operations including insertion, replacement and deletion. Moreover, this paper extended APM to the APM-OF algorithm with a strict constraint condition that each character can be used at most once for pattern matching in a sequence. The experiments verify that both APM and APM-OF have significant advantages on matching solutions against other peers. The average improvement rates of matching compared to SAIL-Approx are up to 8.34% and 12.37% respectively. It also demonstrates an advantage on approximate pattern mining that the number of approximate patterns mined by APM-OF is 2.07 times of that mined by OneoffMining.

Key words: approximate pattern matching, wildcard, length constraint, edit distance matrix, one-off condition

中图分类号: