Journal of Computer Applications ›› 2013, Vol. 33 ›› Issue (08): 2370-2374.

• Typical applications • Previous Articles     Next Articles

Multi-character DFA-based high speed regular expression matching algorithm

HE Wei,GUO Yunfei,MO Han,HU Hongchao   

  1. China National Digital Switching System Engineering and Technological R&D Center, Zhengzhou Henan 450002, China
  • Received:2013-02-20 Revised:2013-03-12 Online:2013-09-11 Published:2013-08-01
  • Contact: HE Wei

基于多字符DFA的高速正则表达式匹配算法

贺炜,郭云飞,莫涵,扈红超   

  1. 国家数字交换系统工程技术研究中心,郑州 450002
  • 通讯作者: 贺炜
  • 作者简介:

    贺炜(1988-),男,山西吕梁人,硕士研究生,主要研究方向:网络安全中的特征匹配;
    郭云飞(1963-),男,河南郑州人,教授,博士生导师,主要研究方向:三网融合;
    莫涵(1988-),女,河南郑州人,硕士研究生,主要研究方向:可重构网络、组播传输;
    扈红超(1983-),男,河南郑州人,研究员,博士,主要研究方向:网络协议分析。

  • 基金资助:
    国家科技支撑计划项目;国家863计划项目

Abstract: Traditional Deterministic Finite Automata (DFA) based regular expression matching can only process one character per cycle, which is the speed bottleneck. A new algorithm named Multi-Character DFA (MC-DFA) was proposed for high throughput matching and precise positioning. It combined the one character transition in traditional DFA together to handle multi-character processing per cycle. A new transition matrix compress algorithm was also proposed to reduce the redundancy introduced by MC-DFA. The result demonstrates that MC-DFA can improve the throughput efficiently while requiring acceptable memory. For a set of 300 regexes, 8C-DFA obtains a throughput of 7.88Gb/s, memory usage less than 6MB and 19.24s preprocessing time, better than traditional methods.

Key words: regular expression, high throughput, multi-character, precise positioning, matrix compress

摘要: 基于确定性有限自动机(DFA)的传统正则表达式匹配方法存在单周期处理单字符的速度瓶颈。为提升处理速率,提出一种单周期处理多字符的匹配算法MC-DFA,该算法基于DFA实现,支持匹配位置的精确定位。MC-DFA将传统DFA中的单字符跳转合并为多字符跳转,实现了单周期处理多个输入字符。通过状态转移矩阵二阶压缩算法,MC-DFA分别对矩阵行内以及行间冗余进行消除,减少了内存使用。300条规则下,单周期处理8字符时,MC-DFA吞吐率能够达到7.88Gb/s,内存占用小于6MB,预处理时间为19.24s。实验结果表明,MC-DFA能够有效提升系统吞吐率,并且保证内存占用在可接受范围之内,性能优于现有正则表达式匹配算法。

关键词: 正则表达式, 高速, 多字符, 精确定位, 矩阵压缩

CLC Number: