Journal of Computer Applications, 2018, Vol. 38, Issue 6: 1726-1731. DOI: 10.11772/j.issn.1001-9081.2017112846

• Network and Communications •

基于最佳路径搜索的二进制协议格式关键词边界确定方法

闫小勇, 李青   

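  • Corresponding author: YAN Xiaoyong
  • About the authors: YAN Xiaoyong (born 1993), male, from Longxian, Shaanxi, is a master's student whose main research interests are data mining and protocol reverse analysis. LI Qing (born 1976), female, from Zhengding, Hebei, is an associate professor with a Ph.D. whose main research interests are protocol reverse analysis, visible light communication, wireless ad hoc networks and sensor networks.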

Method for determining boundaries of binary protocol format keywords based on optimal path search

YAN Xiaoyong, LI Qing   

  1. Information Engineering University, Zhengzhou, Henan 450001, China
  • Received: 2017-12-05  Revised: 2018-01-09  Online: 2018-06-10  Published: 2018-06-13

Abstract: To address the field segmentation problem in the reverse analysis of binary protocol message formats, an algorithm is proposed that takes format keywords as the target of reverse analysis and determines their boundaries optimally by combining an improved n-gram algorithm with an optimal path search algorithm. First, a position factor is introduced into the n-gram algorithm, yielding a format keyword boundary extraction algorithm based on iterative n-gram-position. This resolves two difficulties of the plain n-gram algorithm: choosing the value of n, and extracting candidate boundaries for format keywords that occur at fixed offsets. Then, a branch metric is defined from the boundary hit ratio of frequent items and the left and right branch information entropies, and constraint conditions are constructed from the difference in n-gram-position value change rate between keywords and non-keywords. On this basis, a format keyword boundary selection algorithm based on optimal path search is proposed, which determines the keyword boundaries jointly and optimally. Experiments on five protocol message datasets of different types (AIS1, AIS18, ICMP00, ICMP03 and NetBios) show that the proposed algorithm determines the format keyword boundaries of the different protocols accurately, with F values above 83% on every dataset. Compared with the classical algorithms Variance of the Distribution of Variances (VDV) and AutoReEngine, the proposed algorithm raises the F value by about 8 percentage points on average.

Keywords: binary protocol, format keyword, boundary determination, n-gram, optimal path search
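
The two ingredients named in the abstract, position-aware n-gram counting and left/right branch information entropy, can be illustrated with a minimal Python sketch. It is an assumption-laden illustration, not the authors' implementation: the function names, the `min_support` threshold and the byte-level granularity are all hypothetical, and the iterative procedure, the boundary hit ratio and the optimal path search itself are not reproduced here.

```python
from collections import Counter
from math import log2


def ngram_position_counts(messages, n):
    """Count every (offset, n-gram) pair across all messages.

    Tying each n-gram to its offset is what lets keywords at a fixed
    position stand out (hypothetical reading of "n-gram-position").
    """
    counts = Counter()
    for msg in messages:
        for offset in range(len(msg) - n + 1):
            counts[(offset, msg[offset:offset + n])] += 1
    return counts


def frequent_items(messages, n, min_support):
    """Keep (offset, n-gram) pairs present in at least min_support of the messages."""
    threshold = min_support * len(messages)
    counts = ngram_position_counts(messages, n)
    return {item: c for item, c in counts.items() if c >= threshold}


def branch_entropy(messages, offset, gram, side):
    """Shannon entropy (bits) of the byte directly left or right of gram at offset.

    Low entropy suggests the neighbour still belongs to the same field;
    high entropy suggests a field boundary (assumed interpretation).
    """
    n = len(gram)
    neighbours = Counter()
    for msg in messages:
        if msg[offset:offset + n] != gram:
            continue
        idx = offset - 1 if side == "left" else offset + n
        if 0 <= idx < len(msg):
            neighbours[msg[idx]] += 1
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum(c / total * log2(c / total) for c in neighbours.values())
```

On four toy messages that share the first two bytes and differ afterwards, `frequent_items(messages, 2, 0.9)` keeps only the pair at offset 0, and the right branch entropy of that pair is high while the entropy between its two bytes is zero, which is the kind of signal a boundary metric can build on.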

CLC number: