Journal of Computer Applications, 2024, Vol. 44, Issue (7): 2160-2167. DOI: 10.11772/j.issn.1001-9081.2023070992

• Computer software technology •

Binary code identification based on user system call sequences

Haixiang HUANG1, Shuanghe PENG1, Ziyu ZHONG2

  1. College of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  2. Faculty of Science, The Hong Kong University of Science and Technology, Hong Kong 999077, China
  • Received: 2023-07-24 Revised: 2023-09-13 Accepted: 2023-09-21 Online: 2023-10-26 Published: 2024-07-10
  • Contact: Shuanghe PENG
  • About the authors: HUANG Haixiang, born in 1999 in Jiujiang, Jiangxi, M.S. candidate. His research interests include binary reverse analysis and vulnerability mining.
    ZHONG Ziyu, born in 1999 in Jiujiang, Jiangxi, M.S. candidate. His research interests include operations research optimization and machine learning.
    First author contact: PENG Shuanghe, born in 1974 in Hengyang, Hunan, Ph.D., associate professor. Her research interests include trusted computing and binary reverse analysis.
  • Supported by:
    National Natural Science Foundation of China (62272028)

Abstract:

To address the low accuracy of binary code identification caused by compiler optimization, cross-compiler builds, and obfuscation, UstraceDiff, an identification scheme based on user system call sequences, was proposed and implemented. First, a dynamic binary instrumentation tool based on the Intel Pin framework was designed to extract the user system call sequences of binary code, together with their parameters, at run time. Second, the common sequence of the system call sequences of the two binaries under comparison was obtained through sequence alignment, and a valid parameter table was designed to select the valid system call parameters. Finally, an algorithm was proposed that combines the common sequence and the valid parameters to compute a homology score, which measures the similarity of the two binaries. UstraceDiff was evaluated on the Coreutils dataset under four different compilation conditions. The results show that the average accuracy of UstraceDiff in identifying homologous programs is 35.1 and 55.4 percentage points higher than that of BinDiff and DeepBinDiff, respectively, and that UstraceDiff also distinguishes non-homologous programs more effectively.

Key words: code identification, dynamic analysis, system call, program traceability, binary code similarity analysis
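
The three steps in the abstract (align two system call traces, keep only the valid parameters, combine both into a homology score) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not the paper's method: the abstract does not name the alignment algorithm, so a longest-common-subsequence alignment on syscall names stands in for it; the contents of `VALID_PARAMS`, the 1.0/0.5 weights, and the normalization in `homology_score` are hypothetical; and the traces are hand-written rather than produced by an Intel Pin tool.

```python
from typing import Dict, List, Set, Tuple

# One traced call: (syscall name, argument values rendered as strings).
Call = Tuple[str, Tuple[str, ...]]

# Hypothetical valid-parameter table: argument positions assumed to be stable
# across builds (paths, flags, descriptors), excluding buffer addresses.
VALID_PARAMS: Dict[str, Set[int]] = {"openat": {1, 2}, "read": {0}, "write": {0}}


def valid_args(call: Call) -> Tuple[str, ...]:
    """Keep only the arguments listed in VALID_PARAMS for this syscall."""
    name, args = call
    keep = VALID_PARAMS.get(name, set())
    return tuple(a for i, a in enumerate(args) if i in keep)


def common_sequence(a: List[Call], b: List[Call]) -> List[Tuple[Call, Call]]:
    """Align two traces by syscall name (LCS, an assumed stand-in alignment)."""
    n, m = len(a), len(b)
    # dp[i][j] = length of the LCS of a[i:] and b[j:], compared by name only.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i][0] == b[j][0]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if a[i][0] == b[j][0]:
            pairs.append((a[i], b[j]))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def homology_score(a: List[Call], b: List[Call]) -> float:
    """Hypothetical score in [0, 1]; the weights below are illustrative only."""
    if not a or not b:
        return 0.0
    pairs = common_sequence(a, b)
    # An aligned pair counts 1.0 if its valid arguments agree, otherwise 0.5.
    matched = sum(1.0 if valid_args(x) == valid_args(y) else 0.5 for x, y in pairs)
    return matched / max(len(a), len(b))


# Hand-written traces: the same program at two optimization levels differs
# only in buffer addresses, which the valid-parameter table ignores.
trace_o0 = [("openat", ("AT_FDCWD", "/etc/passwd", "O_RDONLY")),
            ("read", ("3", "0x7ffd1000", "4096")),
            ("write", ("1", "0x7ffd2000", "12")), ("close", ("3",))]
trace_o3 = [("openat", ("AT_FDCWD", "/etc/passwd", "O_RDONLY")),
            ("read", ("3", "0x7ffe9000", "4096")),
            ("write", ("1", "0x7ffe8000", "12")), ("close", ("3",))]
other = [("socket", ("AF_INET", "SOCK_STREAM", "0")),
         ("connect", ("4", "0x7ffc3000", "16")),
         ("write", ("4", "0x7ffc4000", "12")), ("close", ("4",))]

print(homology_score(trace_o0, trace_o3))  # 1.0
print(homology_score(trace_o0, other))     # 0.375
```

Comparing only the arguments named in the table is what lets the two builds of the same program score 1.0 despite different stack addresses, while the unrelated trace shares just `write` and `close` and scores much lower.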
