Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (8): 2540-2547. DOI: 10.11772/j.issn.1001-9081.2021071166

• Computer software technology •

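Identifier obfuscation method based on low level virtual machine

Dajiang TIAN, Chengyang LI, Tianbo HUANG, Weiping WEN

  1. School of Software and Microelectronics, Peking University, Beijing 102600, China
  • Received: 2021-07-07 Revised: 2021-09-14 Accepted: 2021-09-18 Online: 2021-10-11 Published: 2022-08-10
  • Corresponding author: Weiping WEN
  • About the authors: TIAN Dajiang (born 1997), male, from Huanggang, Hubei, CCF member. Main research interest: code obfuscation;
    LI Chengyang (born 1996), male, from Linyi, Shandong, M.S. candidate. Main research interest: code obfuscation;
    HUANG Tianbo (born 1997), male, from Handan, Hebei, M.S. candidate. Main research interests: cyberspace security, malicious code detection, and code obfuscation;
    WEN Weiping (born 1976), male, from Yiyang, Hunan, professor, Ph.D. Main research interests: system and network security, big data and cloud security, and intelligent computing security.
  • Funding:
    Huawei-Peking University School-Enterprise Cooperation Project (2020001763)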

Identifier obfuscation method based on low level virtual machine

Dajiang TIAN, Chengyang LI, Tianbo HUANG, Weiping WEN

  1. School of Software and Microelectronics, Peking University, Beijing 102600, China
  • Received: 2021-07-07 Revised: 2021-09-14 Accepted: 2021-09-18 Online: 2021-10-11 Published: 2022-08-10
  • Contact: Weiping WEN
  • About author:TIAN Dajiang, born in 1997. His research interests include code obfuscation.
    LI Chengyang, born in 1996, M. S. candidate. His research interests include code obfuscation.
    HUANG Tianbo, born in 1997, M. S. candidate. His research interests include cyberspace security, malicious code detection, code obfuscation.
    WEN Weiping, born in 1976, Ph. D., professor. His research interests include system and network security, big data and cloud security, intelligent computing security.
  • Supported by:
    Huawei-Peking University School-Enterprise Cooperation Project (2020001763)

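Abstract:

Existing code obfuscation is confined to a particular programming language or platform and so lacks breadth and generality, while control flow obfuscation and data obfuscation introduce extra overhead. To address these problems, an identifier obfuscation method based on Low Level Virtual Machine (LLVM) is proposed. The method implements four identifier obfuscation algorithms (the random identifier algorithm, the overload induction algorithm, the abnormal identifier algorithm, and the high-frequency word replacement algorithm) and combines them into a new hybrid obfuscation algorithm. The method first selects, from the intermediate files produced by front-end compilation, the function names that satisfy the obfuscation conditions; it then processes these function names with a specific obfuscation algorithm; finally, a specific compiler back-end converts the obfuscated files into binary files. The LLVM-based identifier obfuscation method applies to the languages LLVM supports and does not affect normal program functionality. For different programming languages, the time overhead is within 20% and the space overhead shows almost no increase. The average obfuscation ratio of the programs is 77.5%, and theoretical analysis indicates that the proposed hybrid identifier algorithm offers stronger concealment than a single replacement algorithm or overload algorithm. Experimental results show that the proposed method has low performance overhead, strong concealment, and wide generality.

Keywords: software protection, code obfuscation, identifier obfuscation, Low Level Virtual Machine (LLVM), obfuscation method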

Abstract:

Most existing code obfuscation solutions are tied to a specific programming language or platform and therefore lack generality. Moreover, control flow obfuscation and data obfuscation introduce additional overhead. To address these problems, an identifier obfuscation method based on Low Level Virtual Machine (LLVM) was proposed. Four identifier obfuscation algorithms were implemented in the method: a random identifier algorithm, an overload induction algorithm, an abnormal identifier algorithm, and a high-frequency word replacement algorithm. A new hybrid obfuscation algorithm was also designed by combining these algorithms. The proposed method proceeds in three steps. First, the function names that meet the obfuscation criteria are selected from the intermediate files produced by the compiler front-ends. Second, these function names are processed by a specific obfuscation algorithm. Finally, the obfuscated files are transformed into binary files by a specific compiler back-end. The LLVM-based identifier obfuscation method applies to the languages supported by LLVM and does not affect the normal functionality of the program. For different programming languages, the time overhead is within 20% and the space overhead shows almost no increase. The average obfuscation ratio of the programs is 77.5%, and theoretical analysis indicates that the proposed hybrid identifier algorithm provides stronger concealment than a single replacement algorithm or overload algorithm. Experimental results show that the proposed method has low performance overhead, strong concealment, and wide generality.

Key words: software protection, code obfuscation, identifier obfuscation, Low Level Virtual Machine (LLVM), obfuscation method
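
The first two steps described in the abstract (selecting candidate function names from the intermediate file, then renaming them) can be illustrated with a minimal sketch of the random identifier idea. This is not the authors' implementation: it works on textual LLVM IR with regular expressions rather than as an LLVM pass, and the selection rule (every function defined in the module except `main`), the helper names, and the 12-character name length are all assumptions made for illustration. Quoted IR names such as `@"my name"` are not handled.

```python
import random
import re
import string

# Matches the symbol in a textual-IR function definition line, e.g.
# "define dso_local i32 @add(i32 %a, i32 %b) {". Declarations start with
# "declare", so external functions are never selected.
DEFINE_RE = re.compile(r'^define\b[^@\n]*@([A-Za-z_][\w.$]*)\s*\(', re.MULTILINE)


def candidate_functions(ir_text, keep=("main",)):
    """Assumed selection rule: functions defined in this module, minus `keep`."""
    return [name for name in DEFINE_RE.findall(ir_text) if name not in keep]


def random_identifier(rng, length=12):
    """Generate a random name that starts with a letter."""
    first = rng.choice(string.ascii_letters)
    rest = "".join(rng.choice(string.ascii_letters + string.digits)
                   for _ in range(length - 1))
    return first + rest


def obfuscate_function_names(ir_text, seed=None):
    """Rename every candidate function; return (new_ir_text, old_to_new_mapping)."""
    rng = random.Random(seed)
    # Every global symbol already present, so new names never collide.
    taken = set(re.findall(r'@([A-Za-z_][\w.$]*)', ir_text))
    mapping = {}
    for name in candidate_functions(ir_text):
        new_name = random_identifier(rng)
        while new_name in taken:
            new_name = random_identifier(rng)
        taken.add(new_name)
        mapping[name] = new_name
    if not mapping:
        return ir_text, mapping
    # Rewrite definitions and call sites in a single pass. The lookahead stops
    # "@add" from matching inside a longer symbol such as "@add2".
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r'@(' + "|".join(re.escape(n) for n in names) + r')(?![\w.$])')
    return pattern.sub(lambda m: "@" + mapping[m.group(1)], ir_text), mapping


SAMPLE_IR = """\
define i32 @add(i32 %a, i32 %b) {
  %s = add i32 %a, %b
  ret i32 %s
}

declare i32 @printf(i8*, ...)

define i32 @main() {
  %r = call i32 @add(i32 1, i32 2)
  ret i32 %r
}
"""

obfuscated_ir, name_mapping = obfuscate_function_names(SAMPLE_IR, seed=0)
```

In the sample, only `@add` is renamed; `@main`, the external `@printf`, and the `add` instruction are left untouched. The third step of the abstract, lowering the obfuscated IR to a binary with a compiler back-end, is outside the scope of this sketch.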

CLC number: