Journal of Computer Applications, 2024, Vol. 44, Issue (4): 1248-1258. DOI: 10.11772/j.issn.1001-9081.2023040551

• Computer software technology •

Survey of code similarity detection technology

SUN Xiangjie1,2, WEI Qiang2, WANG Yisen2, DU Jiang2

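  1. School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450002
  2. School of Cyberspace Security, Information Engineering University, Zhengzhou 450001
  • Received: 2023-05-09 Revised: 2023-07-13 Accepted: 2023-07-14 Online: 2023-12-04 Published: 2024-04-10
  • Corresponding author: WANG Yisen
  • About the authors: SUN Xiangjie (born in 1999), male, from Jiaozuo, Henan, M.S. candidate. Main research interest: software composition analysis.
    WEI Qiang (born in 1979), male, from Nanchang, Jiangxi, professor, Ph.D. Main research interest: industrial control system security.
    WANG Yisen (born in 1990), male, from Shenqiu, Henan, associate professor, Ph.D. Main research interest: network security. 851067568@qq.com
    DU Jiang (born in 1990), male, from Zhengzhou, Henan, Ph.D. candidate. Main research interest: binary code similarity.
  • Supported by:
    National Key Research and Development Program project (2019QY0502)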

Survey of code similarity detection technology

Xiangjie SUN1,2, Qiang WEI2, Yisen WANG2, Jiang DU2

  1. School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou Henan 450002, China
  2. School of Cyberspace Security, Information Engineering University, Zhengzhou Henan 450001, China
  • Received: 2023-05-09 Revised: 2023-07-13 Accepted: 2023-07-14 Online: 2023-12-04 Published: 2024-04-10
  • Contact: Yisen WANG
  • About the authors: SUN Xiangjie, born in 1999, M.S. candidate. His research interests include software composition analysis.
    WEI Qiang, born in 1979, Ph.D., professor. His research interests include industrial control system security.
    WANG Yisen, born in 1990, Ph.D., associate professor. His research interests include network security.
    DU Jiang, born in 1990, Ph.D. candidate. His research interests include binary code similarity.
  • Supported by:
    National Key Research & Development Program (2019QY0502)

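Abstract:

While code reuse makes software development more convenient, it also introduces security risks such as faster vulnerability propagation and malicious code plagiarism. Code similarity detection, which computes the degree of similarity between pieces of code by analyzing their lexical, syntactic, and semantic information, is one of the most effective techniques for identifying code reuse and a program security analysis technique that has developed rapidly in recent years. First, recent technical progress in code similarity detection is systematically reviewed: according to whether the target code is open source, the techniques are divided into source code similarity detection and binary code similarity detection, and then subdivided by programming language and instruction set. Second, the approach and research results of each technique are summarized, successful applications of machine learning in code similarity detection are analyzed, and the strengths and weaknesses of existing techniques are discussed. Finally, development trends in code similarity detection are presented as a reference for researchers in the field.

Key words: binary code similarity, source code similarity, cross-language code similarity, deep learning, code clone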

Abstract:

Code reuse brings convenience to software development, but it also introduces security risks such as accelerated vulnerability propagation and malicious code plagiarism. Code similarity detection technology calculates the similarity between pieces of code by analyzing their lexical, syntactic, semantic and other information. It is one of the most effective technologies for identifying code reuse, and it is also a program security analysis technology that has developed rapidly in recent years. First, the latest technical progress in code similarity detection was systematically reviewed and the current technologies were classified: according to whether the target code was open source, they were divided into source code similarity detection and binary code similarity detection, and they were then further subdivided by programming language and instruction set. Then, the ideas and research results of each technology were summarized, successful applications of machine learning in code similarity detection were analyzed, and the advantages and disadvantages of existing technologies were discussed. Finally, the development trends of code similarity detection technology were presented to provide a reference for relevant researchers.

Key words: binary code similarity, source code similarity, cross-language code similarity, deep learning, code clone
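
As an illustration of the lexical level of analysis mentioned in the abstract, the following is a minimal sketch of a token-based similarity measure. It is not a method taken from the survey: the tokenizer, the small keyword set, the shingle length `k`, and the function names are all assumptions chosen for brevity. Identifiers and numeric literals are normalized to placeholders so that renamed variables do not lower the score, and similarity is the Jaccard index over token k-grams.

```python
import re

# Hypothetical, deliberately tiny keyword set; a real tool would use the
# full keyword list of the target language.
KEYWORDS = {"int", "return", "for", "if", "else", "while"}

# Identifiers, then integer literals, then any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Split code into tokens, replacing identifiers and numbers with placeholders."""
    tokens = []
    for tok in TOKEN_RE.findall(code):
        if re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in KEYWORDS:
            tokens.append("ID")   # normalize identifier names
        elif tok.isdigit():
            tokens.append("NUM")  # normalize integer literals
        else:
            tokens.append(tok)    # keywords and punctuation stay as they are
    return tokens


def shingles(tokens, k=3):
    """Return the set of k-grams (as tuples) over the token sequence."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def lexical_similarity(code_a, code_b, k=3):
    """Jaccard similarity of the token k-gram sets of two code fragments."""
    set_a = shingles(tokenize(code_a), k)
    set_b = shingles(tokenize(code_b), k)
    if not set_a and not set_b:
        return 1.0  # two fragments with no k-grams are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    original = "int add(int a, int b) { return a + b; }"
    renamed = "int sum(int x, int y) { return x + y; }"
    unrelated = "while (n) { n = n - 1; }"
    print(lexical_similarity(original, renamed))
    print(lexical_similarity(original, unrelated))
```

A purely lexical measure like this only sees surface tokens. The syntactic and semantic levels named in the abstract would need other program representations that this sketch does not attempt to model.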

CLC number: