Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (4): 1248-1258. DOI: 10.11772/j.issn.1001-9081.2023040551
Xiangjie SUN, Qiang WEI, Yisen WANG, Jiang DU
Received: 2023-05-09
Revised: 2023-07-13
Accepted: 2023-07-14
Online: 2023-12-04
Published: 2024-04-10
Contact: Yisen WANG
About author: SUN Xiangjie, born in 1999 in Jiaozuo, Henan, M. S. candidate. His research interests include software composition analysis.
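Abstract:
Code reuse makes software development more convenient, but it also introduces security risks, such as faster vulnerability propagation and malicious code plagiarism. Code similarity detection computes the degree of similarity between pieces of code by analyzing their lexical, syntactic, and semantic information. It is one of the most effective techniques for identifying code reuse and a program security analysis technique that has developed rapidly in recent years. Firstly, recent technical progress in code similarity detection is systematically reviewed. According to whether the target code is open source, code similarity detection techniques are divided into source code similarity detection and binary code similarity detection, and each category is further subdivided by programming language and instruction set. Secondly, the ideas and research results of each technique are summarized, successful applications of machine learning to code similarity detection are analyzed, and the advantages and disadvantages of existing techniques are discussed. Finally, the development trends of code similarity detection are given to provide a reference for researchers in the field.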
Xiangjie SUN, Qiang WEI, Yisen WANG, Jiang DU. Survey of code similarity detection technology[J]. Journal of Computer Applications, 2024, 44(4): 1248-1258.
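Method | Implementation idea | Advantage | Detected clone types |
---|---|---|---|
SourcererCC[13] | Optimized inverted index combined with filtering heuristics | Scales to large-scale clone detection | Types Ⅰ, Ⅱ, Ⅲ |
CCLearner[14] | Trains a classifier on tokens and uses the classifier for detection | First to train a neural network on tokens for similarity analysis | Types Ⅰ, Ⅱ, Ⅲ |
VUDDY[15] | Function-level granularity and length filtering to reduce the number of function signature comparisons | High scalability and accuracy | Types Ⅰ, Ⅱ |
CCAligner[16] | Sliding windows and fuzzy matching | Good precision and recall | Types Ⅰ, Ⅱ, Ⅲ |
NIL[17] | Longest common subsequence algorithm | High precision in large-scale detection | Types Ⅰ, Ⅱ, Ⅲ |
Tab. 1 Token-based detection methods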
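As a rough illustration of the token-based idea in Tab. 1, the sketch below compares two code fragments by the longest common subsequence (LCS) of their normalized token sequences, the core operation listed for NIL. It is a minimal sketch and not NIL's implementation: the regex tokenizer, the small `KEYWORDS` set, the `ID`/`NUM` placeholders, and the choice to normalize by the shorter sequence are all assumptions made for this example.

```python
import re

# Assumption: a tiny keyword set, enough for the example below.
KEYWORDS = {"int", "return", "for", "if", "else", "while"}


def tokenize(code):
    """Split code into tokens and replace identifiers and numbers with placeholders."""
    raw = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    tokens = []
    for tok in raw:
        if tok in KEYWORDS:
            tokens.append(tok)
        elif re.fullmatch(r"[A-Za-z_]\w*", tok):
            tokens.append("ID")   # renamed identifiers compare as equal
        elif tok.isdigit():
            tokens.append("NUM")  # changed literals compare as equal
        else:
            tokens.append(tok)
    return tokens


def lcs_length(a, b):
    """Length of the longest common subsequence, using one DP row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def token_similarity(code_a, code_b):
    """LCS length divided by the length of the shorter token sequence."""
    ta, tb = tokenize(code_a), tokenize(code_b)
    if not ta or not tb:
        return 0.0
    return lcs_length(ta, tb) / min(len(ta), len(tb))
```

With this normalization, two fragments that differ only in identifier names score 1.0, while inserted or deleted statements lower the score gradually rather than breaking the match outright.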
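Method | Implementation idea | Advantage | Detected clone types |
---|---|---|---|
DECKARD[18] | Converts the AST into vectors and matches them with locality-sensitive hashing | Applicable to any programming language with a formal grammar | Types Ⅰ, Ⅱ, Ⅲ |
code2vec[6] | Converts code into AST paths and trains code representations with a neural network | Code representations still generalize well in complex cases | Types Ⅰ, Ⅱ, Ⅲ |
code2seq[19] | Converts code into AST paths and encodes them with an LSTM | Better code representations than code2vec | Types Ⅰ, Ⅱ, Ⅲ |
ASTNN[20] | Splits large ASTs and trains code representations with a bidirectional RNN | Supports batch processing of code with high accuracy | Types Ⅰ, Ⅱ, Ⅲ, Ⅳ |
TreeCen[21] | Converts the AST into a tree graph and then into a vector, processed with an SVM | Retains structural information effectively and executes efficiently | Types Ⅰ, Ⅱ, Ⅲ, Ⅳ |
Tab. 2 Tree-based detection methods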
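The tree-based idea in Tab. 2 can be sketched in the spirit of DECKARD: turn each AST into a vector of node-type counts and compare the vectors. This is only an illustration under stated assumptions. It uses Python's standard `ast` module on Python source, counts node types over the whole tree, and compares vectors with plain cosine similarity instead of the locality-sensitive hashing step listed in the table; the function names are hypothetical.

```python
import ast
import math
from collections import Counter


def characteristic_vector(source):
    """Count how often each AST node type occurs in the parsed source."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def ast_similarity(source_a, source_b):
    """Similarity of two code fragments based on their AST node-type counts."""
    return cosine(characteristic_vector(source_a), characteristic_vector(source_b))
```

Because identifier names never enter the vector, renamed clones map to the same vector; a pairwise cosine comparison like this one does not scale, which is the problem the hashing step in a real system is meant to address.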
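Method | Implementation idea | Advantage | Detected content |
---|---|---|---|
DeepSim[22] | Processes control flow and data flow into a semantic feature matrix and learns from it | End-to-end method with high scalability | Types Ⅰ, Ⅱ, Ⅲ, Ⅳ |
CCGraph[23] | Converts code into a PDG and performs graph matching | Efficiently detects high-level clones | Types Ⅰ, Ⅱ, Ⅲ, Ⅳ |
SCDetector[24] | Combines tokens with the CFG and detects with a Siamese architecture | Substantially less detection time than traditional graph-based methods | Types Ⅰ, Ⅱ, Ⅲ, Ⅳ |
MVP[25] | Uses program slicing to extract syntactic and semantic features of vulnerabilities and generate signatures | Effectively detects recurring vulnerabilities | Types Ⅰ, Ⅱ, Ⅲ |
TRACER[26] | Obtains vulnerable paths through taint analysis and generates signatures from them | High analysis efficiency and scalability | Similar vulnerabilities |
Tab. 3 Semantics-based detection methods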
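Method | Processed content | Supported languages |
---|---|---|
CLCDSA[7] | API documentation similarity and AST feature values | Java, Python, C# |
CroLSim[30] | API description information | Java, Python, C# |
CroLSim[31] | API and method descriptions | Java, Python, C#, C |
CroLSSim[32] | AST features | Java, C#, C++ |
COSAL[33] | AST tree structure and I/O information | Java, Python |
BIGPT[34] | AST-related tree structure and textual features | Java, Python, C++ |
Ref. [35] | Source code and AST representation | Java, Python |
UAST[36] | AST sequences and AST graph structure | Java, Python, C/C++, JavaScript |
Tab. 4 Detection methods based on code information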
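Method | Processed content | Supported languages |
---|---|---|
C4[41] | Pre-training with CodeBERT combined with contrastive learning | Java, Python, C#, C++ |
Ref. [42] | Pre-training with InferCode combined with a Siamese neural network | Java, Python |
XCode[40] | Variational autoencoder and teacher-student knowledge distillation | Java, Python, C#, C++ |
Tab. 5 Detection methods based on code vectorization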
Method | Extracted features | Model | Cross-compiler | Cross-optimization | Cross-architecture | Obfuscation-resistant |
---|---|---|---|---|---|---|
Asm2Vec[11] | Assembly instructions, CFG | PV-DM | × | √ | × | √ |
SAFE[61] | Assembly instructions | word2vec, RNN, Siamese Network | √ | √ | × | × |
INNEREYE[62] | Assembly instructions, CFG | word2vec, LSTM, Siamese Network | × | √ | √ | × |
Ref. [63] | Assembly instructions | CBOW | × | √ | √ | × |
MIRROR[64] | Assembly instructions, basic blocks | Transformer | × | √ | √ | × |
Order Matters[51] | CFG, basic blocks | CNN, BERT, MPNN | √ | √ | √ | × |
DeepBinDiff[57] | Assembly instructions, CFG, basic blocks | word2vec, TADW | × | √ | × | × |
TREX[12] | Assembly instructions | word2vec, Transformer, LSTM | × | √ | √ | √ |
Codee[65] | Assembly instructions, CFG | Skip-gram | √ | √ | √ | √ |
BinDiffNN[66] | Assembly instructions | Attention, Siamese Network | × | × | × | × |
QBinDiff[54] | CFG, CG | Graph edit distance | × | × | × | × |
PalmTree[58] | Assembly instructions | BERT | √ | √ | √ | × |
jTrans[59] | Assembly instructions | BERT | √ | √ | × | × |
XBA[55] | Binary disassembly graph | GCN | × | √ | √ | × |
BINSHOT[67] | Assembly instructions | BERT, Siamese Network | √ | √ | × | × |
Tab. 6 Binary code similarity detection methods
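Several methods in Tab. 6 pair an instruction-level encoder with a Siamese Network: both functions pass through the same encoder, and the two resulting vectors are compared. The sketch below shows only that comparison structure. The encoder here is a hand-written opcode counter standing in for a learned model (RNN, LSTM, or BERT in the table), and the function names and the input format (one assembly instruction per string) are assumptions for illustration.

```python
import math
from collections import Counter


def encode(function_asm):
    """Shared encoder (toy stand-in for a learned model): opcode-count vector."""
    return Counter(ins.split()[0] for ins in function_asm if ins.strip())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def siamese_similarity(func_a, func_b):
    """Apply the same encoder to both inputs, then compare the two embeddings."""
    return cosine(encode(func_a), encode(func_b))
```

An opcode counter ignores operands and instruction order, so it cannot be expected to hold up across compilers, optimization levels, or architectures; those are the properties that the learned encoders compared in the table are designed to address.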
1 | APACHE. Apache Log4j 2[EB/OL]. [2023-04-27]. . |
2 | NVD. CVE-2021-44228[EB/OL]. [2023-04-27]. . |
3 | PEREZ D, CHIBA S. Cross-language clone detection by learning over abstract syntax trees [C]// Proceedings of the 2019 IEEE/ACM 16th International Conference on Mining Software Repositories. Piscataway: IEEE, 2019: 518-528. 10.1109/msr.2019.00078 |
4 | ROY C K, CORDY J R. NICAD: accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization [C]// Proceedings of the 2008 IEEE 16th International Conference on Program Comprehension. Piscataway: IEEE, 2008: 172-181. 10.1109/icpc.2008.41 |
5 | Stanford. Moss: a system for detecting software similarity[EB/OL]. [2023-04-27]. . |
6 | ALON U, ZILBERSTEIN M, LEVY O, et al. code2vec: learning distributed representations of code[J]. Proceedings of the ACM on Programming Languages, 2019, 3: No. 40. 10.1145/3290353 |
7 | NAFI K W, KAR T S, ROY B, et al. CLCDSA: cross language code clone detection using syntactical features and API documentation [C]// Proceedings of the 2019 34th IEEE/ACM International Conference on Automated Software Engineering. Piscataway: IEEE, 2019: 1026-1037. 10.1109/ase.2019.00099 |
8 | BELLON S, KOSCHKE R, ANTONIOL G, et al. Comparison and evaluation of clone detection tools[J]. IEEE Transactions on Software Engineering, 2007, 33(9): 577-591. 10.1109/tse.2007.70725 |
9 | XIONG H, YAN H H, GUO T, et al. Review of code similarity detection technology[J]. Computer Science, 2010, 37(8):9-14. (in Chinese) 10.3969/j.issn.1002-137X.2010.08.002 |
10 | XU X, LIU C, FENG Q, et al. Neural network-based graph embedding for cross-platform binary code similarity detection [C]// Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. New York: ACM, 2017: 363-376. 10.1145/3133956.3134018 |
11 | DING S H H, DING B C M, CHARLAND P. Asm2Vec: boosting static representation robustness for binary clone search against code obfuscation and compiler optimization [C]// Proceedings of the 2019 IEEE Symposium on Security and Privacy. Piscataway: IEEE, 2019: 472-489. 10.1109/sp.2019.00003 |
12 | PEI K, XUAN Z, YANG J, et al. TREX: learning execution semantics from micro-traces for binary similarity [EB/OL]. (2020-12-16) [2023-04-27]. . 10.1109/tse.2022.3231621 |
13 | SAJNANI H, SAINI V, SVAJLENKO J, et al. SourcererCC: scaling code clone detection to big-code [C]// Proceedings of the 2016 IEEE 38th International Conference on Software Engineering. Piscataway: IEEE, 2016: 1157-1168. 10.1145/2884781.2884877 |
14 | LI L, FENG H, ZHUANG W, et al. CCLearner: a deep learning-based clone detection approach [C]// Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution. Piscataway: IEEE, 2017: 249-260. 10.1109/icsme.2017.46 |
15 | KIM S, WOO S, LEE H, et al. VUDDY: a scalable approach for vulnerable code clone discovery [C]// Proceedings of the 2017 IEEE Symposium on Security and Privacy. Piscataway: IEEE, 2017: 595-614. 10.1109/sp.2017.62 |
16 | WANG P, SVAJLENKO J, WU Y, et al. CCAligner: a token based large-gap clone detector [C]// Proceedings of the 2018 IEEE/ACM 40th International Conference on Software Engineering. New York: ACM, 2018: 1066-1077. 10.1145/3180155.3180179 |
17 | NAKAGAWA T, HIGO Y, KUSUMOTO S. NIL: large-scale detection of large-variance clones [C]// Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York: ACM, 2021: 830-841. 10.1145/3468264.3468564 |
18 | JIANG L, MISHERGHI G, SU Z, et al. DECKARD: scalable and accurate tree-based detection of code clones [C]// Proceedings of the 29th International Conference on Software Engineering. Washington, DC: IEEE Computer Society, 2007: 96-105. 10.1109/icse.2007.30 |
19 | ALON U, BRODY S, LEVY O, et al. code2seq: generating sequences from structured representations of code [EB/OL]. (2019-02-21) [2023-04-02]. . |
20 | ZHANG J, WANG X, ZHANG H, et al. A novel neural source code representation based on abstract syntax tree [C]// Proceedings of the 2019 IEEE/ACM 41st International Conference on Software Engineering. Piscataway: IEEE, 2019:783-794. 10.1109/icse.2019.00086 |
21 | HU Y, ZOU D, PENG J, et al. TreeCen: building tree graph for scalable semantic code clone detection [C]// Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. New York: ACM, 2022: No. 109. 10.1145/3551349.3556927 |
22 | ZHAO G, HUANG J. DeepSim: deep learning code functional similarity [C]// Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York: ACM, 2018: 141-151. 10.1145/3236024.3236068 |
23 | ZOU Y, BAN B, XUE Y, et al. CCGraph: a PDG-based code clone detector with approximate graph matching [C]// Proceedings of the 2020 35th IEEE/ACM International Conference on Automated Software Engineering. New York: ACM, 2020: 931-942. 10.1145/3324884.3416541 |
24 | WU Y, ZOU D, DOU S, et al. SCDetector: software functional clone detection based on semantic tokens analysis [C]// Proceedings of the 2020 35th IEEE/ACM International Conference on Automated Software Engineering. New York: ACM, 2020: 821-833. 10.1145/3324884.3416562 |
25 | XIAO Y, CHEN B, YU C, et al. MVP: detecting vulnerabilities using patch-enhanced vulnerability signatures [C]// Proceedings of the 29th USENIX Security Symposium. Berkeley: USENIX Association, 2020: 1165-1182. |
26 | KANG W, SON B, HEO K. TRACER: signature-based static analysis for detecting recurring vulnerabilities [C]// Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security. New York: ACM, 2022: 1695-1708. 10.1145/3548606.3560664 |
27 | CHEN Q Y, LI S P, YAN M, et al. Code clone detection: a literature review[J]. Journal of Software, 2019, 30(4): 962-980. (in Chinese) |
28 | FANG C, LIU Z, SHI Y. Functional code clone detection with syntax and semantics fusion learning [C]// Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. New York: ACM, 2020:516-527. 10.1145/3395363.3397362 |
29 | WU Y, FENG S, ZOU D. Detecting semantic code clones by building AST-based Markov chains model [C]// Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. New York: ACM, 2022: No. 34. |
30 | NAFI K W, ROY B, ROY C K, et al. CroLSim: cross language software similarity detector using API documentation [C]// Proceedings of the 2018 IEEE 18th International Working Conference on Source Code Analysis and Manipulation. Piscataway: IEEE, 2018: 139-148. 10.1109/scam.2018.00023 |
31 | NAFI K W, ROY B, ROY C K, et al. A universal cross language software similarity detector for open source software categorization[J]. Journal of Systems and Software, 2020, 162: 110491. 10.1016/j.jss.2019.110491 |
32 | ULLAH F, NAEEM M R, NAEEM H, et al. CroLSSim: cross-language software similarity detector using hybrid approach of LSA-based AST-MDrep features and CNN-LSTM model[J]. International Journal of Intelligent Systems, 2022, 37(9): 5768-5795. 10.1002/int.22813 |
33 | MATHEW G, STOLEE K T. Cross-language code search using static and dynamic analyses [C]// Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York: ACM, 2021:205-217. 10.1145/3468264.3468538 |
34 | CHEN B, ABEDJAN Z. Interactive cross-language code retrieval with auto-encoders [C]// Proceedings of the 2021 36th IEEE/ACM International Conference on Automated Software Engineering. Piscataway: IEEE, 2021: 167-178. 10.1109/ase51524.2021.9678929 |
35 | PINKU S N, MONDAL D, ROY C K, et al. Pathways to leverage transcompiler based data augmentation for cross-language clone detection [C]// Proceedings of the 2023 IEEE/ACM 31st International Conference on Program Comprehension. Piscataway: IEEE, 2023: 169-180. 10.1109/icpc58990.2023.00031 |
36 | WANG K, YAN M, ZHANG H, et al. Unified abstract syntax tree representation learning for cross-language program classification [C]// Proceedings of the 2022 IEEE/ACM 30th International Conference on Program Comprehension. New York: ACM, 2022: 390-400. 10.1145/3524610.3527915 |
37 | MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space [EB/OL]. (2013-01-16)[2023-04-27]. . 10.3126/jiee.v3i1.34327 |
38 | FENG Z, GUO D, TANG D, et al. CodeBERT: a pre-trained model for programming and natural languages [C]// Findings of the Association for Computational Linguistics: EMNLP 2020. Stroudsburg: ACL, 2020: 1536-1547. 10.18653/v1/2020.findings-emnlp.139 |
39 | BUI N D Q, YU Y, JIANG L. InferCode: self-supervised learning of code representations by predicting subtrees [C]// Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering. Piscataway: IEEE, 2021: 1186-1197. 10.1109/icse43902.2021.00109 |
40 | LIN Z, LI G, ZHANG J, et al. XCode: towards cross-language code representation with large-scale pre-training [J]. ACM Transactions on Software Engineering and Methodology, 2022, 31(3): No. 52. 10.1145/3506696 |
41 | TAO C, ZHAN Q, HU X, et al. C4: contrastive cross-language code clone detection [C]// Proceedings of the 2022 IEEE/ACM 30th International Conference on Program Comprehension. New York: ACM, 2022: 413-424. 10.1145/3524610.3527911 |
42 | YAHYA M A, KIM D-K. Cross-language source code clone detection using deep learning with InferCode [EB/OL]. (2022-05-10) [2023-04-27]. . 10.3390/computers12010012 |
43 | VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010. |
44 | LE Q, MIKOLOV T. Distributed representations of sentences and documents [C]// Proceedings of the 31st International Conference on Machine Learning. New York: JMLR.org, 2014: 1188-1196. |
45 | MOU L, LI G, ZHANG L, et al. Convolutional neural networks over tree structures for programming language processing [C]// Proceedings of the 30th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2016: 1287-1293. 10.1609/aaai.v30i1.10139 |
46 | HINTON G, VINYALS O, DEAN J. Distilling the knowledge in a neural network [EB/OL]. (2015-03-09)[2023-04-27]. . |
47 | HAQ I U, CABALLERO J. A survey of binary code similarity[J]. ACM Computing Surveys, 2021, 54(3): No. 51. 10.1145/3446371 |
48 | XIA B, PANG J M, ZHOU X, et al. Research progress on binary code similarity search[J]. Journal of Computer Applications, 2022, 42(4): 985-998. (in Chinese) 10.11772/j.issn.1001-9081.2021071267 |
49 | Hex-Rays. State-of-the-art binary code analysis tools [EB/OL]. (2021-07-08)[2023-04-27]. . |
50 | DAI H, DAI B, SONG L. Discriminative embeddings of latent variable models for structured data [C]// Proceedings of the 33rd International Conference on Machine Learning. New York: JMLR.org, 2016: 2702-2711. |
51 | YU Z, CAO R, TANG Q, et al. Order Matters: semantic-aware neural networks for binary code similarity detection [C]// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2020: 1145-1152. 10.1609/aaai.v34i01.5466 |
52 | DEVLIN J, CHANG M-W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding [C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. Stroudsburg: ACL, 2019: 4171-4186. 10.18653/v1/n19-1423 |
53 | GILMER J, SCHOENHOLZ S S, RILEY P F, et al. Neural message passing for quantum chemistry [C]// Proceedings of the 34th International Conference on Machine Learning. New York: JMLR.org, 2017:1263-1272. |
54 | MENGIN E, ROSSI F. Binary diffing as a network alignment problem via belief propagation [C]// Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering. Piscataway: IEEE, 2021:967-978. 10.1109/ase51524.2021.9678782 |
55 | KIM G, HONG S, FRANZ M, et al. Improving cross-platform binary analysis using representation learning via graph alignment [C]// Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. New York: ACM, 2022: 151-163. 10.1145/3533767.3534383 |
56 | KIPF T N, WELLING M. Semi-supervised classification with graph convolutional networks [EB/OL]. (2016-09-09)[2023-04-27]. . 10.48550/arXiv.1609.02907 |
57 | DUAN Y, LI X, WANG J, et al. DeepBinDiff: learning program-wide code representations for binary diffing [C/OL]// Proceedings of the 2020 International Conference on Network and Distributed Systems Security Symposium [2023-04-01]. . 10.14722/ndss.2020.24311 |
58 | LI X, QU Y, YIN H. PalmTree: learning an assembly language model for instruction embedding [C]// Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. New York: ACM, 2021: 3236-3251. 10.1145/3460120.3484587 |
59 | WANG H, QU W, KATZ G, et al. jTrans: jump-aware Transformer for binary code similarity detection [C]// Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. New York: ACM, 2022: 1-13. 10.1145/3533767.3534367 |
60 | SHALEV N, PARTUSH N. Binary similarity detection using machine learning [C]// Proceedings of the 13th Workshop on Programming Languages and Analysis for Security. New York: ACM, 2018: 42-47. 10.1145/3264820.3264821 |
61 | MASSARELLI L, DI LUNA G A, PETRONI F, et al. SAFE: self-attentive function embeddings for binary similarity [C]// Proceedings of the 2019 International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment. Cham: Springer, 2019:309-329. 10.1007/978-3-030-22038-9_15 |
62 | ZUO F, LI X, YOUNG P, et al. Neural machine translation inspired binary code similarity comparison beyond function pairs [EB/OL]. (2018-08-08) [2023-04-27]. . 10.14722/ndss.2019.23492 |
63 | REDMOND K, LUO L N, ZENG Q. A cross-architecture instruction embedding model for natural language processing-inspired binary code analysis [EB/OL]. (2018-12-23) [2023-04-27]. . 10.14722/bar.2019.23057 |
64 | ZHANG X, SUN W, PANG J, et al. Similarity metric method for binary basic blocks of cross-instruction set architecture [C/OL]// Proceedings of the 2020 Workshop on Binary Analysis Research [2023-04-27]. . 10.14722/bar.2020.23002 |
65 | YANG J, FU C, LIU X, et al. Codee: a tensor embedding scheme for binary code search[J]. IEEE Transactions on Software Engineering, 2022, 48(7):2224-2244. 10.1109/tse.2021.3056139 |
66 | ULLAH S, OH H. BinDiffNN: learning distributed representation of assembly for robust binary diffing against semantic differences[J].IEEE Transactions on Software Engineering, 2022, 48(9): 3442-3466. 10.1109/tse.2021.3093926 |
67 | AHN S, AHN S, KOO H, et al. Practical binary code similarity detection with BERT-based transferable similarity learning [C]// Proceedings of the 38th Annual Computer Security Applications Conference. New York: ACM, 2022: 361-374. 10.1145/3564625.3567975 |
68 | MARCELLI A, GRAZIANO M, UGARTE-PEDRERO X. How machine learning is solving the binary function similarity problem [C]// Proceedings of the 31st USENIX Security Symposium. Berkeley: USENIX Association, 2022:390-400. |
69 | KIM D, KIM E, CHA S K, et al. Revisiting binary code similarity analysis using interpretable feature engineering and lessons learned[J]. IEEE Transactions on Software Engineering, 2023, 49(4):1661-1682. 10.1109/tse.2022.3187689 |
70 | FANG L, WU Z H, WEI Q. Summary of binary code similarity detection techniques [J]. Computer Science,2021,48(5):1-8. (in Chinese) 10.11896/jsjkx.200400085 |
71 | YU Z, ZHENG W, WANG J, et al. CodeCMR: cross-modal retrieval for function-level binary source code matching [C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020:3872-3883. |
72 | GUI Y, WAN Y, ZHANG H, et al. Cross-language binary-source code matching with intermediate representations [C]// Proceedings of the 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering. Washington, DC: IEEE Computer Society, 2022:601-612. 10.1109/saner53432.2022.00077 |
73 | JI Y, CUI L, HUANG H H. BugGraph: differentiating source-binary code similarity with graph triplet-loss network [C]// Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security. New York: ACM, 2021:702-715. 10.1145/3433210.3437533 |