Survey of code similarity detection technology

Journal of Computer Applications, 2024, Vol. 44, Issue (4): 1248-1258. DOI: 10.11772/j.issn.1001-9081.2023040551

Topic: Surveys

Xiangjie SUN¹,², Qiang WEI², Yisen WANG² (corresponding author), Jiang DU²

1. School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou Henan 450002, China
2. School of Cyberspace Security, Information Engineering University, Zhengzhou Henan 450001, China

Received: 2023-05-09 Revised: 2023-07-13 Accepted: 2023-07-14 Online: 2023-12-04 Published: 2024-04-10
Contact: Yisen WANG, 851067568@qq.com
About the authors: SUN Xiangjie, born in 1999 in Jiaozuo, Henan, M. S. candidate. His research interests include software composition analysis.
WEI Qiang, born in 1979 in Nanchang, Jiangxi, Ph. D., professor. His research interests include industrial control system security.
WANG Yisen, born in 1990 in Shenqiu, Henan, Ph. D., associate professor. His research interests include network security.
DU Jiang, born in 1990 in Zhengzhou, Henan, Ph. D. candidate. His research interests include binary code similarity.
Supported by:
National Key Research & Development Program (2019QY0502)

Abstract

Code reuse brings convenience to software development but also introduces security risks, such as accelerated vulnerability propagation and malicious code plagiarism. Code similarity detection technology calculates the degree of similarity between pieces of code by analyzing their lexical, syntactic, semantic and other information. It is one of the most effective techniques for identifying code reuse, and it is also a program security analysis technology that has developed rapidly in recent years. First, recent technical progress in code similarity detection was systematically reviewed and the existing techniques were classified: according to whether the target code is open source, they were divided into source code similarity detection and binary code similarity detection, and then subdivided further by programming language and instruction set. Second, the ideas and research results of each technique were summarized, successful applications of machine learning in code similarity detection were analyzed, and the advantages and disadvantages of existing techniques were discussed. Finally, the development trends of code similarity detection technology were presented to provide a reference for relevant researchers.

Key words: binary code similarity, source code similarity, cross-language code similarity, deep learning, code clone

CLC Number: TP311.5

Cite this article: Xiangjie SUN, Qiang WEI, Yisen WANG, Jiang DU. Survey of code similarity detection technology[J]. Journal of Computer Applications, 2024, 44(4): 1248-1258.

References

1	APACHE. Apache Log4j 2［EB/OL］. ［2023-04-27］. .
2	NVD. CVE-2021-44228［EB/OL］. ［2023-04-27］. .
3	PEREZ D， CHIBA S. Cross-language clone detection by learning over abstract syntax trees ［C］// Proceedings of the 2019 IEEE/ACM 16th International Conference on Mining Software Repositories. Piscataway： IEEE， 2019： 518-528. 10.1109/msr.2019.00078
4	ROY C K， CORDY J R. NICAD： accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization ［C］// Proceedings of the 2008 IEEE 16th International Conference on Program Comprehension. Piscataway： IEEE， 2008： 172-181. 10.1109/icpc.2008.41
5	Stanford. Moss： a system for detecting software similarity［EB/OL］. ［2023-04-27］. .
6	ALON U， ZILBERSTEIN M， LEVY O， et al. code2vec： learning distributed representations of code［J］. Proceedings of the ACM on Programming Languages， 2019， 3： No. 40. 10.1145/3290353
7	NAFI K W， KAR T S， ROY B， et al. CLCDSA： cross language code clone detection using syntactical features and API documentation ［C］// Proceedings of the 2019 34th IEEE/ACM International Conference on Automated Software Engineering. Piscataway： IEEE， 2019： 1026-1037. 10.1109/ase.2019.00099
8	BELLON S， KOSCHKE R， ANTONIOL G， et al. Comparison and evaluation of clone detection tools［J］. IEEE Transactions on Software Engineering， 2007， 33（9）： 577-591. 10.1109/tse.2007.70725
9	熊浩，晏海华，郭涛，等. 代码相似性检测技术：研究综述［J］. 计算机科学，2010， 37（8）：9-14. 10.3969/j.issn.1002-137X.2010.08.002
	XIONG H， YAN H H， GUO T， et al. Review of code similarity detection technology［J］. Computer Science， 2010， 37（8）：9-14. 10.3969/j.issn.1002-137X.2010.08.002
10	XU X， LIU C， FENG Q， et al. Neural network-based graph embedding for cross-platform binary code similarity detection ［C］// Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. New York： ACM， 2017： 363-376. 10.1145/3133956.3134018
11	DING S H H， DING B C M， CHARLAND P. Asm2Vec： boosting static representation robustness for binary clone search against code obfuscation and compiler optimization ［C］// Proceedings of the 2019 IEEE Symposium on Security and Privacy. Piscataway： IEEE， 2019： 472-489. 10.1109/sp.2019.00003
12	PEI K， XUAN Z， YANG J， et al. TREX： learning execution semantics from micro-traces for binary similarity ［EB/OL］. （2020-12-16）［2023-04-27］. . 10.1109/tse.2022.3231621
13	SAJNANI H， SAINI V， SVAJLENKO J， et al. SourcererCC： scaling code clone detection to big-code ［C］// Proceedings of the 2016 IEEE 38th International Conference on Software Engineering. Piscataway： IEEE， 2016： 1157-1168. 10.1145/2884781.2884877
14	LI L， FENG H， ZHUANG W， et al. CCLearner： a deep learning-based clone detection approach ［C］// Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution. Piscataway： IEEE， 2017： 249-260. 10.1109/icsme.2017.46
15	KIM S， WOO S， LEE H， et al. VUDDY： a scalable approach for vulnerable code clone discovery ［C］// Proceedings of the 2017 IEEE Symposium on Security and Privacy. Piscataway： IEEE， 2017： 595-614. 10.1109/sp.2017.62
16	WANG P， SVAJLENKO J， WU Y， et al. CCAligner： a token based large-gap clone detector ［C］// Proceedings of the 2018 IEEE/ACM 40th International Conference on Software Engineering. New York： ACM， 2018： 1066-1077. 10.1145/3180155.3180179
17	NAKAGAWA T， HIGO Y， KUSUMOTO S. NIL： large-scale detection of large-variance clones ［C］// Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York： ACM， 2021： 830-841. 10.1145/3468264.3468564
18	JIANG L， MISHERGHI G， SU Z， et al. DECKARD： scalable and accurate tree-based detection of code clones ［C］// Proceedings of the 29th International Conference on Software Engineering. Washington， DC： IEEE Computer Society， 2007： 96-105. 10.1109/icse.2007.30
19	ALON U， BRODY S， LEVY O， et al. code2seq： generating sequences from structured representations of code ［EB/OL］. （2019-02-21）［2023-04-02］. .
20	ZHANG J， WANG X， ZHANG H， et al. A novel neural source code representation based on abstract syntax tree ［C］// Proceedings of the 2019 IEEE/ACM 41st International Conference on Software Engineering. Piscataway： IEEE， 2019： 783-794. 10.1109/icse.2019.00086
21	HU Y， ZOU D， PENG J， et al. TreeCen： building tree graph for scalable semantic code clone detection ［C］// Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. New York： ACM， 2022： No. 109. 10.1145/3551349.3556927
22	ZHAO G， HUANG J. DeepSim： deep learning code functional similarity ［C］// Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York： ACM， 2018： 141-151. 10.1145/3236024.3236068
23	ZOU Y， BAN B， XUE Y， et al. CCGraph： a PDG-based code clone detector with approximate graph matching ［C］// Proceedings of the 2020 35th IEEE/ACM International Conference on Automated Software Engineering. New York： ACM， 2020： 931-942. 10.1145/3324884.3416541
24	WU Y， ZOU D， DOU S， et al. SCDetector： software functional clone detection based on semantic tokens analysis ［C］// Proceedings of the 2020 35th IEEE/ACM International Conference on Automated Software Engineering. New York： ACM， 2020： 821-833. 10.1145/3324884.3416562
25	XIAO Y， CHEN B， YU C， et al. MVP： detecting vulnerabilities using patch-enhanced vulnerability signatures ［C］// Proceedings of the 29th USENIX Security Symposium. Berkeley： USENIX Association， 2020： 1165-1182.
26	KANG W， SON B， HEO K. TRACER： signature-based static analysis for detecting recurring vulnerabilities ［C］// Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security. New York： ACM， 2022： 1695-1708. 10.1145/3548606.3560664
27	陈秋远，李善平，鄢萌，等. 代码克隆检测研究进展［J］. 软件学报， 2019， 30（4）： 962-980.
	CHEN Q Y， LI S P， YAN M， et al. Code clone detection： a literature review［J］. Journal of Software， 2019， 30（4）： 962-980.
28	FANG C， LIU Z， SHI Y. Functional code clone detection with syntax and semantics fusion learning ［C］// Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. New York： ACM， 2020：516-527. 10.1145/3395363.3397362
29	WU Y， FENG S， ZOU D. Detecting semantic code clones by building AST-based Markov chains model ［C］// Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. New York： ACM， 2022： No. 34.
30	NAFI K W， ROY B， ROY C K， et al. CroLSim： cross language software similarity detector using API documentation ［C］// Proceedings of the 2018 IEEE 18th International Working Conference on Source Code Analysis and Manipulation. Piscataway： IEEE， 2018： 139-148. 10.1109/scam.2018.00023
31	NAFI K W， ROY B， ROY C K， et al. A universal cross language software similarity detector for open source software categorization［J］. Journal of Systems and Software， 2020， 162： 110491. 10.1016/j.jss.2019.110491
32	ULLAH F， NAEEM M R， NAEEM H， et al. CroLSSim： cross-language software similarity detector using hybrid approach of LSA-based AST-MDrep features and CNN-LSTM model［J］. International Journal of Intelligent Systems， 2022， 37（9）： 5768-5795. 10.1002/int.22813
33	MATHEW G， STOLEE K T. Cross-language code search using static and dynamic analyses ［C］// Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York： ACM， 2021：205-217. 10.1145/3468264.3468538
34	CHEN B， ABEDJAN Z. Interactive cross-language code retrieval with auto-encoders ［C］// Proceedings of the 2021 36th IEEE/ACM International Conference on Automated Software Engineering. Piscataway： IEEE， 2021： 167-178. 10.1109/ase51524.2021.9678929
35	PINKU S N， MONDAL D， ROY C K， et al. Pathways to leverage transcompiler based data augmentation for cross-language clone detection ［C］// Proceedings of the 2023 IEEE/ACM 31st International Conference on Program Comprehension. Piscataway： IEEE， 2023： 169-180. 10.1109/icpc58990.2023.00031
36	WANG K， YAN M， ZHANG H， et al. Unified abstract syntax tree representation learning for cross-language program classification ［C］// Proceedings of the 2022 IEEE/ACM 30th International Conference on Program Comprehension. New York： ACM， 2022： 390-400. 10.1145/3524610.3527915
37	MIKOLOV T， CHEN K， CORRADO G， et al. Efficient estimation of word representations in vector space ［EB/OL］. （2013-01-16）［2023-04-27］. . 10.3126/jiee.v3i1.34327
38	FENG Z， GUO D， TANG D， et al. CodeBERT： a pre-trained model for programming and natural languages ［C］// Findings of the Association for Computational Linguistics： EMNLP 2020. Stroudsburg： ACL， 2020： 1536-1547. 10.18653/v1/2020.findings-emnlp.139
39	BUI N D Q， YU Y， JIANG L. InferCode： self-supervised learning of code representations by predicting subtrees ［C］// Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering. Piscataway： IEEE， 2021： 1186-1197. 10.1109/icse43902.2021.00109
40	LIN Z， LI G， ZHANG J， et al. XCode： towards cross-language code representation with large-scale pre-training ［J］. ACM Transactions on Software Engineering and Methodology， 2022， 31（3）： No. 52. 10.1145/3506696
41	TAO C， ZHAN Q， HU X， et al. C4： contrastive cross-language code clone detection ［C］// Proceedings of the 2022 IEEE/ACM 30th International Conference on Program Comprehension. New York： ACM， 2022： 413-424. 10.1145/3524610.3527911
42	YAHYA M A， KIM D-K. Cross-language source code clone detection using deep learning with InferCode ［EB/OL］. （2022-05-10）［2023-04-27］. . 10.3390/computers12010012
43	VASWANI A， SHAZEER N， PARMAR N， et al. Attention is all you need ［C］// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook： Curran Associates Inc.， 2017： 6000-6010.
44	LE Q， MIKOLOV T. Distributed representations of sentences and documents ［C］// Proceedings of the 31st International Conference on Machine Learning. New York： JMLR.org， 2014： 1188-1196.
45	MOU L， LI G， ZHANG L， et al. Convolutional neural networks over tree structures for programming language processing ［C］// Proceedings of the 30th AAAI Conference on Artificial Intelligence. Palo Alto： AAAI Press， 2016： 1287-1293. 10.1609/aaai.v30i1.10139
46	HINTON G， VINYALS O， DEAN J. Distilling the knowledge in a neural network ［EB/OL］. （2015-03-09）［2023-04-27］. .
47	HAQ I U， CABALLERO J. A survey of binary code similarity［J］. ACM Computing Surveys， 2021， 54（3）： No. 51. 10.1145/3446371
48	夏冰，庞建民，周鑫，等. 二进制代码相似性搜索研究进展［J］. 计算机应用， 2022， 42（4）： 985-998. 10.11772/j.issn.1001-9081.2021071267
	XIA B， PANG J M， ZHOU X， et al. Research progress on binary code similarity search［J］. Journal of Computer Applications， 2022， 42（4）： 985-998. 10.11772/j.issn.1001-9081.2021071267
49	Hex-Rays. State-of-the-art binary code analysis tools ［EB/OL］. （2021-07-08）［2023-04-27］. .
50	DAI H， DAI B， SONG L. Discriminative embeddings of latent variable models for structured data ［C］// Proceedings of the 33rd International Conference on Machine Learning. New York： JMLR.org， 2016： 2702-2711.
51	YU Z， CAO R， TANG Q， et al. Order Matters： semantic-aware neural networks for binary code similarity detection ［C］// Proceedings of the 34th AAAI Conference on Artificial Intelligence. Palo Alto： AAAI Press， 2020： 1145-1152. 10.1609/aaai.v34i01.5466
52	DEVLIN J， CHANG M-W， LEE K， et al. BERT： pre-training of deep bidirectional transformers for language understanding ［C］// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. Stroudsburg： ACL， 2019： 4171-4186. 10.18653/v1/n19-1423
53	GILMER J， SCHOENHOLZ S S， RILEY P F， et al. Neural message passing for quantum chemistry ［C］// Proceedings of the 34th International Conference on Machine Learning. New York： JMLR.org， 2017：1263-1272.
54	MENGIN E， ROSSI F. Binary diffing as a network alignment problem via belief propagation ［C］// Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering. Piscataway： IEEE， 2021： 967-978. 10.1109/ase51524.2021.9678782
55	KIM G， HONG S， FRANZ M， et al. Improving cross-platform binary analysis using representation learning via graph alignment ［C］// Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. New York： ACM， 2022： 151-163. 10.1145/3533767.3534383
56	KIPF T N， WELLING M. Semi-supervised classification with graph convolutional networks ［EB/OL］. （2016-09-09）［2023-04-27］. . 10.48550/arXiv.1609.02907
57	DUAN Y， LI X， WANG J， et al. DeepBinDiff： learning program-wide code representations for binary diffing ［C/OL］// Proceedings of the 2020 International Conference on Network and Distributed Systems Security Symposium ［2023-04-01］. . 10.14722/ndss.2020.24311
58	LI X， QU Y， YIN H. PalmTree： learning an assembly language model for instruction embedding ［C］// Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. New York： ACM， 2021： 3236-3251. 10.1145/3460120.3484587
59	WANG H， QU W， KATZ G， et al. jTrans： jump-aware Transformer for binary code similarity detection ［C］// Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis. New York： ACM， 2022： 1-13. 10.1145/3533767.3534367
60	SHALEV N， PARTUSH N. Binary similarity detection using machine learning ［C］// Proceedings of the 13th Workshop on Programming Languages and Analysis for Security. New York： ACM， 2018： 42-47. 10.1145/3264820.3264821
61	MASSARELLI L， DI LUNA G A， PETRONI F， et al. SAFE： self-attentive function embeddings for binary similarity ［C］// Proceedings of the 2019 International Conference on Detection of Intrusions and Malware， and Vulnerability Assessment. Cham： Springer， 2019： 309-329. 10.1007/978-3-030-22038-9_15
62	ZUO F， LI X， YOUNG P， et al. Neural machine translation inspired binary code similarity comparison beyond function pairs ［EB/OL］. （2018-08-08）［2023-04-27］. . 10.14722/ndss.2019.23492
63	REDMOND K， LUO L N， ZENG Q. A cross-architecture instruction embedding model for natural language processing-inspired binary code analysis ［EB/OL］. （2018-12-23）［2023-04-27］. . 10.14722/bar.2019.23057
64	ZHANG X， SUN W， PANG J， et al. Similarity metric method for binary basic blocks of cross-instruction set architecture ［C/OL］// Proceedings of the 2020 Workshop on Binary Analysis Research ［2023-04-27］. . 10.14722/bar.2020.23002
65	YANG J， FU C， LIU X， et al. Codee： a tensor embedding scheme for binary code search［J］. IEEE Transactions on Software Engineering， 2022， 48（7）：2224-2244. 10.1109/tse.2021.3056139
66	ULLAH S， OH H. BinDiff_NN： learning distributed representation of assembly for robust binary diffing against semantic differences［J］.IEEE Transactions on Software Engineering， 2022， 48（9）： 3442-3466. 10.1109/tse.2021.3093926
67	AHN S， AHN S， KOO H， et al. Practical binary code similarity detection with BERT-based transferable similarity learning ［C］// Proceedings of the 38th Annual Computer Security Applications Conference. New York： ACM， 2022： 361-374. 10.1145/3564625.3567975
68	MARCELLI A， GRAZIANO M， UGARTE-PEDRERO X. How machine learning is solving the binary function similarity problem ［C］// Proceedings of the 31st USENIX Security Symposium. Berkeley： USENIX Association， 2022：390-400.
69	KIM D， KIM E， CHA S K， et al. Revisiting binary code similarity analysis using interpretable feature engineering and lessons learned［J］. IEEE Transactions on Software Engineering， 2023， 49（4）：1661-1682. 10.1109/tse.2022.3187689
70	方磊，武泽慧，魏强. 二进制代码相似性检测技术综述［J］. 计算机科学，2021， 48（5）：1-8. 10.11896/jsjkx.200400085
	FANG L， WU Z H， WEI Q. Summary of binary code similarity detection techniques ［J］. Computer Science，2021，48（5）：1-8. 10.11896/jsjkx.200400085
71	YU Z， ZHENG W， WANG J， et al. CodeCMR： cross-modal retrieval for function-level binary source code matching ［C］// Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook： Curran Associates Inc.， 2020：3872-3883.
72	GUI Y， WAN Y， ZHANG H Y， et al. Cross-language binary-source code matching with intermediate representations ［C］// Proceedings of the 2022 IEEE International Conference on Software Analysis， Evolution and Reengineering. Washington， DC： IEEE Computer Society， 2022： 601-612. 10.1109/saner53432.2022.00077
73	JI Y， CUI L， HUANG H H. BugGraph： differentiating source-binary code similarity with graph triplet-loss network ［C］// Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security. New York： ACM， 2021：702-715. 10.1145/3433210.3437533

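Method	Approach	Advantages	Clone types detected
SourcererCC [13]	Optimized inverted index with filtering heuristics	Scales to large-scale clone detection	Type I, Type II, Type III
CCLearner [14]	Trains a classifier on tokens and uses the classifier for detection	First to train a neural network on tokens for similarity analysis	Type I, Type II, Type III
VUDDY [15]	Function-level granularity and length filtering to reduce the number of function signature comparisons	High scalability and accuracy	Type I, Type II
CCAligner [16]	Sliding windows and fuzzy matching	Good precision and recall	Type I, Type II, Type III
NIL [17]	Longest common subsequence algorithm	High precision in large-scale detection	Type I, Type II, Type III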
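
As a rough illustration of the token-based approaches in the table above, the sketch below scores two snippets by the longest common subsequence (LCS) of their token streams, the kind of measure the table attributes to NIL. It is a minimal sketch rather than NIL's implementation: the use of Python's `tokenize` module as the lexer, the function names (`lex`, `lcs_length`, `token_similarity`), the `ID` placeholder for identifiers, and the normalization by the shorter token sequence are all assumptions made for this example.

```python
import io
import keyword
import tokenize

# Token kinds that carry layout or comments rather than code content.
_SKIPPED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def lex(source, normalize_names=False):
    """Return the token strings of a Python snippet, ignoring layout and comments.

    With normalize_names=True every non-keyword identifier becomes "ID"
    (a hypothetical placeholder), so renamed variables compare as equal.
    """
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIPPED:
            continue
        if (normalize_names and tok.type == tokenize.NAME
                and not keyword.iskeyword(tok.string)):
            tokens.append("ID")
        else:
            tokens.append(tok.string)
    return tokens


def lcs_length(a, b):
    """Length of the longest common subsequence, using O(len(b)) memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def token_similarity(src_a, src_b, normalize_names=False):
    """LCS length divided by the shorter token count (an illustrative choice)."""
    a = lex(src_a, normalize_names)
    b = lex(src_b, normalize_names)
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / min(len(a), len(b))


original = "def f(x):\n    return x + 1\n"
reformatted = "def f(x):  # add one\n    return x+1\n"
renamed = "def g(y):\n    return y + 1\n"

print(token_similarity(original, reformatted))                 # 1.0
print(token_similarity(original, renamed))                     # 0.7
print(token_similarity(original, renamed, normalize_names=True))  # 1.0
```

Differences in whitespace and comments (Type I) do not change the score, while renamed identifiers (Type II) only score 1.0 once names are normalized. Real detectors add indexing or filtering before this pairwise step, since comparing every pair with LCS does not scale on its own.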

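Method	Approach	Advantages	Clone types detected
DECKARD [18]	Converts ASTs into vectors and matches them with locality-sensitive hashing	Applicable to any programming language with a formal grammar	Type I, Type II, Type III
code2vec [6]	Converts code into AST paths and trains code representations with a neural network	Code representations generalize well even in complex cases	Type I, Type II, Type III
code2seq [19]	Converts code into AST paths and encodes them with an LSTM	Better code representations than code2vec	Type I, Type II, Type III
ASTNN [20]	Splits large ASTs and trains code representations with a bidirectional RNN	Supports batch processing of code with high accuracy	Type I, Type II, Type III, Type IV
TreeCen [21]	Converts the AST into a tree graph and then into a vector, classified with an SVM	Preserves structural information effectively and runs efficiently	Type I, Type II, Type III, Type IV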
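
To make the "AST to vector" idea in the table above concrete, the sketch below counts AST node types to form a characteristic vector and compares two vectors with cosine similarity. It is only a simplified sketch in the spirit of DECKARD: plain cosine similarity is swapped in for locality-sensitive hashing, whole-file vectors replace per-subtree vectors, Python's `ast` module stands in for a language-specific parser, and the function names are hypothetical.

```python
import ast
import math
from collections import Counter


def characteristic_vector(source):
    """Count how often each AST node type occurs in a Python snippet."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def ast_similarity(src_a, src_b):
    """Similarity of two snippets based only on their AST node-type counts."""
    return cosine_similarity(characteristic_vector(src_a),
                             characteristic_vector(src_b))


original = "def f(x):\n    return x + 1\n"
renamed = "def total(value):\n    return value + 1\n"
unrelated = "for i in range(3):\n    print(i)\n"

# Renaming identifiers leaves the node-type counts unchanged, so the first
# score is (up to floating-point rounding) 1; the second is clearly lower.
print(ast_similarity(original, renamed))
print(ast_similarity(original, unrelated))
```

Because identifier names never enter the vector, this representation is insensitive to renaming by construction, but it also discards ordering and nesting, which is part of why the later methods in the table learn representations from AST paths or subtrees instead.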

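Method	Approach	Advantages	Clone types detected
DeepSim [22]	Processes control flow and data flow into a semantic feature matrix and learns from it	End-to-end method with high scalability	Type I, Type II, Type III, Type IV
CCGraph [23]	Converts code into PDGs and performs graph matching	Efficiently detects high-level clones	Type I, Type II, Type III, Type IV
SCDetector [24]	Combines tokens with the CFG and detects clones with a Siamese architecture	Substantially shorter detection time than traditional graph-based methods	Type I, Type II, Type III, Type IV
MVP [25]	Uses program slicing to extract syntactic and semantic features of vulnerabilities and generate signatures	Effectively detects recurring vulnerabilities	Type I, Type II, Type III
TRACER [26]	Obtains vulnerable paths through taint analysis to generate signatures	High analysis efficiency and scalability	Detects similar vulnerabilities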

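Method	Processed content	Supported languages
CLCDSA [7]	API documentation similarity and AST feature values	Java, Python, C#
CroLSim [30]	API description information	Java, Python, C#
CroLSim [31]	API and method descriptions	Java, Python, C#, C
CroLSSim [32]	AST features	Java, C#, C++
COSAL [33]	AST tree structure and I/O information	Java, Python
BIGPT [34]	AST-related tree structures and textual features	Java, Python, C++
Method of Ref. [35]	Source code and AST representations	Java, Python
UAST [36]	AST sequences and AST graph structure	Java, Python, C/C++, JavaScript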

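Method	Extracted features	Model	Cross-compiler	Cross-optimization	Cross-architecture	Obfuscation-resistant
Asm2Vec [11]	Assembly instructions, CFG	PV-DM	×	√	×	√
SAFE [61]	Assembly instructions	word2vec, RNN, Siamese Network	√	√	×	×
INNEREYE [62]	Assembly instructions, CFG	word2vec, LSTM, Siamese Network	×	√	√	×
Method of Ref. [63]	Assembly instructions	CBOW	×	√	√	×
MIRROR [64]	Assembly instructions, basic blocks	Transformer	×	√	√	×
Order Matters [51]	CFG, basic blocks	CNN, BERT, MPNN	√	√	√	×
DeepBinDiff [57]	Assembly instructions, CFG, basic blocks	word2vec, TADW	×	√	×	×
TREX [12]	Assembly instructions	word2vec, Transformer, LSTM	×	√	√	√
Codee [65]	Assembly instructions, CFG	Skip-gram	√	√	√	√
BinDiff_NN [66]	Assembly instructions	Attention, Siamese Network	×	×	×	×
QBinDiff [54]	CFG, CG	Graph edit distance	×	×	×	×
PalmTree [58]	Assembly instructions	BERT	√	√	√	×
jTrans [59]	Assembly instructions	BERT	√	√	×	×
XBA [55]	Binary disassembly graph	GCN	×	√	√	×
BINSHOT [67]	Assembly instructions	BERT, Siamese Network	√	√	×	×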
