基于依赖增强的分层抽象语法树的代码克隆检测

doi:10.11772/j.issn.1001-9081.2023040485

《计算机应用》唯一官方网站 ›› 2024, Vol. 44 ›› Issue (4): 1259-1268.DOI: 10.11772/j.issn.1001-9081.2023040485

基于依赖增强的分层抽象语法树的代码克隆检测

万泽轩, 谢春丽(), 吕泉润, 梁瑶

江苏师范大学计算机科学与技术学院，江苏徐州 221116

收稿日期:2023-04-26 修回日期:2023-07-04 接受日期:2023-07-10 发布日期:2023-12-04 出版日期:2024-04-10
通讯作者: 谢春丽
作者简介:万泽轩（1998—），男，江苏徐州人，硕士研究生，主要研究方向：代码表征、代码克隆
谢春丽（1979—），女，江苏徐州人，副教授，博士，CCF会员，主要研究方向：智能软件、代码克隆、代码表示 6020030132@jsnu.edu.cn
吕泉润（1999—），男，江苏徐州人，硕士研究生，主要研究方向：代码表征、代码克隆
梁瑶（1997—），女，江苏徐州人，硕士研究生，CCF会员，主要研究方向：代码分析。
基金资助:
国家自然科学基金资助项目(62276119);江苏师范大学研究生科研与实践创新计划项目(2022XKT1530)

Code clone detection based on dependency enhanced hierarchical abstract syntax tree

Zexuan WAN, Chunli XIE(), Quanrun LYU, Yao LIANG

School of Computer Science and Technology，Jiangsu Normal University，Xuzhou Jiangsu 221116，China

Received:2023-04-26 Revised:2023-07-04 Accepted:2023-07-10 Online:2023-12-04 Published:2024-04-10
Contact: Chunli XIE
About author:WAN Zexuan， born in 1998， M. S. candidate. His research interests include code representation， code clone.
XIE Chunli， born in 1979， Ph. D.， associate professor. Her research interests include intelligent software， code clone， code representation.
LYU Quanrun， born in 1999， M. S. candidate. His research interests include code representation， code clone.
LIANG Yao， born in 1997， M. S. candidate. Her research interests include code analysis.
Supported by:
National Natural Science Foundation of China(62276119);Graduate Research and Practice Innovation Program of Jiangsu Normal University(2022XKT1530)

摘要/Abstract

摘要：

在软件工程领域，基于语义相似的代码克隆检测方法可以降低软件维护的成本并预防系统漏洞，抽象语法树（AST）作为典型的代码抽象表征形式，已成功应用于多种程序语言的代码克隆检测任务，然而现有工作主要利用原始AST提取代码的语义，没有深入挖掘AST中的深层语义和结构信息。针对上述问题，提出一种基于依赖增强的分层抽象语法树（DEHAST）的代码克隆检测方法。首先，对AST进行分层处理，将AST划分得到不同的语义层次；其次，为AST的不同层次添加相应的依赖增强边构建DEHAST，将简单的AST变成具有更丰富程序语义的异构图；最后，使用图匹配网络（GMN）模型检测异构图的相似性，实现代码克隆检测。在BigCloneBench和Google Code Jam两个数据集上的实验结果显示，DEHAST能够检测100%的Type-1和Type-2代码克隆、99%的Type-3代码克隆和97%的Type-4代码克隆；与基于树的方法ASTNN（AST-based Neural Network）相比，F1分数均提高了4个百分点，验证了DEHAST可以较好地完成代码语义克隆检测。

关键词: 代码克隆检测, 语义克隆, 抽象语法树, 深度学习, 图匹配网络

Abstract:

In the field of software engineering， code clone detection methods based on semantic similarity can reduce the cost of software maintenance and prevent system vulnerabilities. As a typical form of code abstract representation， Abstract Syntax Tree （AST） has achieved success in code clone detection tasks of many program languages. However， the existing work mainly uses the original AST to extract code semantics， and does not dig deep semantic and structural information in AST. To solve the above problem， a code clone detection method based on Dependency Enhanced Hierarchical Abstract Syntax Tree （DEHAST） was proposed. Firstly， the AST was layered and divided into different semantic levels. Secondly， corresponding dependency enhancement edges were added to different levels of AST to construct DEHAST， thus a simple AST was transformed into a heterogeneous graph with richer program semantics. Finally， a Graph Matching Network （GMN） model was used to detect the similarity of heterogeneous graphs to achieve code clone detection. Experimental results on two datasets BigCloneBench and Google Code Jam show that DEHAST is able to detect 100% of Type-1 and Type-2 code clones， 99% of Type-3 code clones， and 97% of Type-4 code clones； compared with the tree based method ASTNN （AST-based Neural Network）， the F1 values all increase by 4 percentage points. Therefore， DEHAST can effectively perform code semantic clone detection.

Key words: code clone detection, semantic clone, Abstract Syntax Tree (AST), deep learning, Graph Matching Network (GMN)

中图分类号:

TP311.5

万泽轩, 谢春丽, 吕泉润, 梁瑶. 基于依赖增强的分层抽象语法树的代码克隆检测[J]. 计算机应用, 2024, 44(4): 1259-1268.

Zexuan WAN, Chunli XIE, Quanrun LYU, Yao LIANG. Code clone detection based on dependency enhanced hierarchical abstract syntax tree[J]. Journal of Computer Applications, 2024, 44(4): 1259-1268.

图/表 11

参考文献 36

1	ROY C K， CORDY J R. A survey on software clone detection research： Technical Report No. 2007-541［R］. Ontario， Canada： Queen’s University at Kingston， School of Computing， 2007： 64-68.
2	KAMIYA T， KUSUMOTO S， INOUE K. CCFinder： a multilinguistic token-based code clone detection system for large scale source code［J］. IEEE Transactions on Software Engineering， 2002， 28（7）：654-670. 10.1109/tse.2002.1019480
3	LIU C， CHEN C， HAN J， et al. GPLAG： detection of software plagiarism by program dependence graph analysis［C］// Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York： ACM， 2006：872-881. 10.1145/1150402.1150522
4	WANG P， SVAJLENKO J， WU Y， et al. CCAligner： a token based large-gap clone detector［C］// Proceedings of the 40th International Conference on Software Engineering. New York： ACM， 2018：1066-1077. 10.1145/3180155.3180179
5	GABEL M， JIANG L， SU Z. Scalable detection of semantic clones［C］// Proceedings of the 30th International Conference on Software Engineering. New York： ACM， 2008： 321-330. 10.1145/1368088.1368132
6	ZHANG J， WANG X， ZHANG H， et al. A novel neural source code representation based on abstract syntax tree［C］// Proceedings of the 41st International Conference on Software Engineering. Piscataway： IEEE， 2019： 783-794. 10.1109/icse.2019.00086
7	WHITE M， TUFANO M， VENDOME C， et al. Deep learning code fragments for code clone detection［C］// Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. New York： ACM， 2016： 87-98. 10.1145/2970276.2970326
8	WANG M， WANG P， XU Y. CCSharp： an efficient three-phase code clone detector using modified PDGs［C］// Proceedings of the 2017 24th Asia-Pacific Software Engineering Conference. Piscataway： IEEE， 2017：100-109. 10.1109/apsec.2017.16
9	WEI H-H， LI M. Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code［C］// Proceedings of the 26th International Joint Conference on Artificial Intelligence. Palo Alto： AAAI Press， 2017： 3034-3040. 10.24963/ijcai.2017/423
10	TAI K S， SOCHER R， MANNING C D. Improved semantic representations from tree-structured long short-term memory networks［C］// Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Stroudsberg： ACL， 2015：1556-1566. 10.3115/v1/p15-1150
11	WANG W， LI G， MA B， et al. Detecting code clones with graph neural network and flow-augmented abstract syntax tree［C］// Proceedings of the 2020 IEEE 27th International Conference on Software Analysis， Evolution and Reengineering. Piscataway： IEEE， 2020： 261-271. 10.1109/saner48275.2020.9054857
12	ZHAO G， HUANG J. DeepSim： deep learning code functional similarity［C］// Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York： ACM， 2018： 141-151. 10.1145/3236024.3236068
13	LI Y， GU C， DULLIEN T， et al. Graph matching networks for learning the similarity of graph structured objects［C］// Proceedings of the 36th International Conference on Machine Learning. New York： JMLR.org， 2019： 3835-3845.
14	乐乔艺，刘建勋，孙晓平，等.代码克隆检测研究进展综述［J］.计算机科学，2021，48（11A）：509-522. 10.11896/jsjkx.210300310
	LE Q Y， LIU J X， SUN X P， et al. Survey of research progress of code cloning detection［J］. Computer Science，2021，48（11A）：509-522. 10.11896/jsjkx.210300310
15	DUCASSE S， RIEGER M， DEMEYER S. A language independent approach for detecting duplicated code［C］// Proceedings of the 15th IEEE International Conference on Software Maintenance. Piscataway： IEEE， 1999：109-118. 10.1109/icsm.1999.792593
16	RAGKHITWETSAGUL C， KRINKE J. Using compilation/decompilation to enhance clone detection［C］// Proceedings of the 2017 IEEE 11th International Workshop on Software Clones. Piscataway： IEEE， 2017：1-7. 10.1109/iwsc.2017.7880502
17	ROY C K， CORDY J R. NICAD： accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization［C］// Proceedings of the 2008 IEEE 16th IEEE International Conference on Program Comprehension. Piscataway： IEEE， 2008： 172-181. 10.1109/icpc.2008.41
18	NISHI M A， DAMEVSKI K. Scalable code clone detection and search based on adaptive prefix filtering［J］. Journal of Systems and Software， 2018， 137：130-142. 10.1016/j.jss.2017.11.039
19	LI L， FENG H， ZHUANG W， et al. CCLearner： a deep learning-based clone detection approach［C］// Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution. Piscataway： IEEE， 2017： 249-260. 10.1109/icsme.2017.46
20	SAJNANI H， SAINI V， SVAJLENKO J， et al. SourcererCC： scaling code clone detection to big-code［C］// Proceedings of the 2016 IEEE/ACM 38th International Conference on Software Engineering. Washington， DC： IEEE Computer Society， 2016：1157-1168. 10.1145/2884781.2884877
21	SVAJLENKO J， ROY C K. Fast and flexible large-scale clone detection with CloneWorks［C］// Proceedings of the 39th International Conference on Software Engineering Companion. Piscataway： IEEE， 2017：27-30. 10.1109/icse-c.2017.3
22	CHODAREV S， PIETRIKOVÁ E， KOLLÁR J. Haskell clone detection using pattern comparing algorithm［C］// Proceedings of the 2015 13th International Conference on Engineering of Modern Electric Systems. Piscataway： IEEE， 2015：1-4. 10.1109/emes.2015.7158423
23	YUAN D， FANG S， ZHANG T， et al. Java code clone detection by exploiting semantic and syntax information from intermediate code-based graph［J］. IEEE Transactions on Reliability， 2022， 72（2）： 511-526. 10.1109/tr.2022.3176922
24	HU Y， FANG Y， SUN Y， et al. Code2Img： tree-based image transformation for scalable code clone detection［J］. IEEE Transactions on Software Engineering， 2023， 49（9）： 4429-4442. 10.1109/tse.2023.3295801
25	MOU L， LI G， ZHANG L， et al. Convolutional neural networks over tree structures for programming language processing［C］// Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. Palo Alto： AAAI Press， 2016：1287-1293. 10.1609/aaai.v30i1.10139
26	SCARSELLI F， GORI M， TSOI A C， et al. The graph neural network model［J］. IEEE Transactions on Neural Networks，2009，20（1）：61-80. 10.1109/tnn.2008.2005605
27	GILMER J， SCHOENHOLZ S S， RILEY P F， et al. Neural message passing for quantum chemistry［C］// Proceedings of the 34th International Conference on Machine Learning. New York： JMLR.org， 2017：1263-1272.
28	ALLAMANIS M， BROCKSCHMIDT M， KHADEMI M. Learning to represent programs with graphs［EB/OL］. ［2022-07-30］. .
29	NGUYEN M， BUI N D Q. Learning to represent programs with code hierarchies［EB/OL］. （2022-05-31）［2022-07-30］. . 10.48550/arXiv.2205.15479
30	SVAJLENKO J， ISLAM J F， KEIVANLOO I， et al. Towards a big data curated benchmark of inter-project code clones［C］// Proceedings of the 2014 IEEE International Conference on Software Maintenance and Evolution. Piscataway： IEEE， 2014：476-480. 10.1109/icsme.2014.77
31	Google Code Jam［EB/OL］. ［2022-05-26］. . 10.4337/9781781006481.00014
32	FEY M， LENSSEN J E. Fast graph representation learning with PyTorch Geometric［EB/OL］. ［2022-08-02］. . 10.1109/cvpr.2018.00097
33	KINGMA D， BA J. Adam： a method for stochastic optimization［EB/OL］. ［2023-04-01］. .
34	JIANG L， MISHERGHI G， SU Z， et al. DECKARD： scalable and accurate tree-based detection of code clones［C］// Proceedings of the 29th International Conference on Software Engineering. Piscataway： IEEE， 2007： 96-105. 10.1109/icse.2007.30
35	WU Y， FENG S， ZOU D， et al. Detecting semantic code clones by building AST-based Markov chains model［C］// Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. New York： ACM， 2022： No. 34. 10.1145/3551349.3560426
36	HU Y， ZOU D， PENG J， et al. TreeCen： building tree graph for scalable semantic code clone detection［C］// Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. New York： ACM， 2022： No. 109. 10.1145/3551349.3556927

克隆类型	所占百分比
Type-1（T1）	0.46
Type-2（T2）	0.06
Strongly Type-3（ST3）	0.24
Moderately Type-3（MT3）	1.01
Weakly Type-3/Type-4（WT3/T4）	98.23

克隆类型	所占百分比
Type-1（T1）	0.46
Type-2（T2）	0.06
Strongly Type-3（ST3）	0.24
Moderately Type-3（MT3）	1.01
Weakly Type-3/Type-4（WT3/T4）	98.23

语句	BCB中平均出现次数	GCJ中平均出现次数
IfStatement	2.72	3.11
WhileStatement	0.44	0.44
ForStatement	0.42	4.06
BlockStatement	3.27	7.05
SwitchStatement	0.01	0.01

语句	BCB中平均出现次数	GCJ中平均出现次数
IfStatement	2.72	3.11
WhileStatement	0.44	0.44
ForStatement	0.42	4.06
BlockStatement	3.27	7.05
SwitchStatement	0.01	0.01

方法	T1	T2	ST3	MT3	WT3/T4
Deckard	0.73	0.71	0.54	0.21	0.03
RtvNN	1.00	0.97	0.60	0.03	0.01
CDLH	1.00	0.99	0.97	0.94	0.82
ASTNN	1.00	1.00	0.99	0.98	0.93
FA-AST	1.00	1.00	0.99	0.99	0.95
Amain	1.00	1.00	1.00	0.99	0.95
TreeCen	1.00	0.99	1.00	1.00	0.95
DEHAST	1.00	1.00	0.99	0.99	0.97

基于依赖增强的分层抽象语法树的代码克隆检测

Code clone detection based on dependency enhanced hierarchical abstract syntax tree

RichHTML

PDF

可视化

摘要/Abstract

引用本文

使用本文

图/表 11

参考文献 36

相关文章 15

编辑推荐

Metrics

方法	Precision	Recall	F1分数
Deckard	0.93	0.02	0.03
RtvNN	0.95	0.01	0.02
CDLH	0.92	0.74	0.82
ASTNN	0.92	0.94	0.93
FA-AST	0.96	0.94	0.95
Amain	0.95	0.96	0.95
TreeCen	0.97	0.93	0.95
DEHAST	0.98	0.96	0.97

方法	Precision	Recall	F1分数
Deckard	0.45	0.44	0.44
RtvNN	0.20	0.90	0.33
CDLH	0.82	0.66	0.73
ASTNN	0.87	0.95	0.91
Amain	0.93	0.91	0.92
TreeCen	0.93	0.94	0.93
DEHAST	0.93	0.97	0.95

序号	消融描述	Precision	Recall	F1分数
1	原始AST及其依赖	0.88	0.91	0.89
2	去掉语句层相关依赖的DEHAST	0.91	0.94	0.92
3	去掉子树层相关依赖的DEHAST	0.89	0.86	0.87
4	标准模型DEHAST	0.98	0.96	0.97

[1]	李顺勇, 李师毅, 胥瑞, 赵兴旺. 基于自注意力融合的不完整多视图聚类算法[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2696-2703.
[2]	黄云川, 江永全, 黄骏涛, 杨燕. 基于元图同构网络的分子毒性预测[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2964-2969.
[3]	秦璟, 秦志光, 李发礼, 彭悦恒. 基于概率稀疏自注意力神经网络的重性抑郁疾患诊断[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2970-2974.
[4]	王熙源, 张战成, 徐少康, 张宝成, 罗晓清, 胡伏原. 面向手术导航3D/2D配准的无监督跨域迁移网络[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2911-2918.
[5]	潘烨新, 杨哲. 基于多级特征双向融合的小目标检测优化模型[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2871-2877.
[6]	刘禹含, 吉根林, 张红苹. 基于骨架图与混合注意力的视频行人异常检测方法[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2551-2557.
[7]	顾焰杰, 张英俊, 刘晓倩, 周围, 孙威. 基于时空多图融合的交通流量预测[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2618-2625.
[8]	石乾宏, 杨燕, 江永全, 欧阳小草, 范武波, 陈强, 姜涛, 李媛. 面向空气质量预测的多粒度突变拟合网络[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2643-2650.
[9]	赵亦群, 张志禹, 董雪. 基于密集残差物理信息神经网络的各向异性旅行时计算方法[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2310-2318.
[10]	徐松, 张文博, 王一帆. 基于时空信息的轻量视频显著性目标检测网络[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2192-2199.
[11]	孙逊, 冯睿锋, 陈彦如. 基于深度与实例分割融合的单目3D目标检测方法[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2208-2215.
[12]	吴筝, 程志友, 汪真天, 汪传建, 王胜, 许辉. 基于深度学习的患者麻醉复苏过程中的头部运动幅度分类方法[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2258-2263.
[13]	李欢欢, 黄添强, 丁雪梅, 罗海峰, 黄丽清. 基于多尺度时空图卷积网络的交通出行需求预测[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2065-2072.
[14]	张郅, 李欣, 叶乃夫, 胡凯茜. 基于暗知识保护的模型窃取防御技术DKP[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2080-2086.
[15]	赵雅娟, 孟繁军, 徐行健. 在线教育学习者知识追踪综述[J]. 《计算机应用》唯一官方网站, 2024, 44(6): 1683-1698.