《计算机应用》唯一官方网站 ›› 2024, Vol. 44 ›› Issue (4): 1259-1268.DOI: 10.11772/j.issn.1001-9081.2023040485
• 计算机软件技术 • 上一篇
收稿日期:
2023-04-26
修回日期:
2023-07-04
接受日期:
2023-07-10
发布日期:
2023-12-04
出版日期:
2024-04-10
通讯作者:
谢春丽
作者简介:
万泽轩(1998—),男,江苏徐州人,硕士研究生,主要研究方向:代码表征、代码克隆基金资助:
Zexuan WAN, Chunli XIE(), Quanrun LYU, Yao LIANG
Received:
2023-04-26
Revised:
2023-07-04
Accepted:
2023-07-10
Online:
2023-12-04
Published:
2024-04-10
Contact:
Chunli XIE
About author:
WAN Zexuan, born in 1998, M. S. candidate. His research interests include code representation, code clone.Supported by:
摘要:
在软件工程领域,基于语义相似的代码克隆检测方法可以降低软件维护的成本并预防系统漏洞,抽象语法树(AST)作为典型的代码抽象表征形式,已成功应用于多种程序语言的代码克隆检测任务,然而现有工作主要利用原始AST提取代码的语义,没有深入挖掘AST中的深层语义和结构信息。针对上述问题,提出一种基于依赖增强的分层抽象语法树(DEHAST)的代码克隆检测方法。首先,对AST进行分层处理,将AST划分得到不同的语义层次;其次,为AST的不同层次添加相应的依赖增强边构建DEHAST,将简单的AST变成具有更丰富程序语义的异构图;最后,使用图匹配网络(GMN)模型检测异构图的相似性,实现代码克隆检测。在BigCloneBench和Google Code Jam两个数据集上的实验结果显示,DEHAST能够检测100%的Type-1和Type-2代码克隆、99%的Type-3代码克隆和97%的Type-4代码克隆;与基于树的方法ASTNN(AST-based Neural Network)相比,F1分数均提高了4个百分点,验证了DEHAST可以较好地完成代码语义克隆检测。
中图分类号:
万泽轩, 谢春丽, 吕泉润, 梁瑶. 基于依赖增强的分层抽象语法树的代码克隆检测[J]. 计算机应用, 2024, 44(4): 1259-1268.
Zexuan WAN, Chunli XIE, Quanrun LYU, Yao LIANG. Code clone detection based on dependency enhanced hierarchical abstract syntax tree[J]. Journal of Computer Applications, 2024, 44(4): 1259-1268.
克隆类型 | 所占百分比 |
---|---|
Type-1(T1) | 0.46 |
Type-2(T2) | 0.06 |
Strongly Type-3(ST3) | 0.24 |
Moderately Type-3(MT3) | 1.01 |
Weakly Type-3/Type-4(WT3/T4) | 98.23 |
表1 在BCB数据集中不同克隆类型的百分比 (%)
Tab. 1 Percentage points of different clone types in BCB dataset
克隆类型 | 所占百分比 |
---|---|
Type-1(T1) | 0.46 |
Type-2(T2) | 0.06 |
Strongly Type-3(ST3) | 0.24 |
Moderately Type-3(MT3) | 1.01 |
Weakly Type-3/Type-4(WT3/T4) | 98.23 |
语句 | BCB中平均出现次数 | GCJ中平均出现次数 |
---|---|---|
IfStatement | 2.72 | 3.11 |
WhileStatement | 0.44 | 0.44 |
ForStatement | 0.42 | 4.06 |
BlockStatement | 3.27 | 7.05 |
SwitchStatement | 0.01 | 0.01 |
表2 在BCB和GCJ数据集中不同语句的平均出现次数
Tab. 2 Average occurrences of statements in BCB and GCJ datasets
语句 | BCB中平均出现次数 | GCJ中平均出现次数 |
---|---|---|
IfStatement | 2.72 | 3.11 |
WhileStatement | 0.44 | 0.44 |
ForStatement | 0.42 | 4.06 |
BlockStatement | 3.27 | 7.05 |
SwitchStatement | 0.01 | 0.01 |
方法 | T1 | T2 | ST3 | MT3 | WT3/T4 |
---|---|---|---|---|---|
Deckard | 0.73 | 0.71 | 0.54 | 0.21 | 0.03 |
RtvNN | 1.00 | 0.97 | 0.60 | 0.03 | 0.01 |
CDLH | 1.00 | 0.99 | 0.97 | 0.94 | 0.82 |
ASTNN | 1.00 | 1.00 | 0.99 | 0.98 | 0.93 |
FA-AST | 1.00 | 1.00 | 0.99 | 0.99 | 0.95 |
Amain | 1.00 | 1.00 | 1.00 | 0.99 | 0.95 |
TreeCen | 1.00 | 0.99 | 1.00 | 1.00 | 0.95 |
DEHAST | 1.00 | 1.00 | 0.99 | 0.99 | 0.97 |
表3 在BCB数据集上每个克隆类型的F1分数
Tab. 3 F1 scores for each clone type on BCB dataset
方法 | T1 | T2 | ST3 | MT3 | WT3/T4 |
---|---|---|---|---|---|
Deckard | 0.73 | 0.71 | 0.54 | 0.21 | 0.03 |
RtvNN | 1.00 | 0.97 | 0.60 | 0.03 | 0.01 |
CDLH | 1.00 | 0.99 | 0.97 | 0.94 | 0.82 |
ASTNN | 1.00 | 1.00 | 0.99 | 0.98 | 0.93 |
FA-AST | 1.00 | 1.00 | 0.99 | 0.99 | 0.95 |
Amain | 1.00 | 1.00 | 1.00 | 0.99 | 0.95 |
TreeCen | 1.00 | 0.99 | 1.00 | 1.00 | 0.95 |
DEHAST | 1.00 | 1.00 | 0.99 | 0.99 | 0.97 |
方法 | Precision | Recall | F1分数 |
---|---|---|---|
Deckard | 0.93 | 0.02 | 0.03 |
RtvNN | 0.95 | 0.01 | 0.02 |
CDLH | 0.92 | 0.74 | 0.82 |
ASTNN | 0.92 | 0.94 | 0.93 |
FA-AST | 0.96 | 0.94 | 0.95 |
Amain | 0.95 | 0.96 | 0.95 |
TreeCen | 0.97 | 0.93 | 0.95 |
DEHAST | 0.98 | 0.96 | 0.97 |
表4 各方法在BCB数据集上的性能
Tab. 4 Performance of different mehods on BCB dataset
方法 | Precision | Recall | F1分数 |
---|---|---|---|
Deckard | 0.93 | 0.02 | 0.03 |
RtvNN | 0.95 | 0.01 | 0.02 |
CDLH | 0.92 | 0.74 | 0.82 |
ASTNN | 0.92 | 0.94 | 0.93 |
FA-AST | 0.96 | 0.94 | 0.95 |
Amain | 0.95 | 0.96 | 0.95 |
TreeCen | 0.97 | 0.93 | 0.95 |
DEHAST | 0.98 | 0.96 | 0.97 |
方法 | Precision | Recall | F1分数 |
---|---|---|---|
Deckard | 0.45 | 0.44 | 0.44 |
RtvNN | 0.20 | 0.90 | 0.33 |
CDLH | 0.82 | 0.66 | 0.73 |
ASTNN | 0.87 | 0.95 | 0.91 |
Amain | 0.93 | 0.91 | 0.92 |
TreeCen | 0.93 | 0.94 | 0.93 |
DEHAST | 0.93 | 0.97 | 0.95 |
表5 各方法在GCJ数据集上的性能
Tab. 5 Performance of different methods on GCJ dataset
方法 | Precision | Recall | F1分数 |
---|---|---|---|
Deckard | 0.45 | 0.44 | 0.44 |
RtvNN | 0.20 | 0.90 | 0.33 |
CDLH | 0.82 | 0.66 | 0.73 |
ASTNN | 0.87 | 0.95 | 0.91 |
Amain | 0.93 | 0.91 | 0.92 |
TreeCen | 0.93 | 0.94 | 0.93 |
DEHAST | 0.93 | 0.97 | 0.95 |
序号 | 消融描述 | Precision | Recall | F1分数 |
---|---|---|---|---|
1 | 原始AST及其依赖 | 0.88 | 0.91 | 0.89 |
2 | 去掉语句层相关依赖的DEHAST | 0.91 | 0.94 | 0.92 |
3 | 去掉子树层相关依赖的DEHAST | 0.89 | 0.86 | 0.87 |
4 | 标准模型DEHAST | 0.98 | 0.96 | 0.97 |
表6 消融实验结果
Tab. 6 Results of ablation experiments
序号 | 消融描述 | Precision | Recall | F1分数 |
---|---|---|---|---|
1 | 原始AST及其依赖 | 0.88 | 0.91 | 0.89 |
2 | 去掉语句层相关依赖的DEHAST | 0.91 | 0.94 | 0.92 |
3 | 去掉子树层相关依赖的DEHAST | 0.89 | 0.86 | 0.87 |
4 | 标准模型DEHAST | 0.98 | 0.96 | 0.97 |
1 | ROY C K, CORDY J R. A survey on software clone detection research: Technical Report No. 2007-541[R]. Ontario, Canada: Queen’s University at Kingston, School of Computing, 2007: 64-68. |
2 | KAMIYA T, KUSUMOTO S, INOUE K. CCFinder: a multilinguistic token-based code clone detection system for large scale source code[J]. IEEE Transactions on Software Engineering, 2002, 28(7):654-670. 10.1109/tse.2002.1019480 |
3 | LIU C, CHEN C, HAN J, et al. GPLAG: detection of software plagiarism by program dependence graph analysis[C]// Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2006:872-881. 10.1145/1150402.1150522 |
4 | WANG P, SVAJLENKO J, WU Y, et al. CCAligner: a token based large-gap clone detector[C]// Proceedings of the 40th International Conference on Software Engineering. New York: ACM, 2018:1066-1077. 10.1145/3180155.3180179 |
5 | GABEL M, JIANG L, SU Z. Scalable detection of semantic clones[C]// Proceedings of the 30th International Conference on Software Engineering. New York: ACM, 2008: 321-330. 10.1145/1368088.1368132 |
6 | ZHANG J, WANG X, ZHANG H, et al. A novel neural source code representation based on abstract syntax tree[C]// Proceedings of the 41st International Conference on Software Engineering. Piscataway: IEEE, 2019: 783-794. 10.1109/icse.2019.00086 |
7 | WHITE M, TUFANO M, VENDOME C, et al. Deep learning code fragments for code clone detection[C]// Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. New York: ACM, 2016: 87-98. 10.1145/2970276.2970326 |
8 | WANG M, WANG P, XU Y. CCSharp: an efficient three-phase code clone detector using modified PDGs[C]// Proceedings of the 2017 24th Asia-Pacific Software Engineering Conference. Piscataway: IEEE, 2017:100-109. 10.1109/apsec.2017.16 |
9 | WEI H-H, LI M. Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code[C]// Proceedings of the 26th International Joint Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2017: 3034-3040. 10.24963/ijcai.2017/423 |
10 | TAI K S, SOCHER R, MANNING C D. Improved semantic representations from tree-structured long short-term memory networks[C]// Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Stroudsberg: ACL, 2015:1556-1566. 10.3115/v1/p15-1150 |
11 | WANG W, LI G, MA B, et al. Detecting code clones with graph neural network and flow-augmented abstract syntax tree[C]// Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering. Piscataway: IEEE, 2020: 261-271. 10.1109/saner48275.2020.9054857 |
12 | ZHAO G, HUANG J. DeepSim: deep learning code functional similarity[C]// Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York: ACM, 2018: 141-151. 10.1145/3236024.3236068 |
13 | LI Y, GU C, DULLIEN T, et al. Graph matching networks for learning the similarity of graph structured objects[C]// Proceedings of the 36th International Conference on Machine Learning. New York: JMLR.org, 2019: 3835-3845. |
14 | 乐乔艺,刘建勋,孙晓平,等.代码克隆检测研究进展综述[J].计算机科学,2021,48(11A):509-522. 10.11896/jsjkx.210300310 |
LE Q Y, LIU J X, SUN X P, et al. Survey of research progress of code cloning detection[J]. Computer Science,2021,48(11A):509-522. 10.11896/jsjkx.210300310 | |
15 | DUCASSE S, RIEGER M, DEMEYER S. A language independent approach for detecting duplicated code[C]// Proceedings of the 15th IEEE International Conference on Software Maintenance. Piscataway: IEEE, 1999:109-118. 10.1109/icsm.1999.792593 |
16 | RAGKHITWETSAGUL C, KRINKE J. Using compilation/decompilation to enhance clone detection[C]// Proceedings of the 2017 IEEE 11th International Workshop on Software Clones. Piscataway: IEEE, 2017:1-7. 10.1109/iwsc.2017.7880502 |
17 | ROY C K, CORDY J R. NICAD: accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization[C]// Proceedings of the 2008 IEEE 16th IEEE International Conference on Program Comprehension. Piscataway: IEEE, 2008: 172-181. 10.1109/icpc.2008.41 |
18 | NISHI M A, DAMEVSKI K. Scalable code clone detection and search based on adaptive prefix filtering[J]. Journal of Systems and Software, 2018, 137:130-142. 10.1016/j.jss.2017.11.039 |
19 | LI L, FENG H, ZHUANG W, et al. CCLearner: a deep learning-based clone detection approach[C]// Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution. Piscataway: IEEE, 2017: 249-260. 10.1109/icsme.2017.46 |
20 | SAJNANI H, SAINI V, SVAJLENKO J, et al. SourcererCC: scaling code clone detection to big-code[C]// Proceedings of the 2016 IEEE/ACM 38th International Conference on Software Engineering. Washington, DC: IEEE Computer Society, 2016:1157-1168. 10.1145/2884781.2884877 |
21 | SVAJLENKO J, ROY C K. Fast and flexible large-scale clone detection with CloneWorks[C]// Proceedings of the 39th International Conference on Software Engineering Companion. Piscataway: IEEE, 2017:27-30. 10.1109/icse-c.2017.3 |
22 | CHODAREV S, PIETRIKOVÁ E, KOLLÁR J. Haskell clone detection using pattern comparing algorithm[C]// Proceedings of the 2015 13th International Conference on Engineering of Modern Electric Systems. Piscataway: IEEE, 2015:1-4. 10.1109/emes.2015.7158423 |
23 | YUAN D, FANG S, ZHANG T, et al. Java code clone detection by exploiting semantic and syntax information from intermediate code-based graph[J]. IEEE Transactions on Reliability, 2022, 72(2): 511-526. 10.1109/tr.2022.3176922 |
24 | HU Y, FANG Y, SUN Y, et al. Code2Img: tree-based image transformation for scalable code clone detection[J]. IEEE Transactions on Software Engineering, 2023, 49(9): 4429-4442. 10.1109/tse.2023.3295801 |
25 | MOU L, LI G, ZHANG L, et al. Convolutional neural networks over tree structures for programming language processing[C]// Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2016:1287-1293. 10.1609/aaai.v30i1.10139 |
26 | SCARSELLI F, GORI M, TSOI A C, et al. The graph neural network model[J]. IEEE Transactions on Neural Networks,2009,20(1):61-80. 10.1109/tnn.2008.2005605 |
27 | GILMER J, SCHOENHOLZ S S, RILEY P F, et al. Neural message passing for quantum chemistry[C]// Proceedings of the 34th International Conference on Machine Learning. New York: JMLR.org, 2017:1263-1272. |
28 | ALLAMANIS M, BROCKSCHMIDT M, KHADEMI M. Learning to represent programs with graphs[EB/OL]. [2022-07-30]. . |
29 | NGUYEN M, BUI N D Q. Learning to represent programs with code hierarchies[EB/OL]. (2022-05-31)[2022-07-30]. . 10.48550/arXiv.2205.15479 |
30 | SVAJLENKO J, ISLAM J F, KEIVANLOO I, et al. Towards a big data curated benchmark of inter-project code clones[C]// Proceedings of the 2014 IEEE International Conference on Software Maintenance and Evolution. Piscataway: IEEE, 2014:476-480. 10.1109/icsme.2014.77 |
31 | Google Code Jam[EB/OL]. [2022-05-26]. . 10.4337/9781781006481.00014 |
32 | FEY M, LENSSEN J E. Fast graph representation learning with PyTorch Geometric[EB/OL]. [2022-08-02]. . 10.1109/cvpr.2018.00097 |
33 | KINGMA D, BA J. Adam: a method for stochastic optimization[EB/OL]. [2023-04-01]. . |
34 | JIANG L, MISHERGHI G, SU Z, et al. DECKARD: scalable and accurate tree-based detection of code clones[C]// Proceedings of the 29th International Conference on Software Engineering. Piscataway: IEEE, 2007: 96-105. 10.1109/icse.2007.30 |
35 | WU Y, FENG S, ZOU D, et al. Detecting semantic code clones by building AST-based Markov chains model[C]// Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. New York: ACM, 2022: No. 34. 10.1145/3551349.3560426 |
36 | HU Y, ZOU D, PENG J, et al. TreeCen: building tree graph for scalable semantic code clone detection[C]// Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. New York: ACM, 2022: No. 109. 10.1145/3551349.3556927 |
[1] | 张鹏飞, 韩李涛, 冯恒健, 李洪梅. 基于注意力机制和全局特征优化的点云语义分割[J]. 《计算机应用》唯一官方网站, 2024, 44(4): 1086-1092. |
[2] | 唐睿, 岳士博, 张睿智, 刘川, 庞川林. UAV协助下非正交多址接入使能的数据采集系统中能效优化机制[J]. 《计算机应用》唯一官方网站, 2024, 44(4): 1209-1218. |
[3] | 孙祥杰, 魏强, 王奕森, 杜江. 代码相似性检测技术综述[J]. 《计算机应用》唯一官方网站, 2024, 44(4): 1248-1258. |
[4] | 杨先凤, 汤依磊, 李自强. 基于交替注意力机制和图卷积网络的方面级情感分析模型[J]. 《计算机应用》唯一官方网站, 2024, 44(4): 1058-1064. |
[5] | 李雨秋, 侯利萍, 薛健, 吕科, 王泳. 基于内容解译的遥感图像推荐方法[J]. 《计算机应用》唯一官方网站, 2024, 44(3): 722-731. |
[6] | 董炜娜, 刘佳, 潘晓中, 陈立峰, 孙文权. 基于编码-解码网络的大容量鲁棒图像隐写方案[J]. 《计算机应用》唯一官方网站, 2024, 44(3): 772-779. |
[7] | 赵奎, 仇慧琪, 李旭, 徐知非. 结合注意力和多路径融合的实时肺结节检测算法[J]. 《计算机应用》唯一官方网站, 2024, 44(3): 945-952. |
[8] | 蔡美玉, 朱润哲, 吴飞, 张开昱, 李家乐. 基于注意力机制和多粒度特征融合的跨视角匹配模型[J]. 《计算机应用》唯一官方网站, 2024, 44(3): 901-908. |
[9] | 唐瑶瑶, 朱叶晨, 刘仰川, 高欣. CT图像环形伪影去除方法研究现状及展望[J]. 《计算机应用》唯一官方网站, 2024, 44(3): 890-900. |
[10] | 徐大鹏, 侯新民. 基于网络结构设计的图神经网络特征选择方法[J]. 《计算机应用》唯一官方网站, 2024, 44(3): 663-670. |
[11] | 张家伟, 高冠东, 肖珂, 宋胜尊. 基于改进分层注意网络和TextCNN联合建模的暴力犯罪分级算法[J]. 《计算机应用》唯一官方网站, 2024, 44(2): 403-410. |
[12] | 宋钰丹, 王晶, 王雪徽, 马朝阳, 林友芳. 基于自适应多任务学习的睡眠生理时序分类方法[J]. 《计算机应用》唯一官方网站, 2024, 44(2): 654-662. |
[13] | 刘祥, 华蓓, 林飞, 魏宏原. 面向深度学习应用的组件式开发框架的设计实现[J]. 《计算机应用》唯一官方网站, 2024, 44(2): 526-535. |
[14] | 荆智文, 张屿佳, 孙伯廷, 郭浩. 二阶段孪生图卷积神经网络推荐算法[J]. 《计算机应用》唯一官方网站, 2024, 44(2): 469-476. |
[15] | 邱华禄, 蔺素珍, 王彦博, 刘峰, 李大威. 基于复卷积双域级联网络的欠采样磁共振图像重建算法[J]. 《计算机应用》唯一官方网站, 2024, 44(2): 580-587. |
阅读次数 | ||||||
全文 |
|
|||||
摘要 |
|
|||||