《计算机应用》唯一官方网站 ›› 2024, Vol. 44 ›› Issue (4): 1259-1268.DOI: 10.11772/j.issn.1001-9081.2023040485
收稿日期:
2023-04-26
修回日期:
2023-07-04
接受日期:
2023-07-10
发布日期:
2023-12-04
出版日期:
2024-04-10
通讯作者:
谢春丽
作者简介:
万泽轩(1998—),男,江苏徐州人,硕士研究生,主要研究方向:代码表征、代码克隆基金资助:
Zexuan WAN, Chunli XIE(), Quanrun LYU, Yao LIANG
Received:
2023-04-26
Revised:
2023-07-04
Accepted:
2023-07-10
Online:
2023-12-04
Published:
2024-04-10
Contact:
Chunli XIE
About author:
WAN Zexuan, born in 1998, M. S. candidate. His research interests include code representation, code clone.Supported by:
摘要:
在软件工程领域,基于语义相似的代码克隆检测方法可以降低软件维护的成本并预防系统漏洞,抽象语法树(AST)作为典型的代码抽象表征形式,已成功应用于多种程序语言的代码克隆检测任务,然而现有工作主要利用原始AST提取代码的语义,没有深入挖掘AST中的深层语义和结构信息。针对上述问题,提出一种基于依赖增强的分层抽象语法树(DEHAST)的代码克隆检测方法。首先,对AST进行分层处理,将AST划分得到不同的语义层次;其次,为AST的不同层次添加相应的依赖增强边构建DEHAST,将简单的AST变成具有更丰富程序语义的异构图;最后,使用图匹配网络(GMN)模型检测异构图的相似性,实现代码克隆检测。在BigCloneBench和Google Code Jam两个数据集上的实验结果显示,DEHAST能够检测100%的Type-1和Type-2代码克隆、99%的Type-3代码克隆和97%的Type-4代码克隆;与基于树的方法ASTNN(AST-based Neural Network)相比,F1分数均提高了4个百分点,验证了DEHAST可以较好地完成代码语义克隆检测。
中图分类号:
万泽轩, 谢春丽, 吕泉润, 梁瑶. 基于依赖增强的分层抽象语法树的代码克隆检测[J]. 计算机应用, 2024, 44(4): 1259-1268.
Zexuan WAN, Chunli XIE, Quanrun LYU, Yao LIANG. Code clone detection based on dependency enhanced hierarchical abstract syntax tree[J]. Journal of Computer Applications, 2024, 44(4): 1259-1268.
克隆类型 | 所占百分比 |
---|---|
Type-1(T1) | 0.46 |
Type-2(T2) | 0.06 |
Strongly Type-3(ST3) | 0.24 |
Moderately Type-3(MT3) | 1.01 |
Weakly Type-3/Type-4(WT3/T4) | 98.23 |
表1 在BCB数据集中不同克隆类型的百分比 (%)
Tab. 1 Percentage points of different clone types in BCB dataset
克隆类型 | 所占百分比 |
---|---|
Type-1(T1) | 0.46 |
Type-2(T2) | 0.06 |
Strongly Type-3(ST3) | 0.24 |
Moderately Type-3(MT3) | 1.01 |
Weakly Type-3/Type-4(WT3/T4) | 98.23 |
语句 | BCB中平均出现次数 | GCJ中平均出现次数 |
---|---|---|
IfStatement | 2.72 | 3.11 |
WhileStatement | 0.44 | 0.44 |
ForStatement | 0.42 | 4.06 |
BlockStatement | 3.27 | 7.05 |
SwitchStatement | 0.01 | 0.01 |
表2 在BCB和GCJ数据集中不同语句的平均出现次数
Tab. 2 Average occurrences of statements in BCB and GCJ datasets
语句 | BCB中平均出现次数 | GCJ中平均出现次数 |
---|---|---|
IfStatement | 2.72 | 3.11 |
WhileStatement | 0.44 | 0.44 |
ForStatement | 0.42 | 4.06 |
BlockStatement | 3.27 | 7.05 |
SwitchStatement | 0.01 | 0.01 |
方法 | T1 | T2 | ST3 | MT3 | WT3/T4 |
---|---|---|---|---|---|
Deckard | 0.73 | 0.71 | 0.54 | 0.21 | 0.03 |
RtvNN | 1.00 | 0.97 | 0.60 | 0.03 | 0.01 |
CDLH | 1.00 | 0.99 | 0.97 | 0.94 | 0.82 |
ASTNN | 1.00 | 1.00 | 0.99 | 0.98 | 0.93 |
FA-AST | 1.00 | 1.00 | 0.99 | 0.99 | 0.95 |
Amain | 1.00 | 1.00 | 1.00 | 0.99 | 0.95 |
TreeCen | 1.00 | 0.99 | 1.00 | 1.00 | 0.95 |
DEHAST | 1.00 | 1.00 | 0.99 | 0.99 | 0.97 |
表3 在BCB数据集上每个克隆类型的F1分数
Tab. 3 F1 scores for each clone type on BCB dataset
方法 | T1 | T2 | ST3 | MT3 | WT3/T4 |
---|---|---|---|---|---|
Deckard | 0.73 | 0.71 | 0.54 | 0.21 | 0.03 |
RtvNN | 1.00 | 0.97 | 0.60 | 0.03 | 0.01 |
CDLH | 1.00 | 0.99 | 0.97 | 0.94 | 0.82 |
ASTNN | 1.00 | 1.00 | 0.99 | 0.98 | 0.93 |
FA-AST | 1.00 | 1.00 | 0.99 | 0.99 | 0.95 |
Amain | 1.00 | 1.00 | 1.00 | 0.99 | 0.95 |
TreeCen | 1.00 | 0.99 | 1.00 | 1.00 | 0.95 |
DEHAST | 1.00 | 1.00 | 0.99 | 0.99 | 0.97 |
方法 | Precision | Recall | F1分数 |
---|---|---|---|
Deckard | 0.93 | 0.02 | 0.03 |
RtvNN | 0.95 | 0.01 | 0.02 |
CDLH | 0.92 | 0.74 | 0.82 |
ASTNN | 0.92 | 0.94 | 0.93 |
FA-AST | 0.96 | 0.94 | 0.95 |
Amain | 0.95 | 0.96 | 0.95 |
TreeCen | 0.97 | 0.93 | 0.95 |
DEHAST | 0.98 | 0.96 | 0.97 |
表4 各方法在BCB数据集上的性能
Tab. 4 Performance of different mehods on BCB dataset
方法 | Precision | Recall | F1分数 |
---|---|---|---|
Deckard | 0.93 | 0.02 | 0.03 |
RtvNN | 0.95 | 0.01 | 0.02 |
CDLH | 0.92 | 0.74 | 0.82 |
ASTNN | 0.92 | 0.94 | 0.93 |
FA-AST | 0.96 | 0.94 | 0.95 |
Amain | 0.95 | 0.96 | 0.95 |
TreeCen | 0.97 | 0.93 | 0.95 |
DEHAST | 0.98 | 0.96 | 0.97 |
方法 | Precision | Recall | F1分数 |
---|---|---|---|
Deckard | 0.45 | 0.44 | 0.44 |
RtvNN | 0.20 | 0.90 | 0.33 |
CDLH | 0.82 | 0.66 | 0.73 |
ASTNN | 0.87 | 0.95 | 0.91 |
Amain | 0.93 | 0.91 | 0.92 |
TreeCen | 0.93 | 0.94 | 0.93 |
DEHAST | 0.93 | 0.97 | 0.95 |
表5 各方法在GCJ数据集上的性能
Tab. 5 Performance of different methods on GCJ dataset
方法 | Precision | Recall | F1分数 |
---|---|---|---|
Deckard | 0.45 | 0.44 | 0.44 |
RtvNN | 0.20 | 0.90 | 0.33 |
CDLH | 0.82 | 0.66 | 0.73 |
ASTNN | 0.87 | 0.95 | 0.91 |
Amain | 0.93 | 0.91 | 0.92 |
TreeCen | 0.93 | 0.94 | 0.93 |
DEHAST | 0.93 | 0.97 | 0.95 |
序号 | 消融描述 | Precision | Recall | F1分数 |
---|---|---|---|---|
1 | 原始AST及其依赖 | 0.88 | 0.91 | 0.89 |
2 | 去掉语句层相关依赖的DEHAST | 0.91 | 0.94 | 0.92 |
3 | 去掉子树层相关依赖的DEHAST | 0.89 | 0.86 | 0.87 |
4 | 标准模型DEHAST | 0.98 | 0.96 | 0.97 |
表6 消融实验结果
Tab. 6 Results of ablation experiments
序号 | 消融描述 | Precision | Recall | F1分数 |
---|---|---|---|---|
1 | 原始AST及其依赖 | 0.88 | 0.91 | 0.89 |
2 | 去掉语句层相关依赖的DEHAST | 0.91 | 0.94 | 0.92 |
3 | 去掉子树层相关依赖的DEHAST | 0.89 | 0.86 | 0.87 |
4 | 标准模型DEHAST | 0.98 | 0.96 | 0.97 |
1 | ROY C K, CORDY J R. A survey on software clone detection research: Technical Report No. 2007-541[R]. Ontario, Canada: Queen’s University at Kingston, School of Computing, 2007: 64-68. |
2 | KAMIYA T, KUSUMOTO S, INOUE K. CCFinder: a multilinguistic token-based code clone detection system for large scale source code[J]. IEEE Transactions on Software Engineering, 2002, 28(7):654-670. 10.1109/tse.2002.1019480 |
3 | LIU C, CHEN C, HAN J, et al. GPLAG: detection of software plagiarism by program dependence graph analysis[C]// Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2006:872-881. 10.1145/1150402.1150522 |
4 | WANG P, SVAJLENKO J, WU Y, et al. CCAligner: a token based large-gap clone detector[C]// Proceedings of the 40th International Conference on Software Engineering. New York: ACM, 2018:1066-1077. 10.1145/3180155.3180179 |
5 | GABEL M, JIANG L, SU Z. Scalable detection of semantic clones[C]// Proceedings of the 30th International Conference on Software Engineering. New York: ACM, 2008: 321-330. 10.1145/1368088.1368132 |
6 | ZHANG J, WANG X, ZHANG H, et al. A novel neural source code representation based on abstract syntax tree[C]// Proceedings of the 41st International Conference on Software Engineering. Piscataway: IEEE, 2019: 783-794. 10.1109/icse.2019.00086 |
7 | WHITE M, TUFANO M, VENDOME C, et al. Deep learning code fragments for code clone detection[C]// Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. New York: ACM, 2016: 87-98. 10.1145/2970276.2970326 |
8 | WANG M, WANG P, XU Y. CCSharp: an efficient three-phase code clone detector using modified PDGs[C]// Proceedings of the 2017 24th Asia-Pacific Software Engineering Conference. Piscataway: IEEE, 2017:100-109. 10.1109/apsec.2017.16 |
9 | WEI H-H, LI M. Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code[C]// Proceedings of the 26th International Joint Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2017: 3034-3040. 10.24963/ijcai.2017/423 |
10 | TAI K S, SOCHER R, MANNING C D. Improved semantic representations from tree-structured long short-term memory networks[C]// Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Stroudsberg: ACL, 2015:1556-1566. 10.3115/v1/p15-1150 |
11 | WANG W, LI G, MA B, et al. Detecting code clones with graph neural network and flow-augmented abstract syntax tree[C]// Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering. Piscataway: IEEE, 2020: 261-271. 10.1109/saner48275.2020.9054857 |
12 | ZHAO G, HUANG J. DeepSim: deep learning code functional similarity[C]// Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York: ACM, 2018: 141-151. 10.1145/3236024.3236068 |
13 | LI Y, GU C, DULLIEN T, et al. Graph matching networks for learning the similarity of graph structured objects[C]// Proceedings of the 36th International Conference on Machine Learning. New York: JMLR.org, 2019: 3835-3845. |
14 | 乐乔艺,刘建勋,孙晓平,等.代码克隆检测研究进展综述[J].计算机科学,2021,48(11A):509-522. 10.11896/jsjkx.210300310 |
LE Q Y, LIU J X, SUN X P, et al. Survey of research progress of code cloning detection[J]. Computer Science,2021,48(11A):509-522. 10.11896/jsjkx.210300310 | |
15 | DUCASSE S, RIEGER M, DEMEYER S. A language independent approach for detecting duplicated code[C]// Proceedings of the 15th IEEE International Conference on Software Maintenance. Piscataway: IEEE, 1999:109-118. 10.1109/icsm.1999.792593 |
16 | RAGKHITWETSAGUL C, KRINKE J. Using compilation/decompilation to enhance clone detection[C]// Proceedings of the 2017 IEEE 11th International Workshop on Software Clones. Piscataway: IEEE, 2017:1-7. 10.1109/iwsc.2017.7880502 |
17 | ROY C K, CORDY J R. NICAD: accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization[C]// Proceedings of the 2008 IEEE 16th IEEE International Conference on Program Comprehension. Piscataway: IEEE, 2008: 172-181. 10.1109/icpc.2008.41 |
18 | NISHI M A, DAMEVSKI K. Scalable code clone detection and search based on adaptive prefix filtering[J]. Journal of Systems and Software, 2018, 137:130-142. 10.1016/j.jss.2017.11.039 |
19 | LI L, FENG H, ZHUANG W, et al. CCLearner: a deep learning-based clone detection approach[C]// Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution. Piscataway: IEEE, 2017: 249-260. 10.1109/icsme.2017.46 |
20 | SAJNANI H, SAINI V, SVAJLENKO J, et al. SourcererCC: scaling code clone detection to big-code[C]// Proceedings of the 2016 IEEE/ACM 38th International Conference on Software Engineering. Washington, DC: IEEE Computer Society, 2016:1157-1168. 10.1145/2884781.2884877 |
21 | SVAJLENKO J, ROY C K. Fast and flexible large-scale clone detection with CloneWorks[C]// Proceedings of the 39th International Conference on Software Engineering Companion. Piscataway: IEEE, 2017:27-30. 10.1109/icse-c.2017.3 |
22 | CHODAREV S, PIETRIKOVÁ E, KOLLÁR J. Haskell clone detection using pattern comparing algorithm[C]// Proceedings of the 2015 13th International Conference on Engineering of Modern Electric Systems. Piscataway: IEEE, 2015:1-4. 10.1109/emes.2015.7158423 |
23 | YUAN D, FANG S, ZHANG T, et al. Java code clone detection by exploiting semantic and syntax information from intermediate code-based graph[J]. IEEE Transactions on Reliability, 2022, 72(2): 511-526. 10.1109/tr.2022.3176922 |
24 | HU Y, FANG Y, SUN Y, et al. Code2Img: tree-based image transformation for scalable code clone detection[J]. IEEE Transactions on Software Engineering, 2023, 49(9): 4429-4442. 10.1109/tse.2023.3295801 |
25 | MOU L, LI G, ZHANG L, et al. Convolutional neural networks over tree structures for programming language processing[C]// Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2016:1287-1293. 10.1609/aaai.v30i1.10139 |
26 | SCARSELLI F, GORI M, TSOI A C, et al. The graph neural network model[J]. IEEE Transactions on Neural Networks,2009,20(1):61-80. 10.1109/tnn.2008.2005605 |
27 | GILMER J, SCHOENHOLZ S S, RILEY P F, et al. Neural message passing for quantum chemistry[C]// Proceedings of the 34th International Conference on Machine Learning. New York: JMLR.org, 2017:1263-1272. |
28 | ALLAMANIS M, BROCKSCHMIDT M, KHADEMI M. Learning to represent programs with graphs[EB/OL]. [2022-07-30]. . |
29 | NGUYEN M, BUI N D Q. Learning to represent programs with code hierarchies[EB/OL]. (2022-05-31)[2022-07-30]. . 10.48550/arXiv.2205.15479 |
30 | SVAJLENKO J, ISLAM J F, KEIVANLOO I, et al. Towards a big data curated benchmark of inter-project code clones[C]// Proceedings of the 2014 IEEE International Conference on Software Maintenance and Evolution. Piscataway: IEEE, 2014:476-480. 10.1109/icsme.2014.77 |
31 | Google Code Jam[EB/OL]. [2022-05-26]. . 10.4337/9781781006481.00014 |
32 | FEY M, LENSSEN J E. Fast graph representation learning with PyTorch Geometric[EB/OL]. [2022-08-02]. . 10.1109/cvpr.2018.00097 |
33 | KINGMA D, BA J. Adam: a method for stochastic optimization[EB/OL]. [2023-04-01]. . |
34 | JIANG L, MISHERGHI G, SU Z, et al. DECKARD: scalable and accurate tree-based detection of code clones[C]// Proceedings of the 29th International Conference on Software Engineering. Piscataway: IEEE, 2007: 96-105. 10.1109/icse.2007.30 |
35 | WU Y, FENG S, ZOU D, et al. Detecting semantic code clones by building AST-based Markov chains model[C]// Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. New York: ACM, 2022: No. 34. 10.1145/3551349.3560426 |
36 | HU Y, ZOU D, PENG J, et al. TreeCen: building tree graph for scalable semantic code clone detection[C]// Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. New York: ACM, 2022: No. 109. 10.1145/3551349.3556927 |
[1] | 李顺勇, 李师毅, 胥瑞, 赵兴旺. 基于自注意力融合的不完整多视图聚类算法[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2696-2703. |
[2] | 黄云川, 江永全, 黄骏涛, 杨燕. 基于元图同构网络的分子毒性预测[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2964-2969. |
[3] | 秦璟, 秦志光, 李发礼, 彭悦恒. 基于概率稀疏自注意力神经网络的重性抑郁疾患诊断[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2970-2974. |
[4] | 王熙源, 张战成, 徐少康, 张宝成, 罗晓清, 胡伏原. 面向手术导航3D/2D配准的无监督跨域迁移网络[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2911-2918. |
[5] | 潘烨新, 杨哲. 基于多级特征双向融合的小目标检测优化模型[J]. 《计算机应用》唯一官方网站, 2024, 44(9): 2871-2877. |
[6] | 刘禹含, 吉根林, 张红苹. 基于骨架图与混合注意力的视频行人异常检测方法[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2551-2557. |
[7] | 顾焰杰, 张英俊, 刘晓倩, 周围, 孙威. 基于时空多图融合的交通流量预测[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2618-2625. |
[8] | 石乾宏, 杨燕, 江永全, 欧阳小草, 范武波, 陈强, 姜涛, 李媛. 面向空气质量预测的多粒度突变拟合网络[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2643-2650. |
[9] | 赵亦群, 张志禹, 董雪. 基于密集残差物理信息神经网络的各向异性旅行时计算方法[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2310-2318. |
[10] | 徐松, 张文博, 王一帆. 基于时空信息的轻量视频显著性目标检测网络[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2192-2199. |
[11] | 孙逊, 冯睿锋, 陈彦如. 基于深度与实例分割融合的单目3D目标检测方法[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2208-2215. |
[12] | 吴筝, 程志友, 汪真天, 汪传建, 王胜, 许辉. 基于深度学习的患者麻醉复苏过程中的头部运动幅度分类方法[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2258-2263. |
[13] | 李欢欢, 黄添强, 丁雪梅, 罗海峰, 黄丽清. 基于多尺度时空图卷积网络的交通出行需求预测[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2065-2072. |
[14] | 张郅, 李欣, 叶乃夫, 胡凯茜. 基于暗知识保护的模型窃取防御技术DKP[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2080-2086. |
[15] | 赵雅娟, 孟繁军, 徐行健. 在线教育学习者知识追踪综述[J]. 《计算机应用》唯一官方网站, 2024, 44(6): 1683-1698. |
阅读次数 | ||||||
全文 |
|
|||||
摘要 |
|
|||||