Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (4): 1259-1268.DOI: 10.11772/j.issn.1001-9081.2023040485
• Computer software technology • Previous Articles
Zexuan WAN, Chunli XIE(), Quanrun LYU, Yao LIANG
Received:
2023-04-26
Revised:
2023-07-04
Accepted:
2023-07-10
Online:
2023-12-04
Published:
2024-04-10
Contact:
Chunli XIE
About author:
WAN Zexuan, born in 1998, M. S. candidate. His research interests include code representation, code clone.Supported by:
通讯作者:
谢春丽
作者简介:
万泽轩(1998—),男,江苏徐州人,硕士研究生,主要研究方向:代码表征、代码克隆基金资助:
CLC Number:
Zexuan WAN, Chunli XIE, Quanrun LYU, Yao LIANG. Code clone detection based on dependency enhanced hierarchical abstract syntax tree[J]. Journal of Computer Applications, 2024, 44(4): 1259-1268.
万泽轩, 谢春丽, 吕泉润, 梁瑶. 基于依赖增强的分层抽象语法树的代码克隆检测[J]. 《计算机应用》唯一官方网站, 2024, 44(4): 1259-1268.
Add to citation manager EndNote|Ris|BibTeX
URL: http://www.joca.cn/EN/10.11772/j.issn.1001-9081.2023040485
克隆类型 | 所占百分比 |
---|---|
Type-1(T1) | 0.46 |
Type-2(T2) | 0.06 |
Strongly Type-3(ST3) | 0.24 |
Moderately Type-3(MT3) | 1.01 |
Weakly Type-3/Type-4(WT3/T4) | 98.23 |
Tab. 1 Percentage points of different clone types in BCB dataset
克隆类型 | 所占百分比 |
---|---|
Type-1(T1) | 0.46 |
Type-2(T2) | 0.06 |
Strongly Type-3(ST3) | 0.24 |
Moderately Type-3(MT3) | 1.01 |
Weakly Type-3/Type-4(WT3/T4) | 98.23 |
语句 | BCB中平均出现次数 | GCJ中平均出现次数 |
---|---|---|
IfStatement | 2.72 | 3.11 |
WhileStatement | 0.44 | 0.44 |
ForStatement | 0.42 | 4.06 |
BlockStatement | 3.27 | 7.05 |
SwitchStatement | 0.01 | 0.01 |
Tab. 2 Average occurrences of statements in BCB and GCJ datasets
语句 | BCB中平均出现次数 | GCJ中平均出现次数 |
---|---|---|
IfStatement | 2.72 | 3.11 |
WhileStatement | 0.44 | 0.44 |
ForStatement | 0.42 | 4.06 |
BlockStatement | 3.27 | 7.05 |
SwitchStatement | 0.01 | 0.01 |
方法 | T1 | T2 | ST3 | MT3 | WT3/T4 |
---|---|---|---|---|---|
Deckard | 0.73 | 0.71 | 0.54 | 0.21 | 0.03 |
RtvNN | 1.00 | 0.97 | 0.60 | 0.03 | 0.01 |
CDLH | 1.00 | 0.99 | 0.97 | 0.94 | 0.82 |
ASTNN | 1.00 | 1.00 | 0.99 | 0.98 | 0.93 |
FA-AST | 1.00 | 1.00 | 0.99 | 0.99 | 0.95 |
Amain | 1.00 | 1.00 | 1.00 | 0.99 | 0.95 |
TreeCen | 1.00 | 0.99 | 1.00 | 1.00 | 0.95 |
DEHAST | 1.00 | 1.00 | 0.99 | 0.99 | 0.97 |
Tab. 3 F1 scores for each clone type on BCB dataset
方法 | T1 | T2 | ST3 | MT3 | WT3/T4 |
---|---|---|---|---|---|
Deckard | 0.73 | 0.71 | 0.54 | 0.21 | 0.03 |
RtvNN | 1.00 | 0.97 | 0.60 | 0.03 | 0.01 |
CDLH | 1.00 | 0.99 | 0.97 | 0.94 | 0.82 |
ASTNN | 1.00 | 1.00 | 0.99 | 0.98 | 0.93 |
FA-AST | 1.00 | 1.00 | 0.99 | 0.99 | 0.95 |
Amain | 1.00 | 1.00 | 1.00 | 0.99 | 0.95 |
TreeCen | 1.00 | 0.99 | 1.00 | 1.00 | 0.95 |
DEHAST | 1.00 | 1.00 | 0.99 | 0.99 | 0.97 |
方法 | Precision | Recall | F1分数 |
---|---|---|---|
Deckard | 0.93 | 0.02 | 0.03 |
RtvNN | 0.95 | 0.01 | 0.02 |
CDLH | 0.92 | 0.74 | 0.82 |
ASTNN | 0.92 | 0.94 | 0.93 |
FA-AST | 0.96 | 0.94 | 0.95 |
Amain | 0.95 | 0.96 | 0.95 |
TreeCen | 0.97 | 0.93 | 0.95 |
DEHAST | 0.98 | 0.96 | 0.97 |
Tab. 4 Performance of different mehods on BCB dataset
方法 | Precision | Recall | F1分数 |
---|---|---|---|
Deckard | 0.93 | 0.02 | 0.03 |
RtvNN | 0.95 | 0.01 | 0.02 |
CDLH | 0.92 | 0.74 | 0.82 |
ASTNN | 0.92 | 0.94 | 0.93 |
FA-AST | 0.96 | 0.94 | 0.95 |
Amain | 0.95 | 0.96 | 0.95 |
TreeCen | 0.97 | 0.93 | 0.95 |
DEHAST | 0.98 | 0.96 | 0.97 |
方法 | Precision | Recall | F1分数 |
---|---|---|---|
Deckard | 0.45 | 0.44 | 0.44 |
RtvNN | 0.20 | 0.90 | 0.33 |
CDLH | 0.82 | 0.66 | 0.73 |
ASTNN | 0.87 | 0.95 | 0.91 |
Amain | 0.93 | 0.91 | 0.92 |
TreeCen | 0.93 | 0.94 | 0.93 |
DEHAST | 0.93 | 0.97 | 0.95 |
Tab. 5 Performance of different methods on GCJ dataset
方法 | Precision | Recall | F1分数 |
---|---|---|---|
Deckard | 0.45 | 0.44 | 0.44 |
RtvNN | 0.20 | 0.90 | 0.33 |
CDLH | 0.82 | 0.66 | 0.73 |
ASTNN | 0.87 | 0.95 | 0.91 |
Amain | 0.93 | 0.91 | 0.92 |
TreeCen | 0.93 | 0.94 | 0.93 |
DEHAST | 0.93 | 0.97 | 0.95 |
序号 | 消融描述 | Precision | Recall | F1分数 |
---|---|---|---|---|
1 | 原始AST及其依赖 | 0.88 | 0.91 | 0.89 |
2 | 去掉语句层相关依赖的DEHAST | 0.91 | 0.94 | 0.92 |
3 | 去掉子树层相关依赖的DEHAST | 0.89 | 0.86 | 0.87 |
4 | 标准模型DEHAST | 0.98 | 0.96 | 0.97 |
Tab. 6 Results of ablation experiments
序号 | 消融描述 | Precision | Recall | F1分数 |
---|---|---|---|---|
1 | 原始AST及其依赖 | 0.88 | 0.91 | 0.89 |
2 | 去掉语句层相关依赖的DEHAST | 0.91 | 0.94 | 0.92 |
3 | 去掉子树层相关依赖的DEHAST | 0.89 | 0.86 | 0.87 |
4 | 标准模型DEHAST | 0.98 | 0.96 | 0.97 |
1 | ROY C K, CORDY J R. A survey on software clone detection research: Technical Report No. 2007-541[R]. Ontario, Canada: Queen’s University at Kingston, School of Computing, 2007: 64-68. |
2 | KAMIYA T, KUSUMOTO S, INOUE K. CCFinder: a multilinguistic token-based code clone detection system for large scale source code[J]. IEEE Transactions on Software Engineering, 2002, 28(7):654-670. 10.1109/tse.2002.1019480 |
3 | LIU C, CHEN C, HAN J, et al. GPLAG: detection of software plagiarism by program dependence graph analysis[C]// Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2006:872-881. 10.1145/1150402.1150522 |
4 | WANG P, SVAJLENKO J, WU Y, et al. CCAligner: a token based large-gap clone detector[C]// Proceedings of the 40th International Conference on Software Engineering. New York: ACM, 2018:1066-1077. 10.1145/3180155.3180179 |
5 | GABEL M, JIANG L, SU Z. Scalable detection of semantic clones[C]// Proceedings of the 30th International Conference on Software Engineering. New York: ACM, 2008: 321-330. 10.1145/1368088.1368132 |
6 | ZHANG J, WANG X, ZHANG H, et al. A novel neural source code representation based on abstract syntax tree[C]// Proceedings of the 41st International Conference on Software Engineering. Piscataway: IEEE, 2019: 783-794. 10.1109/icse.2019.00086 |
7 | WHITE M, TUFANO M, VENDOME C, et al. Deep learning code fragments for code clone detection[C]// Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. New York: ACM, 2016: 87-98. 10.1145/2970276.2970326 |
8 | WANG M, WANG P, XU Y. CCSharp: an efficient three-phase code clone detector using modified PDGs[C]// Proceedings of the 2017 24th Asia-Pacific Software Engineering Conference. Piscataway: IEEE, 2017:100-109. 10.1109/apsec.2017.16 |
9 | WEI H-H, LI M. Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code[C]// Proceedings of the 26th International Joint Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2017: 3034-3040. 10.24963/ijcai.2017/423 |
10 | TAI K S, SOCHER R, MANNING C D. Improved semantic representations from tree-structured long short-term memory networks[C]// Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing. Stroudsberg: ACL, 2015:1556-1566. 10.3115/v1/p15-1150 |
11 | WANG W, LI G, MA B, et al. Detecting code clones with graph neural network and flow-augmented abstract syntax tree[C]// Proceedings of the 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering. Piscataway: IEEE, 2020: 261-271. 10.1109/saner48275.2020.9054857 |
12 | ZHAO G, HUANG J. DeepSim: deep learning code functional similarity[C]// Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York: ACM, 2018: 141-151. 10.1145/3236024.3236068 |
13 | LI Y, GU C, DULLIEN T, et al. Graph matching networks for learning the similarity of graph structured objects[C]// Proceedings of the 36th International Conference on Machine Learning. New York: JMLR.org, 2019: 3835-3845. |
14 | 乐乔艺,刘建勋,孙晓平,等.代码克隆检测研究进展综述[J].计算机科学,2021,48(11A):509-522. 10.11896/jsjkx.210300310 |
LE Q Y, LIU J X, SUN X P, et al. Survey of research progress of code cloning detection[J]. Computer Science,2021,48(11A):509-522. 10.11896/jsjkx.210300310 | |
15 | DUCASSE S, RIEGER M, DEMEYER S. A language independent approach for detecting duplicated code[C]// Proceedings of the 15th IEEE International Conference on Software Maintenance. Piscataway: IEEE, 1999:109-118. 10.1109/icsm.1999.792593 |
16 | RAGKHITWETSAGUL C, KRINKE J. Using compilation/decompilation to enhance clone detection[C]// Proceedings of the 2017 IEEE 11th International Workshop on Software Clones. Piscataway: IEEE, 2017:1-7. 10.1109/iwsc.2017.7880502 |
17 | ROY C K, CORDY J R. NICAD: accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization[C]// Proceedings of the 2008 IEEE 16th IEEE International Conference on Program Comprehension. Piscataway: IEEE, 2008: 172-181. 10.1109/icpc.2008.41 |
18 | NISHI M A, DAMEVSKI K. Scalable code clone detection and search based on adaptive prefix filtering[J]. Journal of Systems and Software, 2018, 137:130-142. 10.1016/j.jss.2017.11.039 |
19 | LI L, FENG H, ZHUANG W, et al. CCLearner: a deep learning-based clone detection approach[C]// Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution. Piscataway: IEEE, 2017: 249-260. 10.1109/icsme.2017.46 |
20 | SAJNANI H, SAINI V, SVAJLENKO J, et al. SourcererCC: scaling code clone detection to big-code[C]// Proceedings of the 2016 IEEE/ACM 38th International Conference on Software Engineering. Washington, DC: IEEE Computer Society, 2016:1157-1168. 10.1145/2884781.2884877 |
21 | SVAJLENKO J, ROY C K. Fast and flexible large-scale clone detection with CloneWorks[C]// Proceedings of the 39th International Conference on Software Engineering Companion. Piscataway: IEEE, 2017:27-30. 10.1109/icse-c.2017.3 |
22 | CHODAREV S, PIETRIKOVÁ E, KOLLÁR J. Haskell clone detection using pattern comparing algorithm[C]// Proceedings of the 2015 13th International Conference on Engineering of Modern Electric Systems. Piscataway: IEEE, 2015:1-4. 10.1109/emes.2015.7158423 |
23 | YUAN D, FANG S, ZHANG T, et al. Java code clone detection by exploiting semantic and syntax information from intermediate code-based graph[J]. IEEE Transactions on Reliability, 2022, 72(2): 511-526. 10.1109/tr.2022.3176922 |
24 | HU Y, FANG Y, SUN Y, et al. Code2Img: tree-based image transformation for scalable code clone detection[J]. IEEE Transactions on Software Engineering, 2023, 49(9): 4429-4442. 10.1109/tse.2023.3295801 |
25 | MOU L, LI G, ZHANG L, et al. Convolutional neural networks over tree structures for programming language processing[C]// Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2016:1287-1293. 10.1609/aaai.v30i1.10139 |
26 | SCARSELLI F, GORI M, TSOI A C, et al. The graph neural network model[J]. IEEE Transactions on Neural Networks,2009,20(1):61-80. 10.1109/tnn.2008.2005605 |
27 | GILMER J, SCHOENHOLZ S S, RILEY P F, et al. Neural message passing for quantum chemistry[C]// Proceedings of the 34th International Conference on Machine Learning. New York: JMLR.org, 2017:1263-1272. |
28 | ALLAMANIS M, BROCKSCHMIDT M, KHADEMI M. Learning to represent programs with graphs[EB/OL]. [2022-07-30]. . |
29 | NGUYEN M, BUI N D Q. Learning to represent programs with code hierarchies[EB/OL]. (2022-05-31)[2022-07-30]. . 10.48550/arXiv.2205.15479 |
30 | SVAJLENKO J, ISLAM J F, KEIVANLOO I, et al. Towards a big data curated benchmark of inter-project code clones[C]// Proceedings of the 2014 IEEE International Conference on Software Maintenance and Evolution. Piscataway: IEEE, 2014:476-480. 10.1109/icsme.2014.77 |
31 | Google Code Jam[EB/OL]. [2022-05-26]. . 10.4337/9781781006481.00014 |
32 | FEY M, LENSSEN J E. Fast graph representation learning with PyTorch Geometric[EB/OL]. [2022-08-02]. . 10.1109/cvpr.2018.00097 |
33 | KINGMA D, BA J. Adam: a method for stochastic optimization[EB/OL]. [2023-04-01]. . |
34 | JIANG L, MISHERGHI G, SU Z, et al. DECKARD: scalable and accurate tree-based detection of code clones[C]// Proceedings of the 29th International Conference on Software Engineering. Piscataway: IEEE, 2007: 96-105. 10.1109/icse.2007.30 |
35 | WU Y, FENG S, ZOU D, et al. Detecting semantic code clones by building AST-based Markov chains model[C]// Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. New York: ACM, 2022: No. 34. 10.1145/3551349.3560426 |
36 | HU Y, ZOU D, PENG J, et al. TreeCen: building tree graph for scalable semantic code clone detection[C]// Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. New York: ACM, 2022: No. 109. 10.1145/3551349.3556927 |
[1] | Rui TANG, Shibo YUE, Ruizhi ZHANG, Chuan LIU, Chuanlin PANG. Energy efficiency optimization mechanism for UAV-assisted and non-orthogonal multiple access-enabled data collection system [J]. Journal of Computer Applications, 2024, 44(4): 1209-1218. |
[2] | Xiangjie SUN, Qiang WEI, Yisen WANG, Jiang DU. Survey of code similarity detection technology [J]. Journal of Computer Applications, 2024, 44(4): 1248-1258. |
[3] | Xianfeng YANG, Yilei TANG, Ziqiang LI. Aspect-level sentiment analysis model based on alternating‑attention mechanism and graph convolutional network [J]. Journal of Computer Applications, 2024, 44(4): 1058-1064. |
[4] | Pengfei ZHANG, Litao HAN, Hengjian FENG, Hongmei LI. Point cloud semantic segmentation based on attention mechanism and global feature optimization [J]. Journal of Computer Applications, 2024, 44(4): 1086-1092. |
[5] | Yaoyao TANG, Yechen ZHU, Yangchuan LIU, Xin GAO. Research status and prospect of CT image ring artifact removal methods [J]. Journal of Computer Applications, 2024, 44(3): 890-900. |
[6] | Yuqiu LI, Liping HOU, Jian XUE, Ke LYU, Yong WANG. Remote sensing image recommendation method based on content interpretation [J]. Journal of Computer Applications, 2024, 44(3): 722-731. |
[7] | Dapeng XU, Xinmin HOU. Feature selection method for graph neural network based on network architecture design [J]. Journal of Computer Applications, 2024, 44(3): 663-670. |
[8] | Meiyu CAI, Runzhe ZHU, Fei WU, Kaiyu ZHANG, Jiale LI. Cross-view matching model based on attention mechanism and multi-granularity feature fusion [J]. Journal of Computer Applications, 2024, 44(3): 901-908. |
[9] | Weina DONG, Jia LIU, Xiaozhong PAN, Lifeng CHEN, Wenquan SUN. High-capacity robust image steganography scheme based on encoding-decoding network [J]. Journal of Computer Applications, 2024, 44(3): 772-779. |
[10] | Kui ZHAO, Huiqi QIU, Xu LI, Zhifei XU. Real-time pulmonary nodule detection algorithm combining attention and multipath fusion [J]. Journal of Computer Applications, 2024, 44(3): 945-952. |
[11] | Xiang LIU, Bei HUA, Fei LIN, Hongyuan WEI. Design and implementation of component-based development framework for deep learning applications [J]. Journal of Computer Applications, 2024, 44(2): 526-535. |
[12] | Yudan SONG, Jing WANG, Xuehui WANG, Zhaoyang MA, Youfang LIN. Sleep physiological time series classification method based on adaptive multi-task learning [J]. Journal of Computer Applications, 2024, 44(2): 654-662. |
[13] | Jiawei ZHANG, Guandong GAO, Ke XIAO, Shengzun SONG. Violent crime hierarchy algorithm by joint modeling of improved hierarchical attention network and TextCNN [J]. Journal of Computer Applications, 2024, 44(2): 403-410. |
[14] | Zhiwen JING, Yujia ZHANG, Boting SUN, Hao GUO. Two-stage recommendation algorithm of Siamese graph convolutional neural network [J]. Journal of Computer Applications, 2024, 44(2): 469-476. |
[15] | Hualu QIU, Suzhen LIN, Yanbo WANG, Feng LIU, Dawei LI. Reconstruction algorithm for undersampled magnetic resonance images based on complex convolution dual-domain cascade network [J]. Journal of Computer Applications, 2024, 44(2): 580-587. |
Viewed | ||||||
Full text |
|
|||||
Abstract |
|
|||||