Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (4): 1259-1268. DOI: 10.11772/j.issn.1001-9081.2023040485

• Computer software technology •

Code clone detection based on dependency enhanced hierarchical abstract syntax tree

Zexuan WAN, Chunli XIE, Quanrun LYU, Yao LIANG

  1. School of Computer Science and Technology, Jiangsu Normal University, Xuzhou Jiangsu 221116, China
  • Received: 2023-04-26 Revised: 2023-07-04 Accepted: 2023-07-10 Online: 2023-12-04 Published: 2024-04-10
  • Contact: Chunli XIE
  • About author: WAN Zexuan, born in 1998, a native of Xuzhou, Jiangsu, M. S. candidate. His research interests include code representation and code clones.
    XIE Chunli, born in 1979, a native of Xuzhou, Jiangsu, Ph. D., associate professor, CCF member. Her research interests include intelligent software, code clones, and code representation. E-mail: 6020030132@jsnu.edu.cn
    LYU Quanrun, born in 1999, a native of Xuzhou, Jiangsu, M. S. candidate. His research interests include code representation and code clones.
    LIANG Yao, born in 1997, a native of Xuzhou, Jiangsu, M. S. candidate, CCF member. Her research interests include code analysis.
  • Supported by:
    National Natural Science Foundation of China (62276119); Graduate Research and Practice Innovation Program of Jiangsu Normal University (2022XKT1530)

Abstract:

In software engineering, code clone detection methods based on semantic similarity can reduce software maintenance costs and help prevent system vulnerabilities. As a typical abstract representation of code, the Abstract Syntax Tree (AST) has been applied successfully to code clone detection for many programming languages. However, existing work mainly extracts code semantics from the original AST and does not fully exploit the deeper semantic and structural information it contains. To address this problem, a code clone detection method based on the Dependency Enhanced Hierarchical Abstract Syntax Tree (DEHAST) was proposed. Firstly, the AST was layered, dividing it into different semantic levels. Secondly, corresponding dependency enhancement edges were added to the different levels of the AST to construct DEHAST, transforming a simple AST into a heterogeneous graph with richer program semantics. Finally, a Graph Matching Network (GMN) model was used to measure the similarity of the heterogeneous graphs and thereby detect code clones. Experimental results on two datasets, BigCloneBench and Google Code Jam, show that DEHAST detects 100% of Type-1 and Type-2 code clones, 99% of Type-3 code clones, and 97% of Type-4 code clones; compared with the tree-based method ASTNN (AST-based Neural Network), the F1 scores all increase by 4 percentage points. These results indicate that DEHAST can effectively detect semantic code clones.

Key words: code clone detection, semantic clone, Abstract Syntax Tree (AST), deep learning, Graph Matching Network (GMN)
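
As a rough illustration of the first two steps described in the abstract (layering the AST and adding dependency edges to obtain a heterogeneous graph), the sketch below uses Python's standard `ast` module on a small Python snippet. The function name `build_graph`, the layer labels (`stmt`, `expr`, `other`), and the edge types (`child`, `next_stmt`, `data_dep`) are assumptions made for illustration only; the abstract does not specify the actual layers or edge types used in DEHAST, and the GMN similarity step is not shown.

```python
import ast
from collections import defaultdict


def build_graph(source):
    """Turn Python source into a toy heterogeneous graph derived from its AST.

    Returns (nodes, edges): nodes is a list of (type_name, layer) tuples
    indexed by node id; edges maps an edge type to a list of (src, dst) ids.
    """
    tree = ast.parse(source)
    nodes = []
    edges = defaultdict(list)
    last_def = {}  # variable name -> id of the node that last bound it

    def layer_of(node):
        # Hypothetical layering: statements vs. expressions vs. everything else.
        if isinstance(node, ast.stmt):
            return "stmt"
        if isinstance(node, ast.expr):
            return "expr"
        return "other"

    def visit(node):
        nid = len(nodes)
        nodes.append((type(node).__name__, layer_of(node)))

        # Illustrative data-dependency edges: last binding -> later use.
        if isinstance(node, ast.arg):
            last_def[node.arg] = nid
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load) and node.id in last_def:
                edges["data_dep"].append((last_def[node.id], nid))
            elif isinstance(node.ctx, ast.Store):
                last_def[node.id] = nid

        fields = list(ast.iter_fields(node))
        if isinstance(node, ast.Assign):
            # Visit the right-hand side before the targets it is bound to.
            fields.sort(key=lambda kv: kv[0] != "value")

        for _, value in fields:
            items = value if isinstance(value, list) else [value]
            prev_stmt = None
            for child in items:
                if not isinstance(child, ast.AST):
                    continue
                if isinstance(child, ast.expr_context):
                    continue  # Load/Store markers carry no structure
                cid = visit(child)
                edges["child"].append((nid, cid))
                # Illustrative control-flow edges between sibling statements.
                if isinstance(child, ast.stmt):
                    if prev_stmt is not None:
                        edges["next_stmt"].append((prev_stmt, cid))
                    prev_stmt = cid
        return nid

    visit(tree)
    return nodes, dict(edges)


example = """
def f(a):
    x = a + 1
    y = x * 2
    return y
"""
nodes, edges = build_graph(example)
for edge_type, pairs in edges.items():
    print(edge_type, len(pairs))
```

The `child` edges reproduce the original tree, while the two extra edge types turn it into a graph with several edge kinds, which is the general shape of input a graph matching model would compare across two code fragments.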

CLC Number: