计算机应用 ›› 2015, Vol. 35 ›› Issue (12): 3536-3543.DOI: 10.11772/j.issn.1001-9081.2015.12.3536

• 计算机软件技术 • 上一篇    下一篇

基于Token编辑距离检测克隆代码

张久杰, 王春晖, 张丽萍, 侯敏, 刘东升   

  1. 内蒙古师范大学计算机与信息工程学院, 呼和浩特 010022
  • 收稿日期:2015-06-12 修回日期:2015-09-06 出版日期:2015-12-10 发布日期:2015-12-10
  • 通讯作者: 刘东升(1956-),男,内蒙古呼和浩特人,教授,CCF会员,主要研究方向:软件工程、计算机教育
  • 作者简介:张久杰(1990-),男,内蒙古赤峰人,硕士研究生,主要研究方向:软件分析;王春晖(1979-),女(蒙古族),内蒙古通辽人,讲师,硕士,CCF会员,主要研究方向:软件分析、多媒体、计算机辅助教学;张丽萍(1974-),女,内蒙古呼和浩特人,教授,硕士,主要研究方向:软件工程、软件分析;侯敏(1979-),女,内蒙古呼和浩特人,讲师,硕士,主要研究方向:软件分析、计算机教育。
  • 基金资助:
    国家自然科学基金资助项目(61363017,61462071);内蒙古自然科学基金资助项目(2014MS0613);内蒙古自治区硕士研究生科研创新基金资助项目(S20141013524);内蒙古师范大学研究生科研创新基金资助项目(CXJJS14077)。

Clone code detection based on Levenshtein distance of token

ZHANG Jiujie, WANG Chunhui, ZHANG Liping, HOU Min, LIU Dongsheng   

  1. College of Computer and Information Engineering, Inner Mongolia Normal University, Hohhot Nei Mongol 010022, China
  • Received:2015-06-12 Revised:2015-09-06 Online:2015-12-10 Published:2015-12-10

摘要: 针对当前Type-3克隆代码检测工具较少、效率偏低等问题,提出了一种基于Token的能有效检测Type-3克隆代码的检测方法。该方法同时能有效检测Type-1和Type-2克隆代码。首先将源代码Token化得到特定代码粒度的Token串,其次将所有Token串的定长子串进行映射,在对映射信息进行查询的基础上,利用编辑距离算法确定克隆对,然后通过并查集算法快速构建克隆群,最终反馈克隆代码信息。实现了原型工具FClones,利用基于代码突变的框架对工具进行了评价,并与领域内较优秀的两款工具NiCad及SimCad进行了对比。实验结果表明,FClones在检测三类克隆代码时查全率均不低于95%,查准率均不低于98%,能更好地检测Type-3克隆代码。

关键词: 克隆代码, 克隆检测, 编辑距离, Type-3, token

Abstract: Aiming at the problems of less clone code detection tools and low efficiency for the current Type-3, an effective clone code detection method for Type-3 based on the levenshtein distance of token was proposed. Type-1, Type-2 and Type-3 clone codes could be detected by the proposed method in an efficient way. Firstly, the source codes of a subject system were tokenized into some token sequences with specified code size. Secondly, each definite-sized substring of the token sequences was mapped with corresponding index. Thirdly, the clone pairs were built by the levenshtein distance algorithm and the clone groups were built by the disjoint-set algorithm on the basis of the mapping information query. Finally, the feedback information of clone codes were given. A prototype tool named FClones was implemented. It was evaluated by the code mutation-based framework and compared with two state-of-the-art tools SimCad and NiCad. The experimental results show that the recall of FCloens is equal to or greater than 95% and its precision is not lower than 98% in detecting all of these three types of clone codes. FClones can do better in detecting Type-3 clones than others.

Key words: clone code, clone detection, Levenshtein distance, Type-3, token

中图分类号: