Journal of Computer Applications, 2019, Vol. 39, Issue (7): 2074-2080. DOI: 10.11772/j.issn.1001-9081.2019010083

• Computer software technology •

Clone code detection based on image similarity

WANG Yafang, LIU Dongsheng, HOU Min   

  1. College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot, Nei Mongol 010022, China
  • Received: 2019-01-14  Revised: 2019-03-29  Online: 2019-04-15  Published: 2019-07-10
  • Supported by:

    This work is partially supported by the National Natural Science Foundation of China (61363017), the Natural Science Foundation of Inner Mongolia Autonomous Region (2016MS0612), and the Foundation Projects of Inner Mongolia Education Department (NJZY18025).

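  • Corresponding author: LIU Dongsheng
  • About the authors: WANG Yafang (born 1994), female, from Ordos, Inner Mongolia, M.S. candidate; research interests: software analysis and code analysis. LIU Dongsheng (born 1956), male, from Hohhot, Inner Mongolia, professor, CCF member; research interests: software analysis, code analysis, and computer-aided instruction. HOU Min (born 1973), female, from Hohhot, Inner Mongolia, lecturer, M.S.; research interests: software analysis and computer-aided instruction.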

Abstract:

Research on clone code detection has mainly approached the problem from four perspectives: text, lexical, syntactic, and semantic. However, detection effectiveness has seen few breakthroughs for a long time. To address this problem, a new method that works from the perspective of image processing, called Clone Code detection based on Image Similarity (CCIS), was proposed. Firstly, the source code was preprocessed by removing comments, white space, etc. to obtain "clean" function fragments, and the identifiers, keywords, etc. in each function were highlighted. Then, the processed source code was converted into images, and these images were normalized. Finally, Jaccard distance and a perceptual hash algorithm were used to compare the images and obtain the clone code information. To verify the validity of the method, six open source software projects were used to build the evaluation dataset. The experimental results show that the CCIS method detects 100% of type-1 clones, 88% of type-2 clones, and 60% of type-3 clones, indicating that CCIS is effective for clone code detection.

Key words: clone code, clone detection, Jaccard distance, perceptual hash algorithm, syntax highlighting
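
The two similarity measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the images are rendered, which sets the Jaccard distance is taken over, or which perceptual hash variant is used. The sketch assumes that an image is a small grid of grayscale values, that Jaccard distance is computed over the coordinates of dark ("ink") pixels, and that the hash is an average hash, the simplest member of the perceptual hash family. All function names and the toy images are hypothetical.

```python
from typing import List, Set, Tuple

Grid = List[List[int]]  # grayscale image as rows of 0-255 intensities (assumed format)


def foreground_pixels(img: Grid, threshold: int = 128) -> Set[Tuple[int, int]]:
    """Coordinates of pixels darker than `threshold` (treated here as 'ink')."""
    return {
        (r, c)
        for r, row in enumerate(img)
        for c, value in enumerate(row)
        if value < threshold
    }


def jaccard_distance(a: set, b: set) -> float:
    """1 - |A & B| / |A | B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_hash(img: Grid) -> int:
    """Average hash: one bit per pixel, set when the pixel is >= the mean intensity."""
    pixels = [value for row in img for value in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for value in pixels:
        bits = (bits << 1) | (1 if value >= mean else 0)
    return bits


def hamming_distance(h1: int, h2: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(h1 ^ h2).count("1")


# Two toy 4x4 "images" that differ in a single pixel (row 3, column 2).
img_a = [
    [0, 255, 255, 0],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 0, 0, 255],
]
img_b = [row[:] for row in img_a]
img_b[3][2] = 255

pixel_distance = jaccard_distance(foreground_pixels(img_a), foreground_pixels(img_b))
hash_distance = hamming_distance(average_hash(img_a), average_hash(img_b))
```

With these toy inputs both distances are small, which is the behavior a clone detector would rely on: near-identical fragments yield near-identical images, and a threshold on either distance (a tuning choice not specified in the abstract) would decide whether a pair is reported as a clone.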

CLC Number: