Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (12): 3468-3475. DOI: 10.11772/j.issn.1001-9081.2016.12.3468

• Computer software technology •

Solution for classification imbalance in harmfulness prediction of clone code

WANG Huan, ZHANG Liping, YAN Sheng

  1. College of Computer and Information Engineering, Inner Mongolia Normal University, Hohhot, Nei Mongol 010022, China
  • Received: 2016-05-30 Revised: 2016-07-07 Online: 2016-12-10 Published: 2016-12-08
  • Corresponding author: ZHANG Liping
  • About the authors: WANG Huan (born 1991), male, from Bayannur, Inner Mongolia, is a master's student whose main research interest is code analysis. ZHANG Liping (born 1974), female, from Hohhot, Inner Mongolia, is a professor with a master's degree and a CCF member; her main research interests are software engineering and software analysis. YAN Sheng (born 1984), male, from Baotou, Inner Mongolia, is a lecturer with a master's degree; his main research interests are software analysis and parallel computing.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61363017, 61462071), the Natural Science Foundation of Inner Mongolia (2015MS0606), and the Scientific Research Project of Higher Education Institutions of Inner Mongolia Autonomous Region (NJZY16045).

Abstract: To address the class imbalance between harmful and harmless data in harmfulness prediction of clone code, K-Balance, an algorithm based on Random Under-Sampling (RUS) that adjusts the class imbalance automatically, is proposed. First, static features and evolution features are extracted from clone code to build a sample dataset. Then, new datasets with different class imbalance ratios are selected. Next, harmfulness prediction is performed on each selected dataset. Finally, the most suitable imbalance ratio is chosen automatically by observing how the classifier performs across the ratios. The harmfulness prediction model was evaluated on 170 versions of seven open-source software systems written in C and compared with other class imbalance solutions. The experimental results show that the proposed method improves the classification of harmful and harmless clones, measured by the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC), by 2.62 to 36.70 percentage points, and that it can effectively alleviate the class imbalance problem in prediction.

Key words: code clone, harmfulness, imbalanced classification, random under-sampling, parameter search
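
The abstract describes a loop: under-sample at several imbalance ratios, run prediction on each resampled dataset, and keep the ratio where the classifier does best. The following is a minimal Python sketch of that loop, not the authors' K-Balance implementation. The abstract does not specify the candidate ratios, the classifier, or the selection criterion, so everything here is an assumption: the function names (`under_sample`, `auc`, `fit_centroid_scorer`, `search_balance_ratio`) are hypothetical, the nearest-centroid scorer is a toy stand-in for a real classifier, the minority class is detected by count, and the ratio is selected by AUC on a held-out validation set.

```python
import random


def under_sample(X, y, ratio, rng):
    """Randomly under-sample the majority class to about `ratio` times the minority size."""
    ones = [i for i, label in enumerate(y) if label == 1]
    zeros = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (ones, zeros) if len(ones) <= len(zeros) else (zeros, ones)
    n_keep = min(len(majority), max(1, round(ratio * len(minority))))
    idx = minority + rng.sample(majority, n_keep)
    return [X[i] for i in idx], [y[i] for i in idx]


def auc(scores, labels):
    """AUC as the chance that a harmful sample outscores a harmless one (ties count 0.5)."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_centroid_scorer(X, y):
    """Toy classifier (assumption): score rises as a sample nears the harmful centroid."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

    c1 = centroid([x for x, label in zip(X, y) if label == 1])
    c0 = centroid([x for x, label in zip(X, y) if label == 0])
    return lambda x: dist(x, c0) - dist(x, c1)


def search_balance_ratio(X_train, y_train, X_val, y_val, ratios, seed=0):
    """Try each candidate ratio and return the one with the highest validation AUC."""
    rng = random.Random(seed)
    results = {}
    for ratio in ratios:
        Xs, ys = under_sample(X_train, y_train, ratio, rng)
        score = fit_centroid_scorer(Xs, ys)
        results[ratio] = auc([score(x) for x in X_val], y_val)
    best = max(results, key=results.get)
    return best, results
```

A call such as `search_balance_ratio(X_train, y_train, X_val, y_val, ratios=[1, 2, 3])` returns the selected ratio together with the AUC observed at each candidate. Labels are assumed to be 1 for harmful and 0 for harmless clones, and each sample is a list of numeric feature values.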
