Journal of Computer Applications ›› 2017, Vol. 37 ›› Issue (4): 1135-1142. DOI: 10.11772/j.issn.1001-9081.2017.04.1135

• Computer Software Technology •

Feature selection model for harmfulness prediction of clone code

WANG Huan, ZHANG Liping, YAN Sheng, LIU Dongsheng

  1. College of Computer and Information Engineering, Inner Mongolia Normal University, Hohhot, Nei Mongol 010022, China
  • Received: 2016-08-24 Revised: 2016-09-30 Online: 2017-04-10 Published: 2017-04-19
  • Corresponding author: ZHANG Liping
  • About the authors: WANG Huan (born 1991), male, from Bayannur, Inner Mongolia, master's student; main research interest: code analysis. ZHANG Liping (born 1974), female, from Hohhot, Inner Mongolia, professor, Ph.D., CCF member; main research interests: software engineering, software analysis. YAN Sheng (born 1984), male, from Baotou, Inner Mongolia, lecturer, master's degree; main research interests: software analysis, parallel computing. LIU Dongsheng (born 1956), male, from Hohhot, Inner Mongolia, professor; main research interests: software analysis, computer education.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61363017, 61462071) and the Natural Science Foundation of Inner Mongolia Autonomous Region (2014MS0613, 2015MS0606).

Abstract: To address irrelevant and redundant features in the harmfulness prediction of clone code, a combined feature selection model based on relevance and influence is proposed. First, the features are given a preliminary ranking by relevance using the information gain ratio. The highly ranked features are then retained and the remaining irrelevant features are removed, which reduces the feature search space. Next, the optimal feature subset is determined by combining the wrapper-style sequential floating forward selection algorithm with each of six classifiers, including Naive Bayes. Finally, the different feature selection methods are compared, and the strengths of each method under different selection criteria are exploited to analyze, filter and optimize the feature data. Experimental results show that, compared with prediction without feature selection, the harmfulness prediction accuracy increases by 15.2 to 34 percentage points. Compared with other feature selection methods, the F1-measure of the proposed method increases by 1.1 to 10.1 percentage points and the AUC increases by 0.7 to 22.1 percentage points. The method can therefore substantially improve the accuracy of the harmfulness prediction model.

Key words: clone code, harmfulness prediction, feature subset, information gain ratio, feature selection
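The two-stage procedure summarized in the abstract (a gain-ratio filter followed by a wrapper search with sequential floating forward selection) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `keep` cutoff, the toy data, and the use of a leave-one-out 1-nearest-neighbour classifier in place of the paper's six classifiers are all assumptions made for illustration, and features are assumed to be discrete.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(column, labels):
    """Information gain of a feature column about the labels, divided by its split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(column, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(column)
    return (entropy(labels) - conditional) / split_info if split_info > 0 else 0.0

def loo_accuracy(rows, labels, subset):
    """Wrapper criterion (stand-in classifier, an assumption): leave-one-out
    accuracy of 1-nearest-neighbour with Hamming distance on `subset`."""
    hits = 0
    for i, row in enumerate(rows):
        nearest = min((j for j in range(len(rows)) if j != i),
                      key=lambda j: sum(row[f] != rows[j][f] for f in subset))
        hits += labels[nearest] == labels[i]
    return hits / len(rows)

def sffs(rows, labels, candidates, score=loo_accuracy):
    """Sequential floating forward selection; returns (best score, best subset)."""
    best_of_size = {}  # subset size -> (score, subset)
    current = []
    while len(current) < len(candidates):
        # Forward step: add the single feature that maximizes the criterion.
        remaining = [f for f in candidates if f not in current]
        add = max(remaining, key=lambda f: score(rows, labels, current + [f]))
        current = current + [add]
        s = score(rows, labels, current)
        if len(current) not in best_of_size or s > best_of_size[len(current)][0]:
            best_of_size[len(current)] = (s, current)
        # Floating step: drop features while that beats the best subset of the smaller size.
        while len(current) > 2:
            drop = max(current, key=lambda f: score(
                rows, labels, [g for g in current if g != f]))
            reduced = [g for g in current if g != drop]
            s = score(rows, labels, reduced)
            if s <= best_of_size[len(reduced)][0]:
                break
            current = reduced
            best_of_size[len(reduced)] = (s, reduced)
    # Prefer the highest score; break ties in favour of the smaller subset.
    return max(best_of_size.values(), key=lambda t: (t[0], -len(t[1])))

def select_features(rows, labels, keep):
    """Stage 1: rank by gain ratio and keep the top `keep`; stage 2: SFFS on the survivors."""
    ranked = sorted(range(len(rows[0])),
                    key=lambda f: gain_ratio([r[f] for r in rows], labels),
                    reverse=True)
    return sffs(rows, labels, ranked[:keep])

# Hypothetical toy data: feature 0 determines the label, feature 1 duplicates it
# (redundant), feature 2 is independent of the label, feature 3 is constant.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
rows = [[y, y, i % 2, 0] for i, y in enumerate(labels)]
best_score, best_subset = select_features(rows, labels, keep=2)
```

On this toy data the filter stage discards the irrelevant and constant features, and the wrapper stage drops the redundant copy because it adds nothing to the criterion. Swapping `loo_accuracy` for a cross-validated score from another classifier changes only the `score` argument.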
