Feature selection model for harmfulness prediction of clone code

doi:10.11772/j.issn.1001-9081.2017.04.1135

Journal of Computer Applications, 2017, Vol. 37, Issue (4): 1135-1142. DOI: 10.11772/j.issn.1001-9081.2017.04.1135

WANG Huan, ZHANG Liping, YAN Sheng, LIU Dongsheng

College of Computer and Information Engineering, Inner Mongolia Normal University, Hohhot, Nei Mongol 010022, China

Received: 2016-08-24 Revised: 2016-09-30 Online: 2017-04-10 Published: 2017-04-19
Corresponding author: ZHANG Liping
About the authors: WANG Huan (born 1991), male, from Bayannur, Inner Mongolia, M.S. candidate; main research interest: code analysis. ZHANG Liping (born 1974), female, from Hohhot, Inner Mongolia, professor, Ph.D., CCF member; main research interests: software engineering and software analysis. YAN Sheng (born 1984), male, from Baotou, Inner Mongolia, lecturer, M.S.; main research interests: software analysis and parallel computing. LIU Dongsheng (born 1956), male, from Hohhot, Inner Mongolia, professor; main research interests: software analysis and computer education.
Supported by:
This work is partially supported by the National Natural Science Foundation of China (61363017, 61462071) and the Natural Science Foundation of Inner Mongolia Autonomous Region (2014MS0613, 2015MS0606).

Abstract

Abstract: To address irrelevant and redundant features in harmfulness prediction of clone code, a combined feature selection model based on relevance and influence is proposed. First, the features are ranked by relevance using the information gain ratio. The highest-ranked features are then retained and the remaining irrelevant features are removed, which reduces the feature search space. Next, the optimal feature subset is determined by combining the wrapper-based sequential floating forward selection (SFFS) algorithm with each of six classifiers, including Naive Bayes. Finally, the different feature selection methods are compared, and the strengths of each method under different selection criteria are used to analyze, filter, and optimize the feature data. Experimental results show that, compared with no feature selection, prediction accuracy increases by 15.2 to 34 percentage points; compared with other feature selection methods, the F1-measure increases by 1.1 to 10.1 percentage points and the AUC by 0.7 to 22.1 percentage points. The method can therefore substantially improve the accuracy of the harmfulness prediction model.

Key words: clone code, harmfulness prediction, feature subset, information gain ratio, feature selection
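
The two-stage procedure described in the abstract (a gain-ratio filter followed by a wrapper SFFS search) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes discrete feature values, and the function names (`gain_ratio`, `filter_stage`, `sffs`) and the `score` callback are hypothetical. In the paper's setting, `score` would presumably be the prediction performance of one of the six classifiers on a candidate subset.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, divided by the feature's own entropy."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(feature)
    if split_info == 0:
        return 0.0  # a constant feature carries no information
    return (entropy(labels) - conditional) / split_info


def filter_stage(columns, labels, keep):
    """Rank feature names by gain ratio and keep the `keep` highest-ranked ones."""
    ranked = sorted(columns, key=lambda name: gain_ratio(columns[name], labels),
                    reverse=True)
    return ranked[:keep]


def sffs(candidates, score, target_size):
    """Sequential floating forward selection (assumes target_size <= len(candidates)).

    `score(subset)` is a hypothetical subset-quality callback, for example the
    cross-validated accuracy of a classifier trained on that subset.
    """
    selected = []
    best = {}  # subset size -> (best score seen, subset)
    while len(selected) < target_size:
        # Forward step: add the feature that gives the highest score.
        remaining = [f for f in candidates if f not in selected]
        add = max(remaining, key=lambda f: score(selected + [f]))
        selected = selected + [add]
        s = score(selected)
        if len(selected) not in best or s > best[len(selected)][0]:
            best[len(selected)] = (s, list(selected))
        # Floating step: drop features while the smaller subset strictly
        # beats the best subset of that size seen so far.
        while len(selected) > 2:
            drop = max(selected, key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            s_reduced = score(reduced)
            if s_reduced > best[len(reduced)][0]:
                selected = reduced
                best[len(reduced)] = (s_reduced, list(reduced))
            else:
                break
    # Return the best-scoring subset over all sizes visited.
    return max(best.values(), key=lambda t: t[0])[1]
```

A typical use would call `filter_stage` first to shrink the candidate list, then pass the surviving feature names to `sffs` together with a classifier-based `score`. Continuous metrics would need to be discretized before `gain_ratio` is applied.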

CLC number: TP311.5

WANG Huan, ZHANG Liping, YAN Sheng, LIU Dongsheng. Feature selection model for harmfulness prediction of clone code[J]. Journal of Computer Applications, 2017, 37(4): 1135-1142.


