Selection of training data for cross-project defect prediction

doi:10.11772/j.issn.1001-9081.2016.11.3165

Journal of Computer Applications, 2016, Vol. 36, Issue (11): 3165-3169. DOI: 10.11772/j.issn.1001-9081.2016.11.3165

WANG Xing¹, HE Peng^1,2, CHEN Dan¹, ZENG Cheng^1,2

1. School of Computer Science and Information Engineering, Hubei University, Wuhan, Hubei 430062, China;
2. Hubei Province Engineering Technology Research Center for Education Informationization (Hubei University), Wuhan, Hubei 430062, China

Received: 2016-04-13 Revised: 2016-06-23 Online: 2016-11-10 Published: 2016-11-12
Corresponding author: HE Peng
About the authors: WANG Xing (born 1991), female, from Wuhan, Hubei, M.S. candidate; research interests: software metrics, complex networks. HE Peng (born 1988), male, from Yichun, Jiangxi, lecturer, Ph.D., CCF member; research interests: software metrics, software maintenance, complex networks. CHEN Dan (born 1992), female, from Huanggang, Hubei, M.S. candidate; research interests: software metrics, software maintenance. ZENG Cheng (born 1976), male, from Wuhan, Hubei, associate professor, Ph.D., CCF member; research interests: services computing, cloud computing.
Supported by:
This work is partially supported by the National Basic Research Program (973 Program) of China (2014CB340401), the National Natural Science Foundation of China (61273216, 61272111, 61202048, 61202032), the Special Knowledge Innovation Project in Hubei Province (grant number given as 2016CFB309 in the Chinese text and as 2013AAA020 in the English text), and the Youth Chenguang Project of Science and Technology of Wuhan City (2014070404010232).

Abstract

Abstract: Cross-Project Defect Prediction (CPDP) uses defect data from other projects to predict defects in a target project, offering a new perspective on the limited training data that constrains traditional defect prediction. Because the quality of the training data directly affects the performance of a CPDP model, the data most similar to the target project should be preferred for training. Using 34 public datasets from the PROMISE repository, this paper analyzes, from the standpoint of training data selection, how four typical similarity measures affect cross-project prediction results and how the measures differ from one another. The results show that different similarity measures select training data of different quality; cosine similarity and the correlation coefficient perform better overall, with a maximum improvement rate of 6.7%. In addition, with respect to the defect rate of the target project, cosine similarity appears more suitable for projects whose defect rate is above 0.25.

Key words: software quality assurance, defect prediction, Cross-Project Defect Prediction (CPDP), similarity measure, data selection
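The abstract names cosine similarity and the correlation coefficient as the better-performing measures but does not describe how projects are represented or how the selection is carried out. The following is only a minimal sketch of similarity-based training data selection, assuming each project is summarized by a hypothetical profile vector (for example, per-metric means); the function names, the toy profiles, and the top-k rule are illustrative assumptions, not the authors' procedure.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pearson_correlation(u, v):
    """Pearson correlation, computed as the cosine similarity of the mean-centred vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    centred_u = [a - mean_u for a in u]
    centred_v = [b - mean_v for b in v]
    return cosine_similarity(centred_u, centred_v)


def select_training_projects(target, candidates, measure, k=1):
    """Return the names of the k candidate projects most similar to the target.

    `candidates` maps a project name to its profile vector (assumed representation).
    """
    ranked = sorted(
        candidates,
        key=lambda name: measure(target, candidates[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical toy profiles, not taken from the PROMISE datasets.
target = [1.0, 2.0, 3.0]
candidates = {
    "proj_a": [2.0, 4.0, 6.0],  # same direction as the target
    "proj_b": [3.0, 2.0, 1.0],  # reversed trend
    "proj_c": [1.0, 1.0, 1.0],  # flat profile
}

top_by_cosine = select_training_projects(target, candidates, cosine_similarity, k=2)
top_by_pearson = select_training_projects(target, candidates, pearson_correlation, k=2)
```

The two measures differ in that the correlation coefficient first removes each vector's mean, so a flat profile such as `proj_c` scores zero under correlation while still scoring high under cosine similarity.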

CLC number:

TP310

WANG Xing, HE Peng, CHEN Dan, ZENG Cheng. Selection of training data for cross-project defect prediction[J]. Journal of Computer Applications, 2016, 36(11): 3165-3169.

