Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (11): 3165-3169. DOI: 10.11772/j.issn.1001-9081.2016.11.3165

• Computer Software Technology •

Selection of training data for cross-project defect prediction

WANG Xing1, HE Peng1,2, CHEN Dan1, ZENG Cheng1,2

  1. School of Computer Science and Information Engineering, Hubei University, Wuhan, Hubei 430062, China;
  2. Hubei Province Engineering Technology Research Center for Education Informationization (Hubei University), Wuhan, Hubei 430062, China
  • Received: 2016-04-13 Revised: 2016-06-23 Online: 2016-11-10 Published: 2016-11-12
  • Corresponding author: HE Peng
  • About the authors: WANG Xing (1991-), female, born in Wuhan, Hubei, M.S. candidate; main research interests: software metrics, complex networks. HE Peng (1988-), male, born in Yichun, Jiangxi, lecturer, Ph.D., CCF member; main research interests: software metrics, software maintenance, complex networks. CHEN Dan (1992-), female, born in Huanggang, Hubei, M.S. candidate; main research interests: software metrics, software maintenance. ZENG Cheng (1976-), male, born in Wuhan, Hubei, associate professor, Ph.D., CCF member; main research interests: service computing, cloud computing.
  • Supported by:
    This work is partially supported by the National Basic Research Program (973 Program) of China (2014CB340401), the National Natural Science Foundation of China (61273216, 61272111, 61202048, 61202032), the Special Knowledge Innovation Project in Hubei Province (grant number given as 2016CFB309 in the Chinese-language record and as 2013AAA020 in the English-language record), and the Youth Chenguang Project of Science and Technology of Wuhan City (2014070404010232).

Abstract: Cross-Project Defect Prediction (CPDP) uses defect data from other projects to predict defects in a target project, offering a new perspective on the limited-training-data problem faced by traditional defect prediction methods. Because the quality of the training data directly affects the performance of a cross-project defect prediction model, data more similar to the target project should be preferred for training. Using 34 public datasets from the PROMISE repository, this paper analyzes, from the standpoint of training data selection, how four typical similarity measures affect cross-project prediction results and how the measures differ from one another. The results show that training data selected with different similarity measures differ in quality; cosine similarity and the correlation coefficient perform better overall, with a maximum improvement of 6.7%. In addition, with respect to the defect rate of the target project, cosine similarity appears to be better suited to projects whose defect rate exceeds 0.25.

Key words: software quality assurance, defect prediction, Cross-Project Defect Prediction (CPDP), similarity measure, data selection
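
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each project is summarized as a numeric feature vector (for example, averaged static code metrics), covers only the two measures the abstract names (cosine similarity and the correlation coefficient), and assumes a simple top-k selection rule. The function names, the parameter `k`, and the example vectors are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def pearson_correlation(u, v):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    centered_u = [a - mean_u for a in u]
    centered_v = [b - mean_v for b in v]
    return cosine_similarity(centered_u, centered_v)


def select_training_projects(target, candidates, measure, k):
    """Return the names of the k candidates most similar to the target.

    `candidates` maps a project name to its feature vector; `measure` is a
    function such as cosine_similarity or pearson_correlation.
    """
    ranked = sorted(
        candidates.items(),
        key=lambda item: measure(target, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical feature vectors, for illustration only.
    target = [1.0, 2.0, 3.0]
    candidates = {
        "A": [2.0, 4.0, 6.0],
        "B": [3.0, 2.0, 1.0],
        "C": [1.0, 2.0, 2.0],
    }
    print(select_training_projects(target, candidates, cosine_similarity, 2))
    print(select_training_projects(target, candidates, pearson_correlation, 2))
```

Passing the measure as a parameter keeps the ranking logic identical across measures, so any difference in the selected training data comes from the similarity measure alone, which is the comparison the abstract describes.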
