Journal of Computer Applications, 2022, Vol. 42, Issue (5): 1554-1562. DOI: 10.11772/j.issn.1001-9081.2021050867

• Computer software technology •

Cross-project defect prediction method based on feature selection and TrAdaBoost

Li LI, Kexin SHI, Zhenkang REN

  1. College of Information and Computer Engineering, Northeast Forestry University, Harbin, Heilongjiang 150040, China
  • Received: 2021-05-25 Revised: 2022-01-24 Accepted: 2022-02-18 Online: 2022-03-08 Published: 2022-05-10
  • Contact: Li LI
  • About the authors: LI Li, born in 1977 in Mengzhou, Henan, Ph. D., associate professor, CCF member. Her research interests include advanced software engineering, blockchain, swarm intelligence optimization, and large-scale distributed computing. lli@nefu.edu.cn
    SHI Kexin, born in 1997 in Liaocheng, Shandong, M. S. candidate. Her research interests include software defect prediction.
    REN Zhenkang, born in 1996 in Qingdao, Shandong, M. S. candidate. His research interests include software defect prediction.

Abstract:

Cross-project software defect prediction addresses the shortage of training data in the project being predicted. However, the data distributions of the source project and the target project usually differ considerably, which degrades prediction performance. To address this problem, a Cross-Project Defect Prediction method based on Feature Selection and TrAdaBoost (CPDP-FSTr) was proposed. First, in the feature selection stage, Kernel Principal Component Analysis (KPCA) was used to remove redundant data from the source project. Then, based on the attribute feature distributions of the source and target projects, the candidate source project data whose distribution was closest to that of the target project were selected by distance. Finally, in the instance transfer stage, a TrAdaBoost method improved with an evaluation factor was used to find the source project instances whose distribution was similar to that of the few labeled instances in the target project, and a defect prediction model was built. With F1 as the evaluation metric, compared with cross-project software defect prediction using Feature Clustering and TrAdaBoost (FeCTrA) and Cross-project software defect prediction based on Multiple Kernel Ensemble Learning (CMKEL), CPDP-FSTr improved prediction performance by 5.84% and 105.42% respectively on the AEEEM dataset and by 5.25% and 85.97% respectively on the NASA dataset, and its two-stage feature selection outperformed a single feature selection stage. Experimental results show that CPDP-FSTr achieves good prediction performance when the source project feature selection proportion is 60% and the proportion of labeled target project instances is 20%.
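The reported gains are measured in F1, which is bounded by 1, so a figure such as 105.42% can only be a relative improvement over the baseline score. The following minimal sketch shows how F1 and such a relative gain could be computed. The function names and the confusion counts are hypothetical illustrations and are not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Hypothetical confusion counts, chosen only to make the arithmetic easy.
baseline_f1 = f1_score(tp=20, fp=40, fn=40)  # precision = recall = 1/3
proposed_f1 = f1_score(tp=40, fp=20, fn=20)  # precision = recall = 2/3
gain = relative_improvement(proposed_f1, baseline_f1)
```

With these made-up counts, the second F1 is twice the first, so the relative gain is about 100% even though the absolute difference is only one third.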

Key words: cross-project defect prediction, feature selection, Kernel Principal Component Analysis (KPCA), instance transfer, TrAdaBoost
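To make the instance transfer stage more concrete, the sketch below shows one weight-update round of the original TrAdaBoost algorithm. It is not the evaluation-factor variant proposed in the paper, whose details are not given in the abstract. The function name `tradaboost_update`, its parameters, and the clamping of the error rate are assumptions made for illustration.

```python
import math


def tradaboost_update(w_src, w_tgt, miss_src, miss_tgt, n_rounds):
    """One weight-update round of the original TrAdaBoost (illustrative).

    w_src, w_tgt: current weights of source / labeled target instances.
    miss_src, miss_tgt: 1 if the weak learner misclassified the instance, else 0.
    n_rounds: total number of boosting rounds.
    """
    # Weighted error is measured on the labeled target instances only.
    eps = sum(w * m for w, m in zip(w_tgt, miss_tgt)) / sum(w_tgt)
    # Assumption: clamp eps as a practical guard so beta_t stays in (0, 1).
    eps = min(max(eps, 1e-10), 0.499)
    beta_t = eps / (1.0 - eps)
    # Fixed decay factor for source instances, based on the source set size.
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(len(w_src)) / n_rounds))
    # Misclassified source instances lose weight (likely dissimilar to target).
    new_src = [w * beta ** m for w, m in zip(w_src, miss_src)]
    # Misclassified target instances gain weight, as in AdaBoost.
    new_tgt = [w * beta_t ** (-m) for w, m in zip(w_tgt, miss_tgt)]
    return new_src, new_tgt, beta_t


# Hypothetical toy round: 4 source and 4 labeled target instances.
new_src, new_tgt, beta_t = tradaboost_update(
    w_src=[1.0, 1.0, 1.0, 1.0], w_tgt=[1.0, 1.0, 1.0, 1.0],
    miss_src=[0, 1, 0, 1], miss_tgt=[0, 0, 0, 1], n_rounds=10,
)
```

Over many rounds, source instances that keep being misclassified fade out, which is how the algorithm concentrates on source data resembling the labeled target instances.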

CLC number: