Journal of Computer Applications ›› 2018, Vol. 38 ›› Issue (9): 2637-2643. DOI: 10.11772/j.issn.1001-9081.2018020507

• Computer software technology •

Software defect number prediction method based on data oversampling and ensemble learning

JIAN Yiheng¹, YU Xiao²

  1. School of Information and Electronics, Beijing Institute of Technology, Beijing 102488, China;
  2. School of Computer Science, Wuhan University, Wuhan, Hubei 430072, China
  • Received: 2018-03-12  Revised: 2018-05-17  Online: 2018-09-10  Published: 2018-09-06
  • Corresponding author: YU Xiao
  • About the authors: JIAN Yiheng (born 1998), male, from Wuhan, Hubei; research interests: software engineering and data mining. YU Xiao (born 1994), male, from Hanchuan, Hubei; Ph.D. candidate; research interests: software engineering and data mining.

Abstract: Predicting the number of defects in software modules helps testers focus on the modules with more defects and thus allocate limited testing resources reasonably. To address the imbalance of software defect datasets, a method for predicting the number of defects based on data oversampling and ensemble learning, abbreviated as SMOTENDEL, is proposed. First, the original software defect dataset is oversampled n times to obtain n balanced datasets. Then, a regression algorithm is used to train n individual defect-number prediction models, one on each balanced dataset. Finally, the n individual models are combined into an ensemble prediction model, which is used to predict the number of defects in new software modules. The experimental results show that SMOTENDEL performs considerably better than the original prediction method: with Decision Tree Regression (DTR), Bayesian Ridge Regression (BRR), and Linear Regression (LR) as the individual prediction model, the improvement rates are 7.68%, 3.31%, and 3.38%, respectively.

Key words: software defect prediction, data imbalance, oversampling, ensemble learning

CLC number:
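
The three steps outlined in the abstract (oversample n times, train n individual regressors, combine them) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say how synthetic samples are generated or how the individual models are combined, so the sketch assumes SMOTE-style interpolation between two randomly chosen defective modules, a "balanced" dataset meaning equal numbers of defective and defect-free modules, and simple averaging of the individual predictions. A one-feature least-squares line stands in for the DTR, BRR, and LR learners, and all function names and the toy data are hypothetical.

```python
import random
from statistics import mean


def oversample(rows, rng):
    """Return rows plus synthetic defective modules until classes are equal.

    rows: list of (metric, defect_count) pairs. Assumption: synthetic samples
    are linear interpolations between two random defective modules, a
    simplification of SMOTE's nearest-neighbour choice. Needs >= 2 defective rows.
    """
    defective = [r for r in rows if r[1] > 0]
    clean = [r for r in rows if r[1] == 0]
    synthetic = []
    while len(defective) + len(synthetic) < len(clean):
        (xa, ya), (xb, yb) = rng.sample(defective, 2)
        t = rng.random()
        # Interpolate both the metric and the defect count.
        synthetic.append((xa + t * (xb - xa), ya + t * (yb - ya)))
    return rows + synthetic


def fit_line(rows):
    """Least-squares line on one metric (stand-in for the paper's regressors)."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def train_ensemble(rows, n, seed=0):
    """Oversample n times and train one individual model per balanced dataset."""
    rng = random.Random(seed)
    return [fit_line(oversample(rows, rng)) for _ in range(n)]


def predict(models, x):
    """Combine by averaging (assumed rule); clip at 0, since counts are non-negative."""
    return max(0.0, mean([m(x) for m in models]))


# Hypothetical toy data: (lines of code, number of defects) per module.
modules = [(10.0, 0.0), (20.0, 0.0), (30.0, 0.0), (40.0, 0.0),
           (50.0, 0.0), (60.0, 0.0), (200.0, 2.0), (400.0, 5.0)]
models = train_ensemble(modules, n=5)
print(predict(models, 300.0))
```

Because each individual model sees a different random set of synthetic samples, averaging their outputs reduces the variance that any single oversampling run would introduce, which is the usual motivation for pairing resampling with an ensemble.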