Software defect number prediction method based on data oversampling and ensemble learning

doi:10.11772/j.issn.1001-9081.2018020507

Abstract

Predicting the number of defects in software modules helps testers focus on the modules with the most defects and thus allocate limited testing resources reasonably. To address the imbalance of software defect datasets, a method based on oversampling and ensemble learning (abbreviated as SMOTENDEL) for predicting the number of defects is proposed. Firstly, the original software defect dataset is oversampled n times to obtain n balanced datasets. Then, n individual models for predicting the number of defects are trained on the n balanced datasets using a regression algorithm. Finally, the n individual models are combined into an ensemble prediction model, which is used to predict the number of defects in a new software module. The experimental results show that SMOTENDEL performs better than the original prediction method: with Decision Tree Regression (DTR), Bayesian Ridge Regression (BRR) and Linear Regression (LR) as the individual prediction model, the improvement is 7.68%, 3.31% and 3.38%, respectively.

Key words: software defect prediction, data imbalance, oversampling, ensemble learning
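The three steps named in the abstract (oversample n times, train n regressors, combine them) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the oversampling details or the combination rule, so the sketch assumes a SMOTE-style interpolation between defective modules, a one-feature least-squares line as the individual model, and simple averaging clipped at zero as the combination. The function names, the single `metric` feature, and the toy data are all hypothetical.

```python
import random


def oversample(modules, k, rng):
    """Add synthetic defective modules until they match the defect-free count.

    modules: list of (metric, defects) pairs; needs at least two defective modules.
    Assumption: SMOTE-style interpolation of both the metric and the defect count.
    """
    defective = [m for m in modules if m[1] > 0]
    clean = [m for m in modules if m[1] == 0]
    synthetic = []
    while len(defective) + len(synthetic) < len(clean):
        i = rng.randrange(len(defective))
        x, y = defective[i]
        # k nearest defective neighbours of the chosen module (excluding itself)
        others = defective[:i] + defective[i + 1:]
        neighbours = sorted(others, key=lambda m: abs(m[0] - x))[:k]
        nx, ny = rng.choice(neighbours)
        gap = rng.random()  # position of the synthetic point between the two modules
        synthetic.append((x + gap * (nx - x), y + gap * (ny - y)))
    return modules + synthetic


def fit_line(modules):
    """Least-squares line on one feature, standing in for LR as the individual model."""
    n = len(modules)
    mean_x = sum(x for x, _ in modules) / n
    mean_y = sum(y for _, y in modules) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in modules)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in modules)
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def train_ensemble(modules, n_models=10, k=2, seed=0):
    """Train one model per oversampled dataset and average their predictions."""
    rng = random.Random(seed)
    models = [fit_line(oversample(modules, k, rng)) for _ in range(n_models)]

    def predict(x):
        # Assumed combination rule: mean of the individual predictions,
        # clipped at zero because a defect count cannot be negative.
        return max(0.0, sum(m(x) for m in models) / len(models))

    return predict


# Toy data: (size metric, number of defects); most modules are defect-free.
modules = [(50, 0), (80, 0), (100, 0), (120, 0), (150, 0), (200, 0),
           (400, 2), (600, 4), (800, 6)]
predict = train_ensemble(modules)
print(predict(100), predict(900))
```

Because each of the n training sets contains different synthetic modules, the individual models differ slightly, and averaging them smooths out the randomness introduced by oversampling. Swapping `fit_line` for a decision tree or Bayesian ridge regressor would correspond to the other two individual models mentioned in the abstract.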

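Abstract (Chinese version, translated): Predicting the number of software defects helps testers pay more attention to modules with many defects and thus allocate limited testing resources reasonably. To address the imbalance of software defect datasets, a software defect number prediction method based on data oversampling and ensemble learning, SMOTENDEL, is proposed. First, the original software defect dataset is oversampled n times to obtain n balanced datasets; then, n individual software defect number prediction models are trained on these n balanced datasets using a regression algorithm; finally, the n individual models are combined into an ensemble software defect number prediction model, which is used to predict the number of defects in new software modules. Experimental results show that SMOTENDEL considerably outperforms the original prediction method: with Decision Tree Regression (DTR), Bayesian Ridge Regression (BRR) and Linear Regression (LR) as the individual prediction model, the improvement rates are 7.68%, 3.31% and 3.38%, respectively.

Key words (Chinese version, translated): software defect prediction, data imbalance, oversampling, ensemble learning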

CLC Number: TP181

JIAN Yiheng, YU Xiao. Software defect number prediction method based on data oversampling and ensemble learning[J]. Journal of Computer Applications, 2018, 38(9): 2637-2643.

简艺恒, 余啸. 基于数据过采样和集成学习的软件缺陷数目预测方法[J]. 计算机应用, 2018, 38(9): 2637-2643.
