Journal of Computer Applications

• Artificial Intelligence and Simulation •

Effect of regression algorithms on the performance of models for predicting the number of software defects

Fu Zhongwang1, Yu Xiao2, Xiao Rong1, Gu Yi3

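  1. School of Computer Science and Information Engineering, Hubei University
  2. State Key Laboratory of Software Engineering, Wuhan University
  3. State Key Laboratory of Software Engineering, Wuhan University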
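  • Received: 2017-08-07 Revised: 2017-09-22 Online: 2017-09-22 Published: 2017-10-18
  • Corresponding author: Yu Xiao
  • About the authors: Fu Zhongwang (born 1993), male, from Liaocheng, Shandong, M.S. candidate; research interests: data mining and software engineering. Yu Xiao (born 1994), male, from Hanchuan, Hubei, Ph.D. candidate; research interests: software engineering and deep learning. Xiao Rong (born 1980), female, from Yichang, Hubei, lecturer; research interests: software engineering. Gu Yi (born 1996), male, from Dali, Yunnan, undergraduate; research interests: machine learning.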

Impact study of regression algorithms on the performance of the model for predicting the number of defects

  • Received: 2017-08-07 Revised: 2017-09-22 Online: 2017-09-22 Published: 2017-10-18
  • About the authors: Fu Zhongwang, born in 1993, M.S. candidate. His research interests include data mining and software engineering. Yu Xiao, born in 1994, Ph.D. candidate. His research interests include software engineering and deep learning. Xiao Rong, born in 1980, lecturer. Her research interests include software engineering. Gu Yi, born in 1996, undergraduate. His research interests include machine learning.

Abstract:

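Existing studies evaluate models for predicting the number of software defects without accounting for the data imbalance in defect datasets, and therefore rely on evaluation measures designed for general regression models that are unsuitable for this task. To address this problem, Fault-Percentile-Average (FPA) is adopted as the evaluation measure, and the extent to which different regression algorithms affect the performance of defect-count prediction models is examined. Using six open-source datasets provided by PROMISE, the effect of ten regression algorithms on the prediction results, and the differences among the algorithms, are analyzed. The results show that models built with different regression algorithms differ in prediction performance, with the Gradient Boosting Regression and Bayesian Ridge Regression algorithms performing better.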

Keywords: prediction of the number of software defects, data imbalance, regression algorithm

Abstract:

Existing studies do not consider the imbalanced data distribution in defect datasets and employ improper performance measures, intended for general regression models, when evaluating models for predicting the number of defects. To address this issue, the impact of different regression algorithms on such models was explored using Fault-Percentile-Average (FPA) as the performance measure. Experiments were conducted on six datasets from the PROMISE repository to analyze the impact of ten regression algorithms on the models and the differences among the algorithms. The results show that models built with different regression algorithms yield different prediction results, and that the Gradient Boosting Regression and Bayesian Ridge Regression algorithms achieve better performance overall.

Key words: prediction of the number of defects, imbalanced data distribution, regression algorithm
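
The abstract names Fault-Percentile-Average (FPA) as the performance measure but does not spell it out. The following is a minimal sketch in Python, assuming the definition commonly used in the defect-prediction literature: modules are ranked by predicted defect count, and FPA is the average, over m = 1..k, of the share of actual defects contained in the top m modules. The function name `fault_percentile_average` and the sample data are hypothetical illustrations, not taken from the paper.

```python
from typing import Sequence


def fault_percentile_average(actual: Sequence[float],
                             predicted: Sequence[float]) -> float:
    """Assumed FPA definition: mean over m = 1..k of the fraction of all
    actual defects found in the m modules with the highest predictions."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    k = len(actual)
    total = sum(actual)
    if k == 0 or total == 0:
        raise ValueError("need at least one module and one actual defect")

    # Rank module indices from highest to lowest predicted defect count.
    order = sorted(range(k), key=lambda i: predicted[i], reverse=True)

    found = 0.0      # actual defects covered by the top-m modules so far
    share_sum = 0.0  # running sum of the top-m shares for m = 1..k
    for i in order:
        found += actual[i]
        share_sum += found / total
    return share_sum / k


if __name__ == "__main__":
    # Hypothetical, imbalanced data: most modules have zero defects.
    actual = [0, 0, 1, 0, 5, 2]
    perfect = list(actual)           # ranking that matches the truth
    worst = [-a for a in actual]     # ranking that inverts the truth
    print(fault_percentile_average(actual, perfect))
    print(fault_percentile_average(actual, worst))
```

Because only the ranking of the predictions enters the computation, a measure of this kind is not dominated by the many zero-defect modules, which is consistent with the abstract's motivation for preferring it on imbalanced data.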

CLC number: