Journal of Computer Applications, 2020, Vol. 40, Issue (11): 3273-3279. DOI: 10.11772/j.issn.1001-9081.2020040464

• Computer software technology •

Data preprocessing method in software defect prediction

PAN Chunxia, YANG Qiuhui, TAN Wukun, DENG Huixin, WU Jia   

  1. College of Computer Science, Sichuan University, Chengdu Sichuan 610065, China
  • Received: 2020-04-14  Revised: 2020-07-01  Online: 2020-11-10  Published: 2020-07-20

潘春霞, 杨秋辉, 谭武坤, 邓惠心, 伍佳   

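  • Corresponding author: YANG Qiuhui (born 1970), female, from Qingdao, Shandong; associate professor, Ph.D., CCF member. Main research interests: software engineering and software project management. Email: yangqiuhui@scu.edu.cn
  • About the authors: PAN Chunxia (born 1995), female, from Neijiang, Sichuan; M.S. candidate. Main research interests: software quality assurance and testing. TAN Wukun (born 1990), male, from Bozhou, Anhui; M.S. Main research interests: software testing and software quality assurance. DENG Huixin (born 1996), female, from Nanchong, Sichuan; M.S. candidate. Main research interests: software quality assurance and testing. WU Jia (born 1996), female, from Chengdu, Sichuan; M.S. candidate. Main research interest: software defect localization.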

Abstract: Software defect prediction is an active research topic in software quality assurance, and the quality of a defect prediction model depends closely on its training data. Datasets used for defect prediction suffer mainly from two problems: feature selection and class imbalance. To address feature selection, common software development process features were combined with newly proposed extended process features, and a feature selection algorithm based on clustering analysis was then applied. To address class imbalance, an improved Borderline-SMOTE (Borderline Synthetic Minority Oversampling Technique) method was proposed so that the numbers of positive and negative samples in the training dataset are relatively balanced and the synthesized samples are more consistent with the characteristics of real samples. Experiments were conducted on open source datasets from projects such as Bugzilla and JUnit. The results show that the feature selection algorithm reduces model training time by 57.94% while maintaining a high F-measure. Compared with the defect prediction model trained on samples processed by the original method, the model trained with the improved Borderline-SMOTE method increases Precision, Recall, F-measure, and AUC (Area Under the Curve) by 2.36, 1.8, 2.13, and 2.36 percentage points on average, respectively. The defect prediction model that uses the extended process features improves F-measure by 3.79% on average over the model without them. Compared with models obtained by methods from the literature, the model obtained by the proposed method improves F-measure by 15.79% on average. The experimental results show that the proposed method can effectively improve the quality of defect prediction models.
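
For readers unfamiliar with the baseline, the following is a minimal Python sketch of the standard Borderline-SMOTE idea, not the improved variant proposed in the paper, whose details the abstract does not give. It oversamples only minority samples whose neighbourhood is dominated, but not fully occupied, by majority samples. The function name `borderline_smote`, its parameters (`m`, `k`, `n_new`, `seed`), and the toy data are assumptions made for illustration.

```python
import math
import random


def borderline_smote(minority, majority, m=5, k=5, n_new=1, seed=0):
    """Return synthetic minority samples built only from borderline points."""
    rng = random.Random(seed)
    labelled = [(x, 1) for x in minority] + [(x, 0) for x in majority]
    synthetic = []
    for i, p in enumerate(minority):
        # m nearest neighbours of p among all samples, excluding p itself
        # (p sits at index i of `labelled` because minority comes first).
        others = [xy for j, xy in enumerate(labelled) if j != i]
        near = sorted(others, key=lambda xy: math.dist(p, xy[0]))[:m]
        n_major = sum(1 for _, label in near if label == 0)
        # Borderline ("danger"): at least half, but not all, neighbours are
        # majority. All-majority neighbourhoods are treated as noise and
        # mostly-minority neighbourhoods as safe; both are skipped.
        if not (len(near) / 2 <= n_major < len(near)):
            continue
        # k nearest minority neighbours of p (hypothetical default k=5).
        peers = sorted((x for j, x in enumerate(minority) if j != i),
                       key=lambda x: math.dist(p, x))[:k]
        for q in rng.sample(peers, min(n_new, len(peers))):
            gap = rng.random()  # position along the segment from p to q
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(p, q)))
    return synthetic


# Toy data (illustrative only): one safe minority cluster near the origin
# and two minority points sitting next to majority samples.
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.0, 0.9)]
majority = [(1.1, 1.0), (1.0, 1.1), (3.0, 3.0), (3.1, 3.0)]
print(borderline_smote(minority, majority, m=3, k=1))
```

Each synthetic sample lies on the line segment between a borderline minority sample and one of its minority neighbours, which is why the choice of neighbours affects how closely synthesized samples resemble real ones.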

Key words: defect prediction, data preprocessing, development process feature, feature selection, class imbalance processing
