Construction of software defect prediction dataset with explainability

doi:10.11772/j.issn.1001-9081.2025080987

Journal of Computer Applications

Received: 2025-08-27  Revised: 2025-11-06  Online: 2025-11-17  Published: 2025-11-17
Contact: Zhe Cui

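Yun BIAN¹, Haiquan WANG², Yi CHEN², Zhe CUI²

1. Chengdu Institute of Computer Applications
2. Chengdu Institute of Computer Applications, Chinese Academy of Sciences

Corresponding author: Zhe CUI
Supported by:
"Light of West China" Talent Training Program of the Chinese Academy of Sciences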

Abstract

Abstract: Software defect prediction results often lack explainable information such as defect localization, defect explanation, and repair suggestions, which makes them difficult to apply in actual development. To address this problem, a method for constructing an explainable software defect prediction dataset based on context engineering and Large Language Models (LLMs) is proposed, and HandPick, the first accompanying multi-programming-language dataset for software defect prediction, is released. First, the TriCogVuln-LLM method was designed on the basis of software engineering principles and prior defect knowledge to guide LLMs to generate, in sequence, function descriptions, Common Weakness Enumeration (CWE) defect predictions, and repair suggestions. Next, a consensus voting mechanism was designed to build an optimal pool of generative models for defect prediction, further improving the quality and diversity of the generated data. Finally, the HandPick dataset, covering code in four mainstream programming languages, was constructed through consensus-driven automated data generation. Downstream task validation shows that, compared with baseline models, the Qwen2.5-14B-HandPick model fine-tuned on the HandPick dataset improves precision, recall, F1 score, and accuracy on an independent public test set by 19.29, 21.26, 24.11, and 18.30 percentage points, respectively. These results indicate that the dataset substantially improves the model's defect identification and analysis capabilities and is expected to help developers repair software defects more effectively.

Key words: explainability, software defect prediction, context engineering, large language models, common weakness enumeration
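The abstract does not spell out how the consensus voting mechanism works, so the following is only a minimal sketch of one plausible form: a majority vote over the CWE labels predicted by several generator models, where a sample is kept only if more than half of the models agree. The function name `consensus_vote`, the `min_agreement` threshold, and the model names and labels in the example are all illustrative assumptions, not details taken from the paper.

```python
from collections import Counter
from typing import Optional


def consensus_vote(predictions: dict[str, str], min_agreement: float = 0.5) -> Optional[str]:
    """Return the CWE label backed by more than `min_agreement` of the models, else None.

    `predictions` maps a (hypothetical) model name to the CWE label it predicted
    for a single code sample.
    """
    if not predictions:
        return None
    counts = Counter(predictions.values())
    label, votes = counts.most_common(1)[0]
    # Strict inequality: an even split between labels is treated as no consensus.
    if votes / len(predictions) > min_agreement:
        return label
    return None


# Hypothetical outputs from three generator models for one code sample.
sample_predictions = {
    "model_a": "CWE-787",
    "model_b": "CWE-787",
    "model_c": "CWE-125",
}
print(consensus_vote(sample_predictions))  # CWE-787
```

Under this assumed scheme, samples for which `consensus_vote` returns `None` would be dropped or sent for further review rather than added to the dataset; the paper's actual selection rule may differ.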

CLC Number: TP391.1

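Yun BIAN, Haiquan WANG, Yi CHEN, Zhe CUI. Construction of software defect prediction dataset with explainability [J]. Journal of Computer Applications, DOI: 10.11772/j.issn.1001-9081.2025080987.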
