Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (3): 725-731.DOI: 10.11772/j.issn.1001-9081.2024111598

• Frontier research and typical applications of large models •

Efficient fine-tuning method of large language models for test case generation

Peng CAO1, Guangqi WEN1, Jinzhu YANG1, Gang CHEN2, Xinyi LIU2, Xuechun JI3

  1. School of Computer Science and Engineering, Northeastern University, Shenyang, Liaoning 110169, China
    2. State Grid Information and Telecommunication Company Limited, Beijing 102200, China
    3. State Grid Electric Power Research Institute Company Limited, Nanjing, Jiangsu 211106, China
  • Received: 2024-11-11 Revised: 2025-01-16 Accepted: 2025-01-17 Online: 2025-01-21 Published: 2025-03-10
  • Contact: Peng CAO
  • About author:WEN Guangqi, born in 1998, Ph. D. candidate. His research interests include machine learning and smart healthcare.
    YANG Jinzhu, born in 1979, Ph. D., professor, Ph. D. supervisor, CCF senior member. His research interests include artificial intelligence, image processing and analysis, and medical image reconstruction and optimization.
    CHEN Gang, born in 1985, engineer. His research interests include software architecture and software project management.
    LIU Xinyi, born in 1982, M. S., senior engineer. Her research interests include machine learning.
    JI Xuechun, born in 1977, M. S., professor-level senior engineer. His research interests include artificial intelligence and power system automation.
  • Supported by:
    State Grid Corporation Technology Project(5108-202340436A-3-2-ZN)


Abstract:

Data-driven automated unit test case generation suffers from low coverage and poor readability, making it difficult to meet the growing demand for testing. Recently, Large Language Models (LLMs) have shown great potential in code generation tasks. However, because code data differ in functional style and coding style, LLMs face two challenges: catastrophic forgetting and resource constraints. To address these problems, the idea of transferring functional style and coding style simultaneously through fine-tuning was proposed, and an efficient fine-tuning training method was developed for LLMs to generate unit test cases. Firstly, widely used instruction datasets were adopted to align the LLM with instructions, the instruction sets were partitioned by task type, and the weight increments carrying task-specific features were extracted and stored. Secondly, an adaptive style extraction module was designed to cope with diverse coding styles, incorporating noise-resistant learning and coding style backtracking learning. Finally, the functional style and coding style increments were jointly trained on the target domain, realizing efficient adaptation and fine-tuning on target domains with limited resources. Experimental results of test case generation on the SF110 Corpus of Classes dataset show that the proposed method outperforms all comparison methods. Compared with the mainstream code generation LLMs Codex, Code Llama and DeepSeek-Coder, the proposed method increases the compilation rate by 0.8%, 43.5% and 33.8%, the branch coverage by 3.1%, 1.0% and 17.2%, and the line coverage by 4.1%, 6.5% and 15.5%, respectively, verifying the superiority of the proposed method in code generation tasks.
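To make the weight-increment idea concrete, below is a minimal illustrative sketch in Python/PyTorch of extracting a task-specific weight increment (ΔW = W_tuned − W_base), storing one increment per task split, and jointly applying a functional-style and a coding-style increment to a base model with scaling coefficients. The helper names (extract_increment, apply_increments) and the scaling scheme are assumptions for illustration only; the paper's adaptive style extraction module, noise-resistant learning, and coding style backtracking learning are not reproduced here.

    import torch
    import torch.nn as nn


    def extract_increment(base: nn.Module, tuned: nn.Module) -> dict:
        """Store the per-parameter weight increment Delta W = W_tuned - W_base."""
        base_sd, tuned_sd = base.state_dict(), tuned.state_dict()
        return {k: (tuned_sd[k] - base_sd[k]).detach() for k in base_sd}


    def apply_increments(base: nn.Module, deltas: list, scales: list) -> nn.Module:
        """Merge stored increments into the base weights:
        W = W_base + sum_i scale_i * Delta W_i."""
        merged = {k: v.detach().clone() for k, v in base.state_dict().items()}
        for delta, scale in zip(deltas, scales):
            for k, d in delta.items():
                merged[k] += scale * d
        base.load_state_dict(merged)
        return base


    if __name__ == "__main__":
        torch.manual_seed(0)
        # Toy stand-ins for an LLM's weight matrices.
        base = nn.Linear(8, 8)
        func_tuned = nn.Linear(8, 8)   # imagine: fine-tuned on a functional-style task split
        style_tuned = nn.Linear(8, 8)  # imagine: fine-tuned on a coding-style corpus

        delta_func = extract_increment(base, func_tuned)    # task-specific increment
        delta_style = extract_increment(base, style_tuned)  # coding-style increment

        # Joint adaptation on the target domain: combine both increments.
        target_model = apply_increments(base, [delta_func, delta_style], [1.0, 0.5])

Storing full-parameter deltas, as above, is the simplest variant; in a practical setting, low-rank (LoRA-style) increments would keep the storage cost per task split affordable.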

Key words: unit test, code generation, Large Language Model (LLM), weight incremental learning, fine-tuning learning

