Journal of Computer Applications ›› 2026, Vol. 46 ›› Issue (3): 723-731. DOI: 10.11772/j.issn.1001-9081.2025040454

• Artificial Intelligence •


MG-SQL: SQL generation framework with enhanced schema linking and multi-generator collaboration

Dingjia WU1,2, Zhe CUI1()   

  1. Chengdu Institute of Computer Application, Chinese Academy of Sciences, Chengdu, Sichuan 610213, China
    2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received:2025-04-25 Revised:2025-06-11 Accepted:2025-06-12 Online:2025-06-23 Published:2026-03-10
  • Contact: Zhe CUI
  • About author: WU Dingjia, born in 1999 in Bazhong, Sichuan, M. S. candidate. His research interests include natural language processing and large language models.
  • Supported by:
    Natural Science Foundation of Sichuan Province (2024NSFSC0004)


Abstract:

To address the limitations of Large Language Models (LLMs) in generating Structured Query Language (SQL) for complex multi-table database scenarios, a Text-to-SQL framework based on multi-generator collaboration, MG-SQL (Multi-Generator SQL), was proposed. Firstly, to mitigate the noise interference caused by irrelevant schema information, an enhanced schema linking optimization method was proposed, in which initial SQL queries were generated and combined with semantic similarity-based retrieval. Secondly, to improve the quality and diversity of candidate SQL queries, a multi-strategy collaborative generation framework was built on the basis of the refined schema: 1) an experience generator was used to retrieve dynamic examples; 2) a chain-of-thought generator was used to strengthen logical reasoning; 3) a query plan generator was used to simulate database execution flows; 4) a progressive generator was used to perform iterative optimization. Thirdly, the optimal SQL was selected through a voting mechanism. Finally, a reflective learning mechanism was further proposed, in which the generated results were compared with reference SQL to form reflective samples, and a domain experience base was constructed dynamically for continuous learning. Results on the BIRD benchmark demonstrate that, with the lightweight GPT-4o-mini model, the schema linking of the proposed framework achieves a Strict Recall Rate (SRR) of 98.89% while effectively filtering out 44.91% of irrelevant columns; the SQL generated by the proposed framework achieves an EXecution accuracy (EX) of 69.69% and a Valid Efficiency Score (VES) of 79.59%, outperforming mainstream GPT-4o-based approaches and validating the effectiveness of the proposed framework in complex scenarios.
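The candidate-generation and voting stage described above can be illustrated with a minimal, hypothetical sketch. The four candidate queries below stand in for the outputs of the experience, chain-of-thought, query-plan, and progressive generators (which in MG-SQL would each be a separately prompted LLM call); candidates are grouped by execution result and one member of the largest group is returned, a common execution-based voting scheme assumed here for illustration. The toy `emp` table and all identifiers are invented for this example.

```python
# Sketch of multi-candidate SQL selection by execution-result voting.
# The generator outputs are simulated as fixed strings; nothing here
# reproduces MG-SQL's actual prompts or models.
import sqlite3
from collections import Counter

def run_query(conn, sql):
    """Execute a candidate SQL; return its result set, or None on error."""
    try:
        return tuple(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None

def select_by_voting(conn, candidates):
    """Group candidates by execution result and return the first candidate
    belonging to the largest group (majority voting over results)."""
    valid = {sql: r for sql in candidates
             if (r := run_query(conn, sql)) is not None}
    if not valid:
        return None
    winner_result, _ = Counter(valid.values()).most_common(1)[0]
    for sql in candidates:  # preserve generator order among the winners
        if valid.get(sql) == winner_result:
            return sql

# Toy database plus four simulated generator outputs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp(name TEXT, dept TEXT, salary INT)")
conn.executemany("INSERT INTO emp VALUES (?,?,?)",
                 [("a", "IT", 90), ("b", "IT", 80), ("c", "HR", 70)])
candidates = [
    # "experience" generator: ORDER BY + LIMIT formulation
    "SELECT name FROM emp WHERE dept = 'IT' ORDER BY salary DESC LIMIT 1",
    # "chain-of-thought" generator: explicit MAX subquery
    "SELECT name FROM emp WHERE dept = 'IT' AND salary = "
    "(SELECT MAX(salary) FROM emp WHERE dept = 'IT')",
    # "query plan" generator: drops the dept filter
    "SELECT name FROM emp ORDER BY salary DESC LIMIT 1",
    # "progressive" generator: draft with a typo, fails to execute
    "SELECT nme FROM emp",
]
best = select_by_voting(conn, candidates)
```

Here the first three candidates all return the same row, so they form the majority group and the first of them is selected, while the non-executable draft is discarded automatically.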

Key words: schema linking, Large Language Model (LLM), Text-to-Structured Query Language (Text-to-SQL), retrieval augmentation, In-Context Learning (ICL)
