基于语义增强模式链接的Text-to-SQL模型

doi:10.11772/j.issn.1001-9081.2023091360

《计算机应用》唯一官方网站 ›› 2024, Vol. 44 ›› Issue (9): 2689-2695.DOI: 10.11772/j.issn.1001-9081.2023091360

基于语义增强模式链接的Text-to-SQL模型

吴相岚, 肖洋, 刘梦莹, 刘明铭()

南开大学软件学院，天津 300457

收稿日期:2023-10-09 修回日期:2023-12-24 接受日期:2023-12-26 发布日期:2024-03-15 出版日期:2024-09-10
通讯作者: 刘明铭
作者简介:吴相岚（2000—），男，安徽合肥人，硕士研究生，主要研究方向：自然语言处理、SQL语句生成
肖洋（1998—），男，河北秦皇岛人，硕士研究生，主要研究方向：虚假新闻检测、SQL语句生成
刘梦莹（1999—），女，山东济宁人，硕士研究生，主要研究方向：智能软件工程、虚假新闻检测
刘明铭（1979—），女，山东东阿人，讲师，博士，主要研究方向：数据挖掘。

Text-to-SQL model based on semantic enhanced schema linking

Xianglan WU, Yang XIAO, Mengying LIU, Mingming LIU()

College of Software，Nankai University，Tianjin 300457，China

Received:2023-10-09 Revised:2023-12-24 Accepted:2023-12-26 Online:2024-03-15 Published:2024-09-10
Contact: Mingming LIU
About author:WU Xianglan， born in 2000， M. S. candidate. His research interests include natural language processing， SQL statement generation.
XIAO Yang， born in 1998， M. S. candidate. His research interests include fake news detection， SQL statement generation.
LIU Mengying， born in 1999， M. S. candidate. Her research interests include intelligent software engineering， fake news detection.

摘要/Abstract

摘要：

为优化基于异构图编码器的Text-to-SQL生成效果，提出SELSQL模型。首先，模型采用端到端的学习框架，使用双曲空间下的庞加莱距离度量替代欧氏距离度量，以此优化使用探针技术从预训练语言模型中构建的语义增强的模式链接图；其次，利用K头加权的余弦相似度以及图正则化方法学习相似度度量图使得初始模式链接图在训练中迭代优化；最后，使用改良的关系图注意力网络（RGAT）图编码器以及多头注意力机制对两个模块的联合语义模式链接图进行编码，并且使用基于语法的神经语义解码器和预定义的结构化语言进行结构化查询语言（SQL）语句解码。在Spider数据集上的实验结果表明，使用ELECTRA-large预训练模型时，SELSQL模型比最佳基线模型的准确率提升了2.5个百分点，对于复杂SQL语句生成的提升效果很大。

关键词: 模式链接, 图结构学习, 预训练语言模型, Text-to-SQL, 异构图

Abstract:

To optimize Text-to-SQL generation performance based on heterogeneous graph encoder， SELSQL model was proposed. Firstly， an end-to-end learning framework was employed by the model， and the Poincaré distance metric in hyperbolic space was used instead of the Euclidean distance metric to optimize semantically enhanced schema linking graph constructed by the pre-trained language model using probe technology. Secondly， K-head weighted cosine similarity and graph regularization method were used to learn the similarity metric graph so that the initial schema linking graph was iteratively optimized during training. Finally， the improved Relational Graph ATtention network （RGAT） graph encoder and multi-head attention mechanism were used to encode the joint semantic schema linking graphs of the two modules， and Structured Query Language （SQL） statement decoding was solved using a grammar-based neural semantic decoder and a predefined structured language. Experimental results on Spider dataset show that when using ELECTRA-large pre-training model， the accuracy of SELSQL model is increased by 2.5 percentage points compared with the best baseline model， which has a great improvement effect on the generation of complex SQL statements.

Key words: schema linking, graph structure learning, pre-trained language model, Text-to-SQL, heterogeneous graph

中图分类号:

TP183

吴相岚, 肖洋, 刘梦莹, 刘明铭. 基于语义增强模式链接的Text-to-SQL模型[J]. 计算机应用, 2024, 44(9): 2689-2695.

Xianglan WU, Yang XIAO, Mengying LIU, Mingming LIU. Text-to-SQL model based on semantic enhanced schema linking[J]. Journal of Computer Applications, 2024, 44(9): 2689-2695.

图/表 7

参考文献 33

1	BOGIN B， BERANT J， GARDNER M. Representing schema structure with graph neural networks for Text-to-SQL parsing ［C］// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg： ACL， 2019： 4560-4565.
2	SCHLICHTKRULL M， KIPF T N， BLOEM P， et al. Modeling relational data with graph convolutional networks ［C］// Proceedings of the 15th International Conference on Semantic Web. Berlin： Springer， 2018： 593-607.
3	BOGIN B， GARDNER M， BERANT J. Global reasoning over database structures for Text-to-SQL parsing ［C］// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg： ACL， 2019： 3659-3664.
4	黄君扬，王振宇，梁家卿，等.基于自裁剪异构图的NL2SQL模型［J］.计算机工程，2022，48（9）：71-77.
	HUANG J Y， WANG Z Y， LIANG J Q， et al. NL2SQL model based on self-pruning heterogeneous graph ［J］. Computer Engineering， 2022，48（9）：71-77.
5	王秋月，程路易，徐波，等.基于知识增强的NL2SQL方法［J］.智能计算机与应用，2022，12（7）：1-7.
	WANG Q Y， CHENG L Y， XU B， et al. NL2SQL method based on knowledge enhancement ［J］. Intelligent Computer and Applications， 2022，12（7）：1-7.
6	WANG B， SHIN R， LIU X， et al. RAT-SQL： relation-aware schema encoding and linking for Text-to-SQL parsers ［C］// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg： ACL， 2020： 7567-7578.
7	SHAW P， USZKOREIT J， VASWANI A. Self-attention with relative position representations ［C］// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics： Human Language Technologies， Volume 2 （Short Papers）. Stroudsburg： ACL， 2018： 464-468.
8	CAO R， CHEN L， CHEN Z， et al. LGESQL： line graph enhanced Text-to-SQL model with mixed local and non-local relations ［C］// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing （Volume 1： Long Papers）. Stroudsburg： ACL， 2021： 2541-2555.
9	WANG K， SHEN W， YANG Y， et al. Relational graph attention network for aspect-based sentiment analysis ［C］// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg： ACL， 2020： 3229-3238.
10	GAN Y， PURVER M， WOODWARD J R. A review of cross-domain Text-to-SQL models ［C］// Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing： Student Research Workshop. Stroudsburg： ACL， 2020： 108-115.
11	GUO J， ZHAN Z， GAO Y， et al. Towards complex Text-to-SQL in cross-domain database with intermediate representation ［C］// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg： ACL， 2019： 4524-4535.
12	LI S， HU X， LIN L， et al. A multi-level supervised contrastive learning framework for low-resource natural language inference［J］. IEEE/ACM Transactions on Audio， Speech， and Language Processing， 2023， 31： 1771-1783.
13	XU P， KUMAR D， YANG W， et al. Optimizing deeper Transformers on small datasets ［C］// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing （Volume 1： Long Papers）. Stroudsburg： ACL， 2021： 2089-2102.
14	LIU A， HU X， LIN L， et al. Semantic enhanced Text-to-SQL parsing via iteratively learning schema linking graph ［C］// Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. New York： ACM， 2022： 1021-1030.
15	LIM D， HOHNE F， LI X， et al. Large scale learning on non-homophilous graphs： new benchmarks and strong simple methods ［C］// Proceedings of the 35th Conference on Neural Information Processing Systems. Red Hook： Curran Associates Inc.， 2021： 20887-20902.
16	LIM D， LI X， HOHNE F， et al. New benchmarks for learning on non-homophilous graphs ［EB/OL］. ［2023-02-14］. .
17	BO D， WANG X， SHI C， et al. Beyond low-frequency information in graph convolutional networks ［C］// Proceedings of the 2021 AAAI Conference on Artificial Intelligence. Palo Alto： AAAI Press， 2021， 35（5）： 3950-3957.
18	YANG L， LI M， LIU L， et al. Diverse message passing for attribute with heterophily ［C］// Proceedings of the 35th Conference on Neural Information Processing Systems. Red Hook： Curran Associates Inc.， 2021： 4751-4763.
19	WANG X， JI H， SHI C， et al. Heterogeneous graph attention network ［C］// Proceedings of the 2019 World Wide Web Conference. New York： ACM， 2019： 2022-2032.
20	ZHOU J， CUI G， HU S， et al. Graph neural networks： a review of methods and applications ［J］. AI Open， 2020， 1： 57-81.
21	ANASTASIU D C， KARYPIS G. L2Knng： fast exact k-nearest neighbor graph construction with l2-norm pruning ［C］// Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. New York： ACM， 2015： 791-800.
22	CHEN Y， WU L， ZAKI M J. Iterative deep graph learning for graph neural networks： better and robust node embeddings ［C］// Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook： Curran Associates Inc.， 2020： 19314-19326.
23	FRANCESCHI L， NIEPERT M， PONTIL M， et al. Learning discrete structures for graph neural networks ［C］// Proceedings of the 36th International Conference on Machine Learning. New York： PMLR， 2019， 97： 1972-1982.
24	ZHAO J， WANG X， SHI C， et al. Heterogeneous graph structure learning for graph neural networks ［C］// Proceedings of the 2021 AAAI Conference on Artificial Intelligence. Palo Alto： AAAI Press， 2021， 35（5）： 4697-4705.
25	WU Z， CHEN Y， KAO B， et al. Perturbed masking： parameter-free probing for analyzing and interpreting BERT ［C］// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg： ACL， 2020： 4166-4176.
26	DEVLIN J， CHANG M-W， LEE K， et al. BERT： pre-training of deep bidirectional transformers for language understanding ［C］// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics： Human Language Technologies， Volume 1 （Long and Short Papers）. Stroudsburg： ACL， 2019： 4171-4186.
27	NICKEL M， KIELA D. Poincaré embeddings for learning hierarchical representations ［C］// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook： Curran Associates Inc.， 2017： 6341-6350.
28	CHAMI I， YING R， RE C， et al. Hyperbolic graph convolutional neural networks ［C］// Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook： Curran Associates Inc.， 2019： 4868-4879.
29	YIN P， NEUBIG G. A syntactic neural model for general-purpose code generation ［C］// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics （Volume 1： Long Papers）. Stroudsburg： ACL， 2017： 440-450.
30	GAN Y， CHEN X， XIE J， et al. Natural SQL： making SQL easier to infer from natural language specifications ［C］// Proceedings of the 2021 Findings of the Association for Computational Linguistics. Stroudsburg： ACL， 2021： 2030-2042.
31	YU T， ZHANG R， YANG K， et al. Spider： a large-scale human-labeled dataset for complex and cross-domain semantic parsing and Text-to-SQL task ［C］// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg： ACL， 2018： 3911-3921.
32	RUBIN O， BERANT J. SmBoP： semi-autoregressive bottom-up semantic parsing ［C］// Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics： Human Language Technologies. Stroudsburg： ACL， 2021： 311-324.
33	LOSHCHILOV I， HUTTER F. Decoupled weight decay regularization ［EB/OL］. ［2023-10-11］. .

错误类型	错误占比/%	错误类型	错误占比/%
模式链接	44.7	condition	11.1
JOIN	23.3	嵌套	11.3
GROUG BY	9.6

错误类型	错误占比/%	错误类型	错误占比/%
模式链接	44.7	condition	11.1
JOIN	23.3	嵌套	11.3
GROUG BY	9.6

预训练模型	模型	准确率
BERT-large	RAT-SQL	69.7
	SMOP	—
	LGESQL	74.1
	ISESL	74.7
	SELSQL	77.1
Model Adaptive PLM	RAT-SQL+Grappa	73.4
	SMOP+Grappa	74.1
	LGESQL+ELECTRA-large	75.1
	ISESL+ELECTRA-large	75.8
	SELSQL+ELECTRA-large	78.3

预训练模型	模型	准确率
BERT-large	RAT-SQL	69.7
	SMOP	—
	LGESQL	74.1
	ISESL	74.7
	SELSQL	77.1
Model Adaptive PLM	RAT-SQL+Grappa	73.4
	SMOP+Grappa	74.1
	LGESQL+ELECTRA-large	75.1
	ISESL+ELECTRA-large	75.8
	SELSQL+ELECTRA-large	78.3

模型	准确率/%	下降百分点
SELSQL	78.3	—
移除初始图生成	76.5	1.8
移除相似度度量图	76.7	1.6
移除语义模式链接	75.5	2.7
移除庞加莱距离度量	77.5	0.8
移除NatSQL	77.5	0.8

基于语义增强模式链接的Text-to-SQL模型

Text-to-SQL model based on semantic enhanced schema linking

RichHTML

PDF

可视化

摘要/Abstract

引用本文

使用本文

图/表 7

参考文献 33

相关文章 15

编辑推荐

Metrics

组件	不同模型的F1分数/%			提升百分点
组件	RAT-SQL	LGESQL	SELSQL	提升百分点
SELECT	82.2	92.2	93.4	1.2
SELECT（no AGG）	84.5	93.5	95.0	1.5
WHERE	82.5	81.2	81.4	0.2
WHERE（NO OP）	86.1	84.9	85.5	0.6
Group by（no having）	60.1	84.6	88.6	4.0
Group by	64.2	80.8	86.0	5.2
Order by	81.4	85.0	86.8	1.8
And/or	98.4	98.2	98.4	0.2
IUEN	37.5	62.7	64.2	1.5
keywords	91.3	91.1	91.4	0.3

难度等级	不同模型的准确率/%			提升百分点
难度等级	RAT-SQL	LGESQL	SELSQL	提升百分点
all	71.2	75.1	78.3	3.2
easy	87.9	91.9	92.9	1.0
medium	74.6	78.3	80.5	2.2
hard	60.3	64.9	70.3	5.4
extra	48.7	52.4	57.2	4.8

样本序号	难度等级	案例	LGESQL	SELSQL
1	hard	Find the average age of the dogs who went through treatments.	SELECT AVG（Dogs.age） FROM Dogs JOIN Treatments	SELECT AVG（Dogs.age） FROM Dogs WHERE Dogs.dog_id IN （ SELECT Treatments.dog_id FROM Treatments ）
2	hard	What are the names of the dogs for which the owner has not spend more than 1000 for treatment？	SELECT Dogs.name FROM Dogs WHERE Dogs.owner_id NOT IN （ SELECT Treatments.dog_id FROM Treatments GROUP BY Dogs.owner_id HAVING COUNT（Treatments.cost_of_treatment） > "value" ）	SELECT Dogs.name FROM Dogs WHERE Dogs.dog_id NOT IN （ SELECT Treatments.dog_id FROM Treatments GROUP BY Treatments.dog_id HAVING SUM（Treatments.cost_of_treatment） > "value" ）
3	extra	Which owner has paid the largest amount of money in total for their dogs？ Show the owner id and zip code.	SELECT Owners.owner_id， Owners.zip_code FROM Owners JOIN Dogs JOIN Charges GROUP BY Owners.owner_id ORDER BY SUM（Charges.charge_amount） DESC LIMIT 1	SELECT Owners.owner_id， Owners.zip_code FROM Owners JOIN Dogs JOIN Treatments GROUP BY Owners.owner_id ORDER BY SUM（Treatments.cost_of_treatment） DESC LIMIT 1
4	hard	Find the name， population and expected life length of asian country with the largest area？	SELECT country.Name， country.Population， country.LifeExpectancy FROM country WHERE country.Continent = "value" ORDER BY country.SurfaceArea DESC LIMIT 1	SELECT country.Name， country.Population， country.LifeExpectancy FROM country WHERE country.SurfaceArea = "value" AND country.SurfaceArea = （ SELECT MAX（country.SurfaceArea） FROM country WHERE country.Continent = "value" ）

[1]	杨帆, 邹窈, 朱明志, 马振伟, 程大伟, 蒋昌俊. 基于图注意力Transformer神经网络的信用卡欺诈检测模型[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2634-2642.
[2]	毛典辉, 李学博, 刘峻岭, 张登辉, 颜文婧. 基于并行异构图和序列注意力机制的中文实体关系抽取模型[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2018-2025.
[3]	魏超, 陈艳平, 王凯, 秦永彬, 黄瑞章. 基于掩码提示与门控记忆网络校准的关系抽取方法[J]. 《计算机应用》唯一官方网站, 2024, 44(6): 1713-1719.
[4]	姚迅, 秦忠正, 杨捷. 生成式标签对抗的文本分类模型[J]. 《计算机应用》唯一官方网站, 2024, 44(6): 1781-1785.
[5]	王楷天, 叶青, 程春雷. 基于异构图表示的中医电子病历分类方法[J]. 《计算机应用》唯一官方网站, 2024, 44(2): 411-417.
[6]	马国帅, 钱宇华, 张亚宇, 李俊霞, 刘郭庆. 动态异构信息融合的科研合作潜力预测[J]. 《计算机应用》唯一官方网站, 2023, 43(9): 2775-2783.
[7]	黄梦林, 段磊, 张袁昊, 王培妍, 李仁昊. 基于Prompt学习的无监督关系抽取模型[J]. 《计算机应用》唯一官方网站, 2023, 43(7): 2010-2016.
[8]	高永兵, 高军甜, 马蓉, 杨立东. 用户粒度级的个性化社交文本生成模型[J]. 《计算机应用》唯一官方网站, 2023, 43(4): 1021-1028.
[9]	许亮, 张春, 张宁, 田雪涛. 融合多Prompt模板的零样本关系抽取模型[J]. 《计算机应用》唯一官方网站, 2023, 43(12): 3668-3675.
[10]	江静, 陈渝, 孙界平, 琚生根. 融合后验概率校准训练的文本分类算法[J]. 《计算机应用》唯一官方网站, 2022, 42(6): 1789-1795.
[11]	张海丰, 曾诚, 潘列, 郝儒松, 温超东, 何鹏. 结合BERT和特征投影网络的新闻主题文本分类方法[J]. 《计算机应用》唯一官方网站, 2022, 42(4): 1116-1124.
[12]	吕剑清, 王先兵, 陈刚, 张华, 王明刚. 面向工业生产的中文Text-to-SQL模型[J]. 《计算机应用》唯一官方网站, 2022, 42(10): 2996-3002.
[13]	王小鹏, 孙媛媛, 林鸿飞. 基于刑事Electra的编-解码关系抽取模型[J]. 《计算机应用》唯一官方网站, 2022, 42(1): 87-93.
[14]	张蓉, 张献国. 基于层次异构图注意力网络的虚假评论检测[J]. 计算机应用, 2021, 41(5): 1275-1281.
[15]	毕蓓, 潘慧瑶, 陈峰, 隋京言, 高扬, 王耀君. 基于异构图注意力网络的微博谣言监测模型[J]. 《计算机应用》唯一官方网站, 2021, 41(12): 3546-3550.