《计算机应用》唯一官方网站 ›› 2024, Vol. 44 ›› Issue (9): 2689-2695.DOI: 10.11772/j.issn.1001-9081.2023091360
收稿日期:
2023-10-09
修回日期:
2023-12-24
接受日期:
2023-12-26
发布日期:
2024-03-15
出版日期:
2024-09-10
通讯作者:
刘明铭
作者简介:
吴相岚(2000—),男,安徽合肥人,硕士研究生,主要研究方向:自然语言处理、SQL语句生成
Xianglan WU, Yang XIAO, Mengying LIU, Mingming LIU()
Received:
2023-10-09
Revised:
2023-12-24
Accepted:
2023-12-26
Online:
2024-03-15
Published:
2024-09-10
Contact:
Mingming LIU
About author:
WU Xianglan, born in 2000, M. S. candidate. His research interests include natural language processing, SQL statement generation.摘要:
为优化基于异构图编码器的Text-to-SQL生成效果,提出SELSQL模型。首先,模型采用端到端的学习框架,使用双曲空间下的庞加莱距离度量替代欧氏距离度量,以此优化使用探针技术从预训练语言模型中构建的语义增强的模式链接图;其次,利用K头加权的余弦相似度以及图正则化方法学习相似度度量图使得初始模式链接图在训练中迭代优化;最后,使用改良的关系图注意力网络(RGAT)图编码器以及多头注意力机制对两个模块的联合语义模式链接图进行编码,并且使用基于语法的神经语义解码器和预定义的结构化语言进行结构化查询语言(SQL)语句解码。在Spider数据集上的实验结果表明,使用ELECTRA-large预训练模型时,SELSQL模型比最佳基线模型的准确率提升了2.5个百分点,对于复杂SQL语句生成的提升效果很大。
中图分类号:
吴相岚, 肖洋, 刘梦莹, 刘明铭. 基于语义增强模式链接的Text-to-SQL模型[J]. 计算机应用, 2024, 44(9): 2689-2695.
Xianglan WU, Yang XIAO, Mengying LIU, Mingming LIU. Text-to-SQL model based on semantic enhanced schema linking[J]. Journal of Computer Applications, 2024, 44(9): 2689-2695.
错误类型 | 错误占比/% | 错误类型 | 错误占比/% |
---|---|---|---|
模式链接 | 44.7 | condition | 11.1 |
JOIN | 23.3 | 嵌套 | 11.3 |
GROUG BY | 9.6 |
表1 LGESQL SQL语句生成错误比例
Tab. 1 LGESQL SQL sentence generation error ratio
错误类型 | 错误占比/% | 错误类型 | 错误占比/% |
---|---|---|---|
模式链接 | 44.7 | condition | 11.1 |
JOIN | 23.3 | 嵌套 | 11.3 |
GROUG BY | 9.6 |
预训练模型 | 模型 | 准确率 |
---|---|---|
BERT-large | RAT-SQL | 69.7 |
SMOP | — | |
LGESQL | 74.1 | |
ISESL | 74.7 | |
SELSQL | 77.1 | |
Model Adaptive PLM | RAT-SQL+Grappa | 73.4 |
SMOP+Grappa | 74.1 | |
LGESQL+ELECTRA-large | 75.1 | |
ISESL+ELECTRA-large | 75.8 | |
SELSQL+ELECTRA-large | 78.3 |
表2 Spider数据集上不同模型的匹配准确率 (%)
Tab. 2 Matching accuracies of different models on Spider dataset
预训练模型 | 模型 | 准确率 |
---|---|---|
BERT-large | RAT-SQL | 69.7 |
SMOP | — | |
LGESQL | 74.1 | |
ISESL | 74.7 | |
SELSQL | 77.1 | |
Model Adaptive PLM | RAT-SQL+Grappa | 73.4 |
SMOP+Grappa | 74.1 | |
LGESQL+ELECTRA-large | 75.1 | |
ISESL+ELECTRA-large | 75.8 | |
SELSQL+ELECTRA-large | 78.3 |
模型 | 准确率/% | 下降百分点 |
---|---|---|
SELSQL | 78.3 | — |
移除初始图生成 | 76.5 | 1.8 |
移除相似度度量图 | 76.7 | 1.6 |
移除语义模式链接 | 75.5 | 2.7 |
移除庞加莱距离度量 | 77.5 | 0.8 |
移除NatSQL | 77.5 | 0.8 |
表3 各模块的消融测试结果
Tab. 3 Ablation test results of various modules
模型 | 准确率/% | 下降百分点 |
---|---|---|
SELSQL | 78.3 | — |
移除初始图生成 | 76.5 | 1.8 |
移除相似度度量图 | 76.7 | 1.6 |
移除语义模式链接 | 75.5 | 2.7 |
移除庞加莱距离度量 | 77.5 | 0.8 |
移除NatSQL | 77.5 | 0.8 |
组件 | 不同模型的F1分数/% | 提升 百分点 | ||
---|---|---|---|---|
RAT-SQL | LGESQL | SELSQL | ||
SELECT | 82.2 | 92.2 | 93.4 | 1.2 |
SELECT(no AGG) | 84.5 | 93.5 | 95.0 | 1.5 |
WHERE | 82.5 | 81.2 | 81.4 | 0.2 |
WHERE(NO OP) | 86.1 | 84.9 | 85.5 | 0.6 |
Group by(no having) | 60.1 | 84.6 | 88.6 | 4.0 |
Group by | 64.2 | 80.8 | 86.0 | 5.2 |
Order by | 81.4 | 85.0 | 86.8 | 1.8 |
And/or | 98.4 | 98.2 | 98.4 | 0.2 |
IUEN | 37.5 | 62.7 | 64.2 | 1.5 |
keywords | 91.3 | 91.1 | 91.4 | 0.3 |
表4 Spider数据集上各组件匹配的F1分数
Tab. 4 F1 scores matched by components on Spider dataset
组件 | 不同模型的F1分数/% | 提升 百分点 | ||
---|---|---|---|---|
RAT-SQL | LGESQL | SELSQL | ||
SELECT | 82.2 | 92.2 | 93.4 | 1.2 |
SELECT(no AGG) | 84.5 | 93.5 | 95.0 | 1.5 |
WHERE | 82.5 | 81.2 | 81.4 | 0.2 |
WHERE(NO OP) | 86.1 | 84.9 | 85.5 | 0.6 |
Group by(no having) | 60.1 | 84.6 | 88.6 | 4.0 |
Group by | 64.2 | 80.8 | 86.0 | 5.2 |
Order by | 81.4 | 85.0 | 86.8 | 1.8 |
And/or | 98.4 | 98.2 | 98.4 | 0.2 |
IUEN | 37.5 | 62.7 | 64.2 | 1.5 |
keywords | 91.3 | 91.1 | 91.4 | 0.3 |
难度等级 | 不同模型的准确率/% | 提升 百分点 | ||
---|---|---|---|---|
RAT-SQL | LGESQL | SELSQL | ||
all | 71.2 | 75.1 | 78.3 | 3.2 |
easy | 87.9 | 91.9 | 92.9 | 1.0 |
medium | 74.6 | 78.3 | 80.5 | 2.2 |
hard | 60.3 | 64.9 | 70.3 | 5.4 |
extra | 48.7 | 52.4 | 57.2 | 4.8 |
表5 Spider数据集上各难度等级精确匹配准确率
Tab. 5 Exact matching accuracy of each difficulty level on Spider dataset
难度等级 | 不同模型的准确率/% | 提升 百分点 | ||
---|---|---|---|---|
RAT-SQL | LGESQL | SELSQL | ||
all | 71.2 | 75.1 | 78.3 | 3.2 |
easy | 87.9 | 91.9 | 92.9 | 1.0 |
medium | 74.6 | 78.3 | 80.5 | 2.2 |
hard | 60.3 | 64.9 | 70.3 | 5.4 |
extra | 48.7 | 52.4 | 57.2 | 4.8 |
样本序号 | 难度等级 | 案例 | LGESQL | SELSQL |
---|---|---|---|---|
1 | hard | Find the average age of the dogs who went through treatments. | SELECT AVG(Dogs.age) FROM Dogs JOIN Treatments | SELECT AVG(Dogs.age) FROM Dogs WHERE Dogs.dog_id IN ( SELECT Treatments.dog_id FROM Treatments ) |
2 | hard | What are the names of the dogs for which the owner has not spend more than 1000 for treatment? | SELECT Dogs.name FROM Dogs WHERE Dogs.owner_id NOT IN ( SELECT Treatments.dog_id FROM Treatments GROUP BY Dogs.owner_id HAVING COUNT(Treatments.cost_of_treatment) > "value" ) | SELECT Dogs.name FROM Dogs WHERE Dogs.dog_id NOT IN ( SELECT Treatments.dog_id FROM Treatments GROUP BY Treatments.dog_id HAVING SUM(Treatments.cost_of_treatment) > "value" ) |
3 | extra | Which owner has paid the largest amount of money in total for their dogs? Show the owner id and zip code. | SELECT Owners.owner_id, Owners.zip_code FROM Owners JOIN Dogs JOIN Charges GROUP BY Owners.owner_id ORDER BY SUM(Charges.charge_amount) DESC LIMIT 1 | SELECT Owners.owner_id, Owners.zip_code FROM Owners JOIN Dogs JOIN Treatments GROUP BY Owners.owner_id ORDER BY SUM(Treatments.cost_of_treatment) DESC LIMIT 1 |
4 | hard | Find the name, population and expected life length of asian country with the largest area? | SELECT country.Name, country.Population, country.LifeExpectancy FROM country WHERE country.Continent = "value" ORDER BY country.SurfaceArea DESC LIMIT 1 | SELECT country.Name, country.Population, country.LifeExpectancy FROM country WHERE country.SurfaceArea = "value" AND country.SurfaceArea = ( SELECT MAX(country.SurfaceArea) FROM country WHERE country.Continent = "value" ) |
表6 不同模型在Spider数据集上生成的SQL语句对比
Tab. 6 Comparison of SQL statements generated by different models on Spider dataset
样本序号 | 难度等级 | 案例 | LGESQL | SELSQL |
---|---|---|---|---|
1 | hard | Find the average age of the dogs who went through treatments. | SELECT AVG(Dogs.age) FROM Dogs JOIN Treatments | SELECT AVG(Dogs.age) FROM Dogs WHERE Dogs.dog_id IN ( SELECT Treatments.dog_id FROM Treatments ) |
2 | hard | What are the names of the dogs for which the owner has not spend more than 1000 for treatment? | SELECT Dogs.name FROM Dogs WHERE Dogs.owner_id NOT IN ( SELECT Treatments.dog_id FROM Treatments GROUP BY Dogs.owner_id HAVING COUNT(Treatments.cost_of_treatment) > "value" ) | SELECT Dogs.name FROM Dogs WHERE Dogs.dog_id NOT IN ( SELECT Treatments.dog_id FROM Treatments GROUP BY Treatments.dog_id HAVING SUM(Treatments.cost_of_treatment) > "value" ) |
3 | extra | Which owner has paid the largest amount of money in total for their dogs? Show the owner id and zip code. | SELECT Owners.owner_id, Owners.zip_code FROM Owners JOIN Dogs JOIN Charges GROUP BY Owners.owner_id ORDER BY SUM(Charges.charge_amount) DESC LIMIT 1 | SELECT Owners.owner_id, Owners.zip_code FROM Owners JOIN Dogs JOIN Treatments GROUP BY Owners.owner_id ORDER BY SUM(Treatments.cost_of_treatment) DESC LIMIT 1 |
4 | hard | Find the name, population and expected life length of asian country with the largest area? | SELECT country.Name, country.Population, country.LifeExpectancy FROM country WHERE country.Continent = "value" ORDER BY country.SurfaceArea DESC LIMIT 1 | SELECT country.Name, country.Population, country.LifeExpectancy FROM country WHERE country.SurfaceArea = "value" AND country.SurfaceArea = ( SELECT MAX(country.SurfaceArea) FROM country WHERE country.Continent = "value" ) |
1 | BOGIN B, BERANT J, GARDNER M. Representing schema structure with graph neural networks for Text-to-SQL parsing [C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2019: 4560-4565. |
2 | SCHLICHTKRULL M, KIPF T N, BLOEM P, et al. Modeling relational data with graph convolutional networks [C]// Proceedings of the 15th International Conference on Semantic Web. Berlin: Springer, 2018: 593-607. |
3 | BOGIN B, GARDNER M, BERANT J. Global reasoning over database structures for Text-to-SQL parsing [C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg: ACL, 2019: 3659-3664. |
4 | 黄君扬,王振宇,梁家卿,等.基于自裁剪异构图的NL2SQL模型[J].计算机工程,2022,48(9):71-77. |
HUANG J Y, WANG Z Y, LIANG J Q, et al. NL2SQL model based on self-pruning heterogeneous graph [J]. Computer Engineering, 2022,48(9):71-77. | |
5 | 王秋月,程路易,徐波,等.基于知识增强的NL2SQL方法[J].智能计算机与应用,2022,12(7):1-7. |
WANG Q Y, CHENG L Y, XU B, et al. NL2SQL method based on knowledge enhancement [J]. Intelligent Computer and Applications, 2022,12(7):1-7. | |
6 | WANG B, SHIN R, LIU X, et al. RAT-SQL: relation-aware schema encoding and linking for Text-to-SQL parsers [C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2020: 7567-7578. |
7 | SHAW P, USZKOREIT J, VASWANI A. Self-attention with relative position representations [C]// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Stroudsburg: ACL, 2018: 464-468. |
8 | CAO R, CHEN L, CHEN Z, et al. LGESQL: line graph enhanced Text-to-SQL model with mixed local and non-local relations [C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Stroudsburg: ACL, 2021: 2541-2555. |
9 | WANG K, SHEN W, YANG Y, et al. Relational graph attention network for aspect-based sentiment analysis [C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2020: 3229-3238. |
10 | GAN Y, PURVER M, WOODWARD J R. A review of cross-domain Text-to-SQL models [C]// Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop. Stroudsburg: ACL, 2020: 108-115. |
11 | GUO J, ZHAN Z, GAO Y, et al. Towards complex Text-to-SQL in cross-domain database with intermediate representation [C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2019: 4524-4535. |
12 | LI S, HU X, LIN L, et al. A multi-level supervised contrastive learning framework for low-resource natural language inference[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, 31: 1771-1783. |
13 | XU P, KUMAR D, YANG W, et al. Optimizing deeper Transformers on small datasets [C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Stroudsburg: ACL, 2021: 2089-2102. |
14 | LIU A, HU X, LIN L, et al. Semantic enhanced Text-to-SQL parsing via iteratively learning schema linking graph [C]// Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. New York: ACM, 2022: 1021-1030. |
15 | LIM D, HOHNE F, LI X, et al. Large scale learning on non-homophilous graphs: new benchmarks and strong simple methods [C]// Proceedings of the 35th Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2021: 20887-20902. |
16 | LIM D, LI X, HOHNE F, et al. New benchmarks for learning on non-homophilous graphs [EB/OL]. [2023-02-14]. . |
17 | BO D, WANG X, SHI C, et al. Beyond low-frequency information in graph convolutional networks [C]// Proceedings of the 2021 AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2021, 35(5): 3950-3957. |
18 | YANG L, LI M, LIU L, et al. Diverse message passing for attribute with heterophily [C]// Proceedings of the 35th Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2021: 4751-4763. |
19 | WANG X, JI H, SHI C, et al. Heterogeneous graph attention network [C]// Proceedings of the 2019 World Wide Web Conference. New York: ACM, 2019: 2022-2032. |
20 | ZHOU J, CUI G, HU S, et al. Graph neural networks: a review of methods and applications [J]. AI Open, 2020, 1: 57-81. |
21 | ANASTASIU D C, KARYPIS G. L2Knng: fast exact k-nearest neighbor graph construction with l2-norm pruning [C]// Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. New York: ACM, 2015: 791-800. |
22 | CHEN Y, WU L, ZAKI M J. Iterative deep graph learning for graph neural networks: better and robust node embeddings [C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2020: 19314-19326. |
23 | FRANCESCHI L, NIEPERT M, PONTIL M, et al. Learning discrete structures for graph neural networks [C]// Proceedings of the 36th International Conference on Machine Learning. New York: PMLR, 2019, 97: 1972-1982. |
24 | ZHAO J, WANG X, SHI C, et al. Heterogeneous graph structure learning for graph neural networks [C]// Proceedings of the 2021 AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2021, 35(5): 4697-4705. |
25 | WU Z, CHEN Y, KAO B, et al. Perturbed masking: parameter-free probing for analyzing and interpreting BERT [C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2020: 4166-4176. |
26 | DEVLIN J, CHANG M-W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding [C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg: ACL, 2019: 4171-4186. |
27 | NICKEL M, KIELA D. Poincaré embeddings for learning hierarchical representations [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6341-6350. |
28 | CHAMI I, YING R, RE C, et al. Hyperbolic graph convolutional neural networks [C]// Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2019: 4868-4879. |
29 | YIN P, NEUBIG G. A syntactic neural model for general-purpose code generation [C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: ACL, 2017: 440-450. |
30 | GAN Y, CHEN X, XIE J, et al. Natural SQL: making SQL easier to infer from natural language specifications [C]// Proceedings of the 2021 Findings of the Association for Computational Linguistics. Stroudsburg: ACL, 2021: 2030-2042. |
31 | YU T, ZHANG R, YANG K, et al. Spider: a large-scale human-labeled dataset for complex and cross-domain semantic parsing and Text-to-SQL task [C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2018: 3911-3921. |
32 | RUBIN O, BERANT J. SmBoP: semi-autoregressive bottom-up semantic parsing [C]// Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: ACL, 2021: 311-324. |
33 | LOSHCHILOV I, HUTTER F. Decoupled weight decay regularization [EB/OL]. [2023-10-11]. . |
[1] | 杨帆, 邹窈, 朱明志, 马振伟, 程大伟, 蒋昌俊. 基于图注意力Transformer神经网络的信用卡欺诈检测模型[J]. 《计算机应用》唯一官方网站, 2024, 44(8): 2634-2642. |
[2] | 毛典辉, 李学博, 刘峻岭, 张登辉, 颜文婧. 基于并行异构图和序列注意力机制的中文实体关系抽取模型[J]. 《计算机应用》唯一官方网站, 2024, 44(7): 2018-2025. |
[3] | 魏超, 陈艳平, 王凯, 秦永彬, 黄瑞章. 基于掩码提示与门控记忆网络校准的关系抽取方法[J]. 《计算机应用》唯一官方网站, 2024, 44(6): 1713-1719. |
[4] | 姚迅, 秦忠正, 杨捷. 生成式标签对抗的文本分类模型[J]. 《计算机应用》唯一官方网站, 2024, 44(6): 1781-1785. |
[5] | 王楷天, 叶青, 程春雷. 基于异构图表示的中医电子病历分类方法[J]. 《计算机应用》唯一官方网站, 2024, 44(2): 411-417. |
[6] | 马国帅, 钱宇华, 张亚宇, 李俊霞, 刘郭庆. 动态异构信息融合的科研合作潜力预测[J]. 《计算机应用》唯一官方网站, 2023, 43(9): 2775-2783. |
[7] | 黄梦林, 段磊, 张袁昊, 王培妍, 李仁昊. 基于Prompt学习的无监督关系抽取模型[J]. 《计算机应用》唯一官方网站, 2023, 43(7): 2010-2016. |
[8] | 高永兵, 高军甜, 马蓉, 杨立东. 用户粒度级的个性化社交文本生成模型[J]. 《计算机应用》唯一官方网站, 2023, 43(4): 1021-1028. |
[9] | 许亮, 张春, 张宁, 田雪涛. 融合多Prompt模板的零样本关系抽取模型[J]. 《计算机应用》唯一官方网站, 2023, 43(12): 3668-3675. |
[10] | 江静, 陈渝, 孙界平, 琚生根. 融合后验概率校准训练的文本分类算法[J]. 《计算机应用》唯一官方网站, 2022, 42(6): 1789-1795. |
[11] | 张海丰, 曾诚, 潘列, 郝儒松, 温超东, 何鹏. 结合BERT和特征投影网络的新闻主题文本分类方法[J]. 《计算机应用》唯一官方网站, 2022, 42(4): 1116-1124. |
[12] | 吕剑清, 王先兵, 陈刚, 张华, 王明刚. 面向工业生产的中文Text-to-SQL模型[J]. 《计算机应用》唯一官方网站, 2022, 42(10): 2996-3002. |
[13] | 王小鹏, 孙媛媛, 林鸿飞. 基于刑事Electra的编-解码关系抽取模型[J]. 《计算机应用》唯一官方网站, 2022, 42(1): 87-93. |
[14] | 张蓉, 张献国. 基于层次异构图注意力网络的虚假评论检测[J]. 计算机应用, 2021, 41(5): 1275-1281. |
[15] | 毕蓓, 潘慧瑶, 陈峰, 隋京言, 高扬, 王耀君. 基于异构图注意力网络的微博谣言监测模型[J]. 《计算机应用》唯一官方网站, 2021, 41(12): 3546-3550. |
阅读次数 | ||||||
全文 |
|
|||||
摘要 |
|
|||||