Chinese Text-to-SQL model for industrial production

doi:10.11772/j.issn.1001-9081.2021081525

Journal of Computer Applications, 2022, Vol. 42, Issue (10): 2996-3002. DOI: 10.11772/j.issn.1001-9081.2021081525

• Artificial intelligence •

Jianqing LYU¹, Xianbing WANG¹, Gang CHEN¹, Hua ZHANG², Minggang WANG³

^1. Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education (Wuhan University), Wuhan Hubei 430072, China
^2. School of Computer Science, Wuhan University, Wuhan Hubei 430072, China
^3. Zunyi Aluminum Industry Company Limited, Zunyi Guizhou 563100, China

Received: 2021-08-27 Revised: 2021-11-20 Accepted: 2021-11-24 Online: 2022-01-07 Published: 2022-10-10
Contact: Xianbing WANG
About author: LYU Jianqing, born in 1998, M. S. candidate. His research interests include big data and artificial intelligence.
WANG Xianbing, born in 1972, Ph. D., associate professor. His research interests include data mining and analysis, and computer vision.
CHEN Gang, born in 1970, Ph. D., professor. His research interests include cyber security and artificial intelligence.
ZHANG Hua, born in 1973, Ph. D., lecturer. His research interests include big data mining, management and analysis.
WANG Minggang, born in 1979, M. S., senior engineer. His research interests include intelligent manufacturing for aluminum smelting.
Supported by:
National Natural Science Foundation of China (51977155)

Chinese title: 面向工业生产的中文Text-to-SQL模型

Authors (Chinese names): 吕剑清¹, 王先兵¹, 陈刚¹, 张华², 王明刚³

Corresponding author: 王先兵 (Xianbing WANG), xbwang@whu.edu.cn
First contact: 吕剑清 (Jianqing LYU)
Additional author details from the Chinese record: 吕剑清 is from Huanggang, Hubei; 王先兵 is from Jiangling, Hubei; 陈刚 is from Wuhan, Hubei, and is a doctoral supervisor; 张华 is from Xiantao, Hubei; 王明刚 is from Guiyang, Guizhou.

Abstract:

When a model that translates English natural-language questions into Structured Query Language (SQL) statements (Text-to-SQL) is migrated to a Chinese industrial Text-to-SQL task, its exact match accuracy drops. Industrial datasets are poorly interpretable and highly dispersed, so the table names and column names in the database are often represented differently from the key information in the questions, and the column names a question refers to are often only implied by its semantics. To address these migration problems, corresponding solutions were proposed and a modified model was constructed. First, factory metadata was incorporated during data use to resolve both the inconsistent representation and the implicit column names. Then, based on the characteristics of Chinese expression, a self-attention model based on relative position was used to identify the values of the WHERE clause directly from the question and the database schema information. Finally, based on the characteristics of industrial queries, a fine-tuned Bidirectional Encoder Representations from Transformers (BERT) model was used to classify questions in order to improve the accuracy of SQL statement structure prediction. An industrial dataset based on the aluminum smelting industry was constructed, and the model was verified experimentally on it. The results show that the proposed model achieves an exact match accuracy of 74.2% on the industrial test set. Comparison with the results of mainstream models on the English dataset Spider indicates that the proposed model can effectively handle the Chinese industrial Text-to-SQL task.
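
The relative-position self-attention mentioned in the abstract corresponds in name to the technique of reference 24 (Shaw et al.); whether the authors use exactly that formulation is an assumption. As a minimal single-head sketch of the general mechanism, assuming the query, key and value vectors are already projected, a clipped relative distance selects an extra learned vector that is added to each key and each value. All names (`relative_self_attention`, `rel_k`, `rel_v`, `max_dist`) are hypothetical, and the abstract does not say how the WHERE values are decoded from the attended output.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def clip(distance, max_dist):
    """Clip a relative distance j - i to the range [-max_dist, max_dist]."""
    return max(-max_dist, min(max_dist, distance))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def relative_self_attention(q, k, v, rel_k, rel_v, max_dist):
    """Single-head self-attention with relative position vectors.

    q, k, v: lists of n already-projected vectors (assumption of this sketch).
    rel_k, rel_v: dicts mapping a clipped distance to a vector that is
    added to the key or the value of position j when attending from i.
    """
    n = len(q)
    d = len(q[0])
    outputs = []
    for i in range(n):
        scores = []
        for j in range(n):
            r = clip(j - i, max_dist)
            key = [a + b for a, b in zip(k[j], rel_k[r])]
            scores.append(dot(q[i], key) / math.sqrt(d))
        weights = softmax(scores)
        out = [0.0] * len(v[0])
        for j in range(n):
            r = clip(j - i, max_dist)
            value = [a + b for a, b in zip(v[j], rel_v[r])]
            out = [o + weights[j] * x for o, x in zip(out, value)]
        outputs.append(out)
    return outputs
```

Because the added vectors depend only on the clipped offset `j - i`, the same pattern of nearby tokens is scored the same way wherever it appears in the question, which is one plausible reason such a model suits value spans in free-form questions.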

Key words: Chinese Text-to-SQL task, industrial dataset, metadata, self-attention model, Bidirectional Encoder Representations from Transformers (BERT)

CLC Number: TP391.2

Jianqing LYU, Xianbing WANG, Gang CHEN, Hua ZHANG, Minggang WANG. Chinese Text-to-SQL model for industrial production[J]. Journal of Computer Applications, 2022, 42(10): 2996-3002.

吕剑清, 王先兵, 陈刚, 张华, 王明刚. 面向工业生产的中文Text-to-SQL模型[J]. 计算机应用, 2022, 42(10): 2996-3002.

References

1	WARREN D H D, PEREIRA F C N. An efficient easily adaptable system for interpreting natural language queries[J]. American Journal of Computational Linguistics, 1982, 8(3/4): 110-122.
2	ANDROUTSOPOULOS I, RITCHIE G D, THANISCH P. Natural language interfaces to databases: an introduction[J]. Natural Language Engineering, 1995, 1(1): 29-81. DOI: 10.1017/s135132490000005x
3	POPESCU A M, ARMANASU A, ETZIONI O, et al. Modern natural language interfaces to databases: composing statistical parsing with semantic tractability[C]// Proceedings of the 20th International Conference on Computational Linguistics. [S.l.]: COLING, 2004: 141-147. DOI: 10.3115/1220355.1220376
4	HALLET C. Generic querying of relational databases using natural language generation techniques[C]// Proceedings of the 4th International Natural Language Generation Conference. Stroudsburg, PA: Association for Computational Linguistics, 2006: 95-102. DOI: 10.3115/1706269.1706289
5	GIORDANI A, MOSCHITTI A. Generating SQL queries using natural language syntactic dependencies and metadata[C]// Proceedings of the 2012 International Conference on Applications of Natural Language Processing to Information Systems, LNCS 7337. Berlin: Springer, 2012: 164-170.
6	ZHONG V, XIONG C M, SOCHER R. Seq2SQL: generating structured queries from natural language using reinforcement learning[EB/OL]. (2017-11-09) [2021-06-20]. DOI: 10.48550/arXiv.1709.00103
7	YU T, ZHANG R, YANG K, et al. Spider: a large-scale human-labeled dataset for complex and cross-domain semantic parsing and Text-to-SQL task[C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2018: 3911-3921. DOI: 10.18653/v1/d18-1425
8	HE P C, MAO Y, CHAKRABARTI K, et al. X-SQL: reinforce schema representation with context[EB/OL]. (2019-08-21) [2021-06-20].
9	ZHANG X Y, YIN F J, MA G J, et al. M-SQL: multi-task representation learning for single-table Text2SQL generation[J]. IEEE Access, 2020, 8: 43156-43167. DOI: 10.1109/access.2020.2977613
10	YU T, YASUNAGA M, YANG K, et al. SyntaxSQLNet: syntax tree networks for complex and cross-domain text-to-SQL task[C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2018: 1653-1663. DOI: 10.18653/v1/d18-1193
11	GUO J Q, ZHAN Z C, GAO Y, et al. Towards complex Text-to-SQL in cross-domain database with intermediate representation[C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2019: 4524-4535. DOI: 10.18653/v1/p19-1444
12	YU T, LI Z F, ZHANG Z L, et al. TypeSQL: knowledge-based type-aware neural Text-to-SQL generation[C]// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: Association for Computational Linguistics, 2018: 588-594. DOI: 10.18653/v1/n18-2093
13	WANG C L, TATWAWADI K, BROCKSCHMIDT M, et al. Robust text-to-SQL generation with execution-guided decoding[EB/OL]. (2018-09-13) [2021-06-20].
14	FINEGAN-DOLLAK C, KUMMERFELD J K, ZHANG L, et al. Improving text-to-SQL evaluation methodology[C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA: Association for Computational Linguistics, 2018: 351-360. DOI: 10.18653/v1/p18-1033
15	WANG B L, SHIN R, LIU X D, et al. RAT-SQL: relation-aware schema encoding and linking for text-to-SQL parsers[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2020: 7567-7578. DOI: 10.18653/v1/2020.acl-main.677
16	MIN Q K, SHI Y F, ZHANG Y. A pilot study for Chinese SQL semantic parsing[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2019: 3652-3658. DOI: 10.18653/v1/d19-1377
17	WANG L J, ZHANG A, WU K, et al. DuSQL: a large-scale and pragmatic Chinese text-to-SQL dataset[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA: Association for Computational Linguistics, 2020: 6923-6935. DOI: 10.18653/v1/2020.emnlp-main.562
18	LIN X V, SOCHER R, XIONG C M. Bridging textual and tabular data for cross-domain text-to-SQL semantic parsing[C]// Proceedings of the 2020 Findings of the Association for Computational Linguistics: EMNLP 2020. Stroudsburg, PA: Association for Computational Linguistics, 2020: 4870-4888. DOI: 10.18653/v1/2020.findings-emnlp.438
19	VINYALS O, FORTUNATO M, JAITLY N. Pointer networks[C]// Proceedings of the 28th International Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2015: 2692-2700.
20	XU X J, LIU C, SONG D. SQLNet: generating structured queries from natural language without reinforcement learning[EB/OL]. (2017-11-13) [2021-06-20].
21	PRICE P J. Evaluation of spoken language systems: the ATIS domain[C]// Proceedings of the 1990 Workshop on Speech and Natural Language. San Francisco: Morgan Kaufmann Publishers Inc., 1990: 91-95. DOI: 10.3115/116580.116612
22	ZELLE J M, MOONEY R J. Learning to parse database queries using inductive logic programming[C]// Proceedings of the 13th National Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press, 1996: 1050-1055. DOI: 10.1007/3-540-60925-3_59
23	ZHANG S L, WANG Y J, JI D H. Temporal phrases extraction in clinical text based on bidirectional long-short term memory model[J]. Application Research of Computers, 2020, 37(4): 1059-1062. (in Chinese) DOI: 10.19734/j.issn.1001-3695.2018.09.0742
24	SHAW P, USZKOREIT J, VASWANI A. Self-attention with relative position representations[C]// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Stroudsburg, PA: Association for Computational Linguistics, 2018: 464-468. DOI: 10.18653/v1/n18-2074
25	VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc., 2017: 6000-6010.

Dataset	Questions (Q)	Databases (DB)	Domains	Tables per DB
ATIS	5,280	1	1	32.0
GeoQuery	877	1	1	6.0
Spider	10,181	200	138	5.1
CSpider	9,691	166	-	5.3
DuSQL	23,797	200	-	4.1
Dataset in this paper	2,836	1	1	27.0

Model	Dataset in this paper (exact match, %)	Spider (exact match, %)
SyntaxSQLNet^[10]	7.4	19.7
IRNet^[11]	25.7	46.7
IRNet+BERT	29.4	54.7
RAT-SQL^[15]	30.2	57.2
RAT-SQL+BERT	36.8	65.6
Proposed model	74.2	-
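
Exact match accuracy, the metric reported in the abstract, is the share of questions whose predicted SQL matches the gold SQL. A minimal sketch follows, assuming a simplified string-level comparison after whitespace and case normalization; the Spider benchmark's official evaluator instead compares parsed clauses, so this is only an approximation. The function names and any queries passed in are hypothetical.

```python
def normalize_sql(sql):
    """Collapse whitespace, drop a trailing semicolon, and lowercase.

    Lowercasing also affects string literals, which is a simplification
    of this sketch and not how a clause-level evaluator behaves.
    """
    return " ".join(sql.strip().rstrip(";").split()).lower()


def exact_match_accuracy(predicted, gold):
    """Return the percentage of predictions equal to the gold query."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(
        normalize_sql(p) == normalize_sql(g) for p, g in zip(predicted, gold)
    )
    return 100.0 * hits / len(gold)
```

A string-level check like this is stricter than a clause-level one in some cases (for example, reordered SELECT columns count as a miss), so it tends to understate accuracy relative to the official metric.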

Technique	Exact match accuracy (%)
+ Metadata (synonymization only)	39.4
+ Metadata	66.8
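
The synonymization-only metadata row in the table above suggests that one use of factory metadata is mapping the wording of a question onto the database's own column names. A minimal sketch of such a step follows, assuming the metadata is available as a dictionary from canonical column names to aliases. The function name `synonymize`, the dictionary layout, and any column names or questions passed in are hypothetical and not taken from the paper's dataset.

```python
import re


def synonymize(question, metadata):
    """Rewrite aliases in a question to canonical column names.

    metadata: dict mapping a canonical column name to a list of aliases
    (a hypothetical layout chosen for this sketch).
    """
    mapping = {}
    for column, aliases in metadata.items():
        # Canonical names map to themselves so that an alias contained
        # inside a canonical name is not rewritten a second time.
        mapping[column] = column
        for alias in aliases:
            mapping[alias] = column
    if not mapping:
        return question
    # Longest terms first, so the longest match wins at each position.
    ordered = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in ordered))
    # A single pass: replaced text is never scanned again.
    return pattern.sub(lambda m: mapping[m.group(0)], question)
```

For example, with a hypothetical entry that lists "槽压" as an alias of the column "槽电压", a question containing "槽压" is rewritten to use "槽电压", so it lines up with the schema before the model sees it.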
