Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (10): 2996-3002.DOI: 10.11772/j.issn.1001-9081.2021081525

• Artificial intelligence •

Chinese Text-to-SQL model for industrial production

Jianqing LYU1, Xianbing WANG1, Gang CHEN1, Hua ZHANG2, Minggang WANG3   

  1. Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education (Wuhan University), Wuhan Hubei 430072, China
    2. School of Computer Science, Wuhan University, Wuhan Hubei 430072, China
    3. Zunyi Aluminum Industry Company Limited, Zunyi Guizhou 563100, China
  • Received: 2021-08-27 Revised: 2021-11-20 Accepted: 2021-11-24 Online: 2022-01-07 Published: 2022-10-10
  • Contact: Xianbing WANG
  • About author: LYU Jianqing, born in 1998, M.S. candidate. His research interests include big data and artificial intelligence.
    WANG Xianbing, born in 1972, Ph.D., associate professor. His research interests include data mining and analysis, and computer vision.
    CHEN Gang, born in 1970, Ph.D., professor. His research interests include cyber security and artificial intelligence.
    ZHANG Hua, born in 1973, Ph.D., lecturer. His research interests include big data mining, management, and analysis.
    WANG Minggang, born in 1979, M.S., senior engineer. His research interests include intelligent manufacturing for aluminum smelting.
  • Supported by:
    National Natural Science Foundation of China (51977155)

面向工业生产的中文Text-to-SQL模型

吕剑清1, 王先兵1, 陈刚1, 张华2, 王明刚3   

  1. 空天信息安全与可信计算教育部重点实验室(武汉大学), 武汉 430072
    2.武汉大学 计算机学院, 武汉 430072
    3.遵义铝业股份有限公司, 贵州 遵义 563100
  • 通讯作者: 王先兵
  • 作者简介:第一联系人:吕剑清(1998—),男,湖北黄冈人,硕士研究生,主要研究方向:大数据、人工智能
    王先兵(1972—),男,湖北江陵人,副教授,博士,主要研究方向:数据挖掘与分析、计算机视觉; xbwang@whu.edu.cn
    陈刚(1970—),男,湖北武汉人,教授,博士生导师,博士,主要研究方向:网络安全、人工智能
    张华(1973—),男,湖北仙桃人,讲师,博士,主要研究方向:大数据挖掘管理与分析
    王明刚(1979—),男,贵州贵阳人,高级工程师,硕士,主要研究方向:铝冶炼智能制造。
  • 基金资助:
    国家自然科学基金资助项目(51977155)

Abstract:

When a model that translates English natural language questions into Structured Query Language (SQL) statements (Text-to-SQL) is migrated to a Chinese industrial Text-to-SQL task, its exact match accuracy drops. Because industrial datasets are poorly interpretable and scattered, table and column names in the database are often represented differently from the key information in the questions, and the column names a question refers to are often only implied by its semantics. To address these migration problems, corresponding solutions were proposed and a modified model was constructed. Firstly, factory metadata was incorporated during data use to resolve both the inconsistency in representation and the implicit column names. Then, in view of the characteristics of Chinese expression, a self-attention model based on relative position was used to identify the values of the WHERE clause directly from the question and the database schema information. Finally, in view of the characteristics of what industrial questions query, a fine-tuned Bidirectional Encoder Representation from Transformers (BERT) model was used to classify the questions, improving the accuracy of SQL statement structure prediction. An industrial dataset based on the aluminum smelting industry was constructed, and the model was verified experimentally on it. The results show that the exact match accuracy of the proposed model on the industrial test set is 74.2%. Comparison with the results of mainstream models from successive stages on the English dataset Spider indicates that the proposed model can effectively handle the Chinese industrial Text-to-SQL task.

Key words: Chinese Text-to-SQL task, industrial dataset, metadata, self-attention model, Bidirectional Encoder Representation from Transformers (BERT)
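
The abstract reports results as exact match accuracy. As a rough illustration only, the sketch below computes a simplified version of that metric by comparing normalized SQL strings. This is an assumption made for illustration: the abstract does not state how matches were scored, and evaluations on Spider typically compare SQL components rather than raw strings. The function names and the sample queries (including the table and column names) are hypothetical.

```python
import re


def normalize_sql(sql: str) -> str:
    """Collapse whitespace, drop a trailing semicolon, and lowercase.

    Simplification: lowercasing also affects string literals, which a
    stricter comparison would keep case-sensitive.
    """
    sql = sql.strip().rstrip(";")
    return re.sub(r"\s+", " ", sql).strip().lower()


def exact_match_accuracy(predictions, references):
    """Fraction of predicted queries equal to the reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        normalize_sql(pred) == normalize_sql(ref)
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)


# Hypothetical queries; the table and column names are invented for illustration.
predicted = ["SELECT  output FROM cell_daily ;", "select temp from pot"]
gold = ["select output from cell_daily", "SELECT voltage FROM pot"]
accuracy = exact_match_accuracy(predicted, gold)
```

Under this scoring, a query counts as correct only if it matches the reference in full, so one wrong column name or WHERE value makes the whole prediction a miss.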

摘要:

英文自然语言查询转SQL语句(Text-to-SQL)任务的模型迁移到中文工业Text-to-SQL任务时,由于工业数据集的可解释差且比较分散,会出现数据库的表名列名等信息与问句中关键信息的表示形式不一致以及问句中的列名隐含在语义中等问题导致模型精确匹配率变低。针对迁移过程中出现的问题,提出了对应的解决方法并构建修改后的模型。首先,在数据使用过程中融入工厂元数据信息以解决表示形式不一致以及列名隐含在语义中的问题;然后,根据中文语言表达方式的特性,使用基于相对位置的自注意力模型直接通过问句以及数据库模式信息识别出where子句的value值;最后,根据工业问句查询内容的特性,使用微调后的基于变换器的双向编码器表示技术(BERT)对问句进行分类以提高模型对SQL语句结构预测的准确率。构建了一个基于铝冶炼行业的工业数据集,并在该数据集上进行实验验证。结果表明所提模型在工业测试集上的精确匹配率为74.2%,对比英文数据集Spider上各阶段主流模型的效果后可以看出,所提模型能有效处理中文工业Text-to-SQL任务。

关键词: 中文Text-to-SQL任务, 工业数据集, 元数据, 自注意力模型, 基于变换器的双向编码器表示技术

CLC Number: