Chinese natural language interface based on paraphrasing

ZHANG Junchi1, HU Jie1, LIU Mengchi2   

  1. School of Computer and Information Engineering, Hubei University, Wuhan Hubei 430062, China;
    2. State Key Laboratory of Software Engineering (Wuhan University), Wuhan Hubei 430072, China
    This work is partially supported by the National Natural Science Foundation of China (61202100).


张俊驰1, 胡婕1, 刘梦赤2   

  1. 湖北大学 计算机与信息工程学院, 武汉 430062;
    2. 软件工程国家重点实验室(武汉大学), 武汉 430072
Abstract: Traditional Natural Language Interface of Database (NLIDB) systems based on syntactic parsing recognize user semantics with limited accuracy and require a large manually annotated training corpus. To address these problems, a novel method for implementing a Chinese NLIDB based on paraphrasing was proposed. First, the words denoting database entities were extracted from the user's statement, and a set of candidate trees together with their formalized natural language expressions was generated. Then, a paraphrase classifier trained on an Internet question-and-answer corpus selected the expression semantically closest to the user's statement. Finally, the corresponding candidate tree was translated into a Structured Query Language (SQL) statement. The proposed method achieved F1 scores of 83.4% and 90% on the Chinese U.S. geography question data set (GeoQueries880) and the restaurant question data set (RestQueries250), respectively, outperforming the syntactic parsing based method on both. The comparative experimental results show that an NLIDB based on paraphrasing handles the semantic gap between users and databases better.

Key words: Natural Language Interface of Database (NLIDB), word vector, paraphrase, natural language expression, machine learning

摘要: 针对传统以句法分析为主的数据库自然语言接口系统识别用户语义准确率不高,且需要大量人工标注训练语料的问题,提出了一种基于复述的中文自然语言接口(NLIDB)实现方法。首先提取用户语句中表征数据库实体词,建立候选树集及对应的形式化自然语言表达;其次由网络问答语料训练得到的复述分类器筛选出语义最相近的表达;最后将相应的候选树转换为结构化查询语句(SQL)。实验表明该方法在美国地理问答语料(GeoQueries880)、餐饮问答语料(RestQueries250)上的F1值分别达到83.4%、90%,均优于句法分析方法。通过对比实验结果发现基于复述方法的数据库自然语言接口系统能更好地处理用户与数据库的语义鸿沟问题。

关键词: 数据库自然语言接口, 词向量, 复述, 自然语言表达, 机器学习

