Journal of Computer Applications ›› 2016, Vol. 36 ›› Issue (5): 1290-1295. DOI: 10.11772/j.issn.1001-9081.2016.05.1290

• Artificial Intelligence •

Chinese natural language interface based on paraphrasing

ZHANG Junchi¹, HU Jie¹, LIU Mengchi²

  1. School of Computer and Information Engineering, Hubei University, Wuhan, Hubei 430062, China;
  2. State Key Laboratory of Software Engineering (Wuhan University), Wuhan, Hubei 430072, China
  • Received: 2015-10-15 Revised: 2015-12-08 Online: 2016-05-10 Published: 2016-05-09
  • Corresponding author: HU Jie
  • About the authors: ZHANG Junchi (born 1990), male, from Wuhan, Hubei, M.S. candidate; research interests: machine learning, natural language processing. HU Jie (born 1977), female, from Hanchuan, Hubei, associate professor, Ph.D.; research interests: databases and their reasoning, semantic databases, complex data management. LIU Mengchi (born 1962), male, from Wuhan, Hubei, professor, doctoral supervisor, Ph.D.; research interests: databases and their reasoning, semantic databases, complex data management.
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61202100).

Abstract: Traditional Natural Language Interfaces of DataBase (NLIDB) that rely mainly on syntactic parsing recognize user semantics with low accuracy and require a large manually annotated training corpus. To address these problems, a paraphrase-based method for implementing a Chinese NLIDB was proposed. First, the words in the user's statement that denote database entities were extracted, and a set of candidate trees was built together with a formalized natural language expression for each tree. Second, a paraphrase classifier trained on an Internet question-answering corpus selected the expression semantically closest to the user's statement. Finally, the corresponding candidate tree was translated into a Structured Query Language (SQL) statement. Experiments showed that the method achieved F1 scores of 83.4% and 90% on the US geography question-answering corpus (GeoQueries880) and the restaurant question-answering corpus (RestQueries250), respectively, both higher than those of the syntactic-parsing-based method. The comparison indicates that a paraphrase-based NLIDB handles the semantic gap between users and databases better.

Key words: Natural Language Interface of DataBase (NLIDB), word vector, paraphrase, natural language expression, machine learning
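The three steps named in the abstract (build candidate expressions, pick the closest paraphrase, emit SQL) can be illustrated with a minimal sketch. This is not the authors' implementation: the trained paraphrase classifier is replaced here by a plain bag-of-words cosine similarity, the examples are in English rather than Chinese, and the `state` table, the candidate list, and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical candidates: each pairs a formalized natural language
# expression with the SQL its candidate tree would translate to.
CANDIDATES = [
    ("what is the capital of state texas",
     "SELECT capital FROM state WHERE name = 'texas'"),
    ("what is the population of state texas",
     "SELECT population FROM state WHERE name = 'texas'"),
    ("what is the area of state texas",
     "SELECT area FROM state WHERE name = 'texas'"),
]


def bag_of_words(text):
    """Count whitespace-separated, lowercased tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[w] * b[w] for w in a)  # Counter yields 0 for missing tokens
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def select_sql(question, candidates):
    """Return the SQL of the candidate expression most similar to the question.

    Assumption: similarity scoring stands in for the paraphrase classifier.
    """
    q = bag_of_words(question)
    best = max(candidates, key=lambda c: cosine(q, bag_of_words(c[0])))
    return best[1]


if __name__ == "__main__":
    print(select_sql("tell me the population of texas", CANDIDATES))
```

In this toy setting the question shares the token "population" only with the second candidate, so that candidate's SQL is selected. A real system would score candidates with a learned model over word vectors rather than raw token overlap.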

CLC Number: