Journal of Computer Applications, 2020, Vol. 40, Issue (11): 3192-3197. DOI: 10.11772/j.issn.1001-9081.2020040473

• Data science and technology •

Query extension based on deep semantic information

LIU Gaojun, FANG Xiao, DUAN Jianyong   

  1. College of Information Science, North China University of Technology, Beijing 100144, China
  • Received: 2020-04-17; Revised: 2020-06-26; Online: 2020-11-10; Published: 2020-07-09
  • Supported by:
    This work is partially supported by the National Natural Science Foundation of China (61972003) and the Foundation of CNONIX National Standard Application and Promotion Lab (4020548420G3).

基于深度语义信息的查询扩展

刘高军, 方晓, 段建勇   

  1. 北方工业大学 信息学院, 北京 100144
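  • Corresponding author: DUAN Jianyong (born 1978), male, from Wenshui, Shanxi; professor, Ph.D., CCF member. Main research interests: natural language processing, information retrieval. E-mail: duanjy@ncut.edu.cn
  • About the authors: LIU Gaojun (born 1962), male, from Changchun, Jilin; professor, M.S.; main research interests: data processing, software services. FANG Xiao (born 1995), female, from Weifang, Shandong; M.S. candidate; main research interests: natural language processing, information retrieval.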

Abstract: Search engines have come into wide use in the Internet era. For unpopular data, however, a search engine often fails to retrieve the required documents because the user's search terms cover too narrow a range. In such cases, a query extension system can help the search engine provide reliable service. Building on query extension by global document analysis, a semantic relevance model was proposed that combines a neural network model with corpora containing semantic information, in order to extract deeper semantic information between words. This deep semantic information gives the query extension system more comprehensive and effective features for analyzing the extensible relationships between words. Local extensible word distributions were extracted from semantic data such as a synonym thesaurus and the sememe annotations of the language knowledge base "HowNet", and the deep mining ability of the neural network model was then used to fit the local extensible word distribution of each word in the corpus space to a global extensible word distribution. In comparison experiments against query extension methods based on a language model and on the thesaurus, the method based on the semantic relevance model achieved higher query extension efficiency; for unpopular search data in particular, its recall was 11.1 and 5.29 percentage points higher than that of the two comparison methods, respectively.

Key words: query extension, semantic relevance, deep learning, global document analysis, language model
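
To make the evaluation setting concrete, the following minimal sketch shows thesaurus-based query extension (one of the comparison baselines named in the abstract) and how a recall difference in percentage points can be computed. It is not the authors' semantic relevance model: the thesaurus entries, the document collection, the relevance judgments, and the helper names `expand_query`, `retrieve`, and `recall` are all invented for illustration.

```python
# Toy thesaurus; entries are invented for illustration only.
THESAURUS = {
    "car": {"automobile", "vehicle"},
    "repair": {"fix", "maintenance"},
}

# Toy document collection: document id -> set of index terms (hypothetical).
DOCS = {
    "d1": {"car", "repair", "manual"},
    "d2": {"automobile", "maintenance", "guide"},
    "d3": {"vehicle", "fix", "tips"},
    "d4": {"cooking", "recipes"},
}


def expand_query(terms, thesaurus):
    """Return the original terms plus every thesaurus neighbour of each term."""
    expanded = set(terms)
    for term in terms:
        expanded |= thesaurus.get(term, set())
    return expanded


def retrieve(terms, docs):
    """Return ids of documents that share at least one term with the query."""
    query = set(terms)
    return {doc_id for doc_id, words in docs.items() if words & query}


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


query = ["car", "repair"]
relevant = {"d1", "d2", "d3"}  # assumed relevance judgments

baseline = recall(retrieve(query, DOCS), relevant)
expanded = recall(retrieve(expand_query(query, THESAURUS), DOCS), relevant)

# Difference between two recall values, expressed in percentage points.
gain_pp = (expanded - baseline) * 100
print(f"baseline={baseline:.3f} expanded={expanded:.3f} gain={gain_pp:.1f} pp")
```

In this toy setting the unexpanded query matches only the document that uses the user's exact words, which is the narrow-search-term problem the abstract describes; the extended query also reaches the documents phrased with synonyms.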

