Semantic extraction of domain-dependent mathematical text

  

  • Received: 2021-06-02  Revised: 2021-07-27  Online: 2021-07-27


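Chen Xiaoyu (陈肖宇)1, Wang Wei (王伟)2

  1. State Key Laboratory of Software Development Environment, School of Computer Science, Beihang University
  2. Beihang University
  • Corresponding author: Wang Wei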

Abstract: To address the insufficient acquisition of semantic information from scientific and technical documents, rule-based methods for extracting semantics from domain-dependent mathematical text are proposed. First, domain concepts are extracted from the text, and semantic links between mathematical entities and domain concepts are established. Second, context analysis of each mathematical symbol detects its entity mentions or corresponding text descriptions, from which its semantics are extracted. Finally, the semantics of expressions are derived from the extracted semantics of the mathematical symbols they contain. Taking linear algebra text as a case study, a semantically annotated dataset was constructed. Experimental results show that the proposed methods for extracting the semantics of identifiers, linear algebra entities, and expressions achieve precision above 93% and recall above 91%.

Key words: semantic extraction, entity mention, context analysis, mathematical language processing, mathematical text understanding
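
The three steps summarized in the abstract (concept lookup, context analysis of symbols, expression semantics) can be illustrated with a minimal Python sketch. It is not the authors' rule set: the concept lexicon `CONCEPTS`, the two regular-expression rules in `RULES`, and the helper names `extract_symbol_semantics` and `expression_semantics` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical mini-lexicon of linear algebra concepts
# (an assumption, not the lexicon used in the paper).
CONCEPTS = ["identity matrix", "matrix", "vector", "scalar", "eigenvalue"]

# Try longer concepts first so "identity matrix" wins over "matrix".
_CONCEPT_ALT = "|".join(sorted(map(re.escape, CONCEPTS), key=len, reverse=True))

# Two illustrative context rules (assumptions):
#   1. apposition: "the matrix A", "a vector x"
#   2. definition: "A is a matrix", "let x be a vector"
RULES = [
    re.compile(rf"\b(?:[Tt]he|[Aa]n?)\s+(?P<concept>{_CONCEPT_ALT})\s+(?P<sym>[A-Za-z])\b"),
    re.compile(rf"\b(?P<sym>[A-Za-z])\s+(?:is|be)\s+(?:the|an?)\s+(?P<concept>{_CONCEPT_ALT})\b"),
]


def extract_symbol_semantics(text):
    """Map each single-letter identifier to the concept found in its context."""
    found = {}
    for rule in RULES:
        for match in rule.finditer(text):
            # Keep the first concept seen for a symbol.
            found.setdefault(match.group("sym"), match.group("concept"))
    return found


def expression_semantics(expr, symbols):
    """Describe a binary product 'X*Y' from already extracted symbol semantics."""
    left, right = (part.strip() for part in expr.split("*"))
    return (
        f"product of {symbols.get(left, 'unknown')} {left} "
        f"and {symbols.get(right, 'unknown')} {right}"
    )


text = "Let A be a matrix and let x be a vector. The scalar c scales the matrix A."
symbols = extract_symbol_semantics(text)
description = expression_semantics("A*x", symbols)
```

Here `symbols` maps `A`, `x`, and `c` to their concepts, and `description` combines the first two into a reading of `A*x`. A real system would need a far richer lexicon, proper formula parsing, and conflict resolution when rules disagree.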

