Journal of Computer Applications, 2025, Vol. 45, Issue (1): 40-47. DOI: 10.11772/j.issn.1001-9081.2023111699

• Artificial intelligence •

HTLR: named entity recognition framework with hierarchical fusion of multi-knowledge

Xueqiang LYU1, Tao WANG1, Xindong YOU1, Ge XU2

  1. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research (Beijing Information Science and Technology University), Beijing 100101, China
  2. Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University), Fuzhou, Fujian 350108, China
  • Received: 2023-12-06 Revised: 2024-05-11 Accepted: 2024-05-20 Online: 2024-07-25 Published: 2025-01-10
  • Contact: Xindong YOU
  • About the authors: LYU Xueqiang, born in 1970 in Fushun, Liaoning, Ph. D., professor, CCF senior member. His research interests include natural language processing.
    WANG Tao, born in 1998 in Langfang, Hebei, M. S. candidate. His research interests include knowledge graphs.
    XU Ge, born in 1978 in Chun'an, Zhejiang, Ph. D., professor. His research interests include natural language processing.
  • Supported by:
    National Natural Science Foundation of China (62171043); Natural Science Foundation of Beijing (4212020); Huaneng Group Headquarters Technology Project (HNKJ21-HF43); Central Leading Local Project (2020L3024); Research and Development Program of Beijing Municipal Education Commission (KM202111232001)

Abstract:

Chinese Named Entity Recognition (NER) aims to extract entities from unstructured text and assign them to predefined entity categories. To address the insufficient semantic learning that most Chinese NER methods suffer from when contextual information is lacking, an NER framework with hierarchical fusion of multi-knowledge, named HTLR (Chinese NER method based on Hierarchical Transformer fusing Lexicon and Radical), was proposed; it uses hierarchically fused multi-knowledge to help the model learn richer and more comprehensive contextual and semantic information. Firstly, the potential words contained in the corpus were identified and vectorized by using a publicly available Chinese lexicon and word vector table, and the semantic relationships between words and their related characters were modeled through an optimized position encoding to learn Chinese lexical knowledge. Secondly, the corpus was converted into coding sequences representing glyph information by using the radical-based character codes provided by the Han Dian website, and an RFE-CNN (Radical Feature Extraction-Convolutional Neural Network) model was proposed to extract radical information. Finally, the Hierarchical Transformer model was proposed, in which the lower-level modules separately learn the semantic relationships between characters and words and between characters and radicals, and the higher-level modules further fuse the multi-knowledge of characters, words, and radicals, helping the model acquire semantically richer character representations. Experimental results on the public datasets Weibo, Resume, MSRA, and OntoNotes4.0 show that the F1 values of the proposed method are 9.43, 0.75, 1.76, and 6.45 percentage points higher, respectively, than those of the mainstream method NFLAT (Non-Flat-LAttice Transformer for Chinese named entity recognition), the best results in this comparison. These results indicate that multi-semantic knowledge, hierarchical fusion, the RFE-CNN structure, and the Hierarchical Transformer structure are effective for learning rich semantic knowledge and improving model performance.
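
The first step of the abstract, identifying the words contained in the corpus, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `match_lexicon` and `span_distances`, the toy lexicon, and the `max_len` cutoff are all assumptions made for illustration, and the head/tail distances follow the general style of lattice-based NER models rather than the optimized position encoding of HTLR, which the abstract does not specify.

```python
from typing import List, Set, Tuple


def match_lexicon(sentence: str, lexicon: Set[str],
                  max_len: int = 4) -> List[Tuple[str, int, int]]:
    """Return (word, head, tail) for every lexicon word found in the sentence.

    head and tail are inclusive character indices. Only multi-character
    words of length 2..max_len are considered (an illustrative choice).
    """
    matches = []
    for head in range(len(sentence)):
        # Try every candidate span that starts at `head`.
        for tail in range(head + 1, min(head + max_len, len(sentence))):
            word = sentence[head:tail + 1]
            if word in lexicon:
                matches.append((word, head, tail))
    return matches


def span_distances(char_idx: int, head: int, tail: int) -> Tuple[int, int]:
    """Signed distances from a character to the head and tail of a word span.

    Hypothetical helper: a relative position encoding could be built on
    such distances, but the exact form used by HTLR is not given here.
    """
    return (char_idx - head, char_idx - tail)


# Toy lexicon for illustration only (not the public lexicon used in the paper).
toy_lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥"}
matches = match_lexicon("南京市长江大桥", toy_lexicon)
for word, head, tail in matches:
    print(word, head, tail)
```

Each matched word keeps its head and tail indices, so overlapping candidates such as 南京市 and 市长 can both be passed to the model, which then has to learn from context which segmentation is the relevant one.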

Key words: Named Entity Recognition (NER), Natural Language Processing (NLP), knowledge graph construction, lexicon enhancement, radical enhancement
