Journal of Computer Applications ›› 2023, Vol. 43 ›› Issue (9): 2707-2714. DOI: 10.11772/j.issn.1001-9081.2022091407

• 10th CCF Conference on Big Data (2022)

融合局部语义特征的学者细粒度信息提取方法

田悦霖1,2, 黄瑞章1,2, 任丽娜1,2

  1. 公共大数据国家重点实验室(贵州大学),贵阳 550025
  2. 贵州大学 计算机科学与技术学院,贵阳 550025
  • 收稿日期:2022-09-06 修回日期:2022-10-27 接受日期:2022-11-07 发布日期:2023-09-10 出版日期:2023-09-10
  • 通讯作者: 黄瑞章
  • 作者简介:田悦霖(1997—),女,河北深州人,硕士研究生,CCF会员,主要研究方向:自然语言处理、文本挖掘、机器学习
    任丽娜(1987—),女,辽宁阜新人,博士研究生,主要研究方向:自然语言处理、文本挖掘、机器学习。
  • 基金资助:
    国家自然科学基金资助项目(62066007)

Scholar fine-grained information extraction method fused with local semantic features

Yuelin TIAN1,2, Ruizhang HUANG1,2, Lina REN1,2

  1. State Key Laboratory of Public Big Data (Guizhou University), Guiyang Guizhou 550025, China
  2. College of Computer Science and Technology, Guizhou University, Guiyang Guizhou 550025, China
  • Received: 2022-09-06 Revised: 2022-10-27 Accepted: 2022-11-07 Online: 2023-09-10 Published: 2023-09-10
  • Contact: Ruizhang HUANG
  • About the authors: TIAN Yuelin, born in 1997, M.S. candidate. Her research interests include natural language processing, text mining, and machine learning.
    REN Lina, born in 1987, Ph.D. candidate. Her research interests include natural language processing, text mining, and machine learning.
  • Supported by:
    National Natural Science Foundation of China (62066007)

摘要:

从学者主页中提取的学者细粒度信息(如学者研究方向、教育经历等)在大规模专业人才库的创建等方面具有非常重要的应用价值。针对现有学者细粒度信息提取方法无法有效利用上下文语义联系的问题,提出一种融合局部语义特征的学者信息提取方法,利用局部范围文本的语义联系对学者主页进行细粒度信息抽取。首先,通过全词掩码中文预训练模型RoBERTa-wwm-ext学习通用语义表征;之后将通用语义表征中的目标句表征向量与局部相邻文本表征向量共同输入卷积神经网络(CNN)实现局部语义融合,从而获得更高维度的目标句表征向量;最终将目标句表征向量从高维度空间映射到低维度标签空间实现学者主页细粒度信息的抽取。实验结果表明,使用此融合局部语义特征的方法进行学者细粒度信息提取的宏平均F1值达到93.43%,与未融合局部语义的RoBERTa-wwm-ext-TextCNN方法相比提高了8.60个百分点,验证了所提方法在学者细粒度信息提取任务上的有效性。

关键词: 学者信息提取, 预训练模型, 局部语义融合, TextCNN, 特征提取

Abstract:

Fine-grained scholar information extracted from scholar homepages, such as research directions and education experience, is valuable for applications such as building large-scale professional talent pools. To address the problem that existing fine-grained scholar information extraction methods cannot make effective use of contextual semantic associations, a scholar information extraction method fusing local semantic features was proposed, which extracts fine-grained information from scholar homepages by using the semantic associations within a local range of text. Firstly, general semantic representations were learned with the whole-word-masking Chinese pre-trained model RoBERTa-wwm-ext. Then, the representation vector of the target sentence and the representation vectors of its locally adjacent text, both taken from the general semantic representations, were jointly fed into a Convolutional Neural Network (CNN) to perform local semantic fusion, yielding a higher-dimensional representation vector of the target sentence. Finally, this representation vector was mapped from the high-dimensional space to the low-dimensional label space to extract the fine-grained information from the scholar homepage. Experimental results show that the macro-average F1 score of the proposed method reaches 93.43%, which is 8.60 percentage points higher than that of the RoBERTa-wwm-ext-TextCNN method without local semantic fusion, verifying the effectiveness of the proposed method on the fine-grained scholar information extraction task.
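The three stages described in the abstract (sentence representation, convolution over the target sentence and its neighbours, mapping to the label space) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the toy 4-dimensional vectors stand in for RoBERTa-wwm-ext sentence representations, and the window radius, kernel size, channel count, label names, random untrained weights, and the function names `conv1d_relu`, `fuse_and_classify`, and `init_params` are all assumptions made for illustration, since the abstract does not specify them.

```python
import random

def conv1d_relu(window, kernels, bias):
    """Slide each kernel along the sentence axis of `window`, then apply ReLU.

    window:  list of T sentence vectors, each of length D
    kernels: list of C kernels, each a list of K weight vectors of length D
    returns: list of C feature maps, each of length T - K + 1
    """
    maps = []
    for kernel, b in zip(kernels, bias):
        k = len(kernel)
        fmap = []
        for start in range(len(window) - k + 1):
            s = b
            for offset in range(k):
                s += sum(w * x for w, x in zip(kernel[offset], window[start + offset]))
            fmap.append(max(0.0, s))
        maps.append(fmap)
    return maps

def fuse_and_classify(sent_vecs, index, params, radius=1):
    """Label sentence `index` from its own vector plus neighbours within `radius`."""
    dim = len(sent_vecs[0])
    pad = [0.0] * dim
    # Local window; positions that fall outside the page are zero-padded.
    window = [sent_vecs[i] if 0 <= i < len(sent_vecs) else pad
              for i in range(index - radius, index + radius + 1)]
    maps = conv1d_relu(window, params["kernels"], params["conv_bias"])
    fused = [max(m) for m in maps]  # max pooling: one fused feature per kernel
    # Linear map from the fused representation to the label space.
    logits = [sum(w * f for w, f in zip(row, fused)) + b
              for row, b in zip(params["out_weights"], params["out_bias"])]
    label_id = max(range(len(logits)), key=logits.__getitem__)
    return label_id, fused, logits

def init_params(dim, kernel_size, channels, num_labels, seed=0):
    """Random, untrained weights (assumption: a real model would learn these)."""
    rng = random.Random(seed)
    u = lambda: rng.uniform(-0.5, 0.5)
    return {
        "kernels": [[[u() for _ in range(dim)] for _ in range(kernel_size)]
                    for _ in range(channels)],
        "conv_bias": [0.0] * channels,
        "out_weights": [[u() for _ in range(channels)] for _ in range(num_labels)],
        "out_bias": [0.0] * num_labels,
    }

# Hypothetical label set and toy sentence vectors for one homepage.
LABELS = ["research_direction", "education", "other"]
page = [[0.1, 0.3, -0.2, 0.5], [0.4, -0.1, 0.2, 0.0], [-0.3, 0.2, 0.1, 0.4]]
params = init_params(dim=4, kernel_size=2, channels=6, num_labels=len(LABELS))
label_id, fused, logits = fuse_and_classify(page, index=1, params=params)
print(LABELS[label_id], len(fused), len(logits))
```

Because the weights are random, the predicted label carries no meaning here; the sketch only shows how the neighbouring sentences enter the target sentence's representation before classification. Setting `radius=0` and `kernel_size=1` reduces it to classifying each sentence in isolation, which is the kind of setting a baseline without local semantic fusion corresponds to.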

Key words: scholar information extraction, pre-trained model, local semantic fusion, TextCNN, feature extraction

CLC number: