Journal of Computer Applications ›› 2022, Vol. 42 ›› Issue (4): 1072-1078. DOI: 10.11772/j.issn.1001-9081.2021071278

• 36th CCF National Conference on Computer Applications (CCF NCCA 2021) •

Popular science text classification model enhanced by knowledge graph

Wangjing TANG1,2, Bin XU1, Meihan TONG1, Meihuan HAN1,3, Liming WANG4, Qi ZHONG4

  1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  2. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  3. Tsinghua Shenzhen International Graduate School, Shenzhen, Guangdong 518055, China
  4. China Research Institute for Science Popularization, Beijing 100081, China
  • Received: 2021-07-16 Revised: 2021-09-07 Accepted: 2021-09-07 Online: 2022-04-15 Published: 2022-04-10
  • Contact: Bin XU
  • About the authors: TANG Wangjing, born in 1996 in Haikou, Hainan, M.S. candidate, CCF member. His research interests include natural language processing and knowledge graphs.
    TONG Meihan, born in 1995 in Sanmenxia, Henan, Ph.D. candidate, CCF member. Her research interests include knowledge engineering and information extraction.
    HAN Meihuan, born in 1997 in Chifeng, Inner Mongolia, M.S. candidate. Her research interests include knowledge engineering and information extraction.
    WANG Liming, born in 1982 in Bayannur, Inner Mongolia, Ph.D., associate research fellow. His research interests include social network analysis methods for science communication.
    ZHONG Qi, born in 1970 in Taiyuan, Shanxi, M.S., research fellow. Her research interests include the evaluation of science communication in mass media.

Abstract:

Popular science text classification is the task of assigning popular science articles to the categories of a popular science classification system. Popular science articles often exceed 1,000 words, which makes it hard for a model to focus on key information and leads to poor classification performance in traditional models. To address this problem, a long text classification model that uses a knowledge graph to perform two-level screening was proposed, reducing the interference of topic-irrelevant information and improving classification performance. First, a four-step method was used to construct a knowledge graph for the popular science domain. Then, this knowledge graph was used as a distant supervisor to train a sentence filter that removes irrelevant information. Finally, an attention mechanism was used to further screen the information in the filtered sentence set, and an attention-based topic classification model was implemented. Experimental results on the constructed Popular Science Classification Dataset (PSCD) show that the knowledge-enhanced text classification model based on the domain knowledge graph achieves a higher F1-Score: compared with the TextCNN model and the BERT (Bidirectional Encoder Representations from Transformers) model, its F1-Score is higher by 2.88 percentage points and 1.88 percentage points respectively, verifying the effectiveness of the knowledge graph for information screening in long texts.

Key words: popular science text classification, knowledge graph, two-level screening, long text classification, attention
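
The two-level screening described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the substring-based entity matching, and the dot-product attention scoring are all assumptions made for illustration. In particular, the paper trains a sentence filter from the distant labels, whereas this sketch applies the weak labels directly as a stand-in for that trained filter.

```python
import math

def distant_labels(sentences, kg_entities):
    """Weak labels: 1 if a sentence mentions any knowledge-graph entity, else 0.
    Case-insensitive substring matching is an assumption of this sketch."""
    lowered = [e.lower() for e in kg_entities]
    return [int(any(e in s.lower() for e in lowered)) for s in sentences]

def first_level_filter(sentences, kg_entities):
    """First screening level: keep sentences with a positive weak label.
    (Stand-in for the trained sentence filter described in the abstract.)"""
    labels = distant_labels(sentences, kg_entities)
    return [s for s, y in zip(sentences, labels) if y == 1]

def softmax(scores):
    """Numerically stable softmax over a non-empty list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(sentence_vectors, query):
    """Second screening level: weight each sentence vector by its softmax-normalised
    dot product with a query vector, then return the weighted sum and the weights.
    Assumes a non-empty list of vectors with the same length as the query."""
    scores = [sum(a * b for a, b in zip(v, query)) for v in sentence_vectors]
    weights = softmax(scores)
    pooled = [
        sum(w * v[i] for w, v in zip(weights, sentence_vectors))
        for i in range(len(query))
    ]
    return pooled, weights
```

In a full model, the pooled vector would feed a classifier over the popular science categories, and both the sentence vectors and the query would be learned rather than supplied by hand.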

CLC Number: