Journal of Computer Applications, 2022, Vol. 42, Issue 4: 1072-1078. DOI: 10.11772/j.issn.1001-9081.2021071278

Special topic: The 36th CCF National Conference of Computer Applications (CCF NCCA 2021)




Popular science text classification model enhanced by knowledge graph

Wangjing TANG1,2, Bin XU1, Meihan TONG1, Meihuan HAN1,3, Liming WANG4, Qi ZHONG4

  1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  2. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  3. Tsinghua Shenzhen International Graduate School, Shenzhen, Guangdong 518055, China
  4. China Research Institute for Science Popularization, Beijing 100081, China
  • Received: 2021-07-16 Revised: 2021-09-07 Accepted: 2021-09-07 Online: 2022-04-15 Published: 2022-04-10
  • Contact: Bin XU
  • About the authors: TANG Wangjing, born in 1996, M.S. candidate, CCF member. His research interests include natural language processing and knowledge graphs.
    TONG Meihan, born in 1995, Ph.D. candidate. Her research interests include knowledge engineering and information extraction.
    HAN Meihuan, born in 1997, M.S. candidate. Her research interests include knowledge engineering and information extraction.
    WANG Liming, born in 1982, Ph.D., associate research fellow. His research interests include the application of social network analysis methods to science communication.
    ZHONG Qi, born in 1970, M.S., research fellow. Her research interests include the evaluation of science communication in mass media.





Popular science text classification assigns popular science articles to the categories of a popular science classification system. Popular science articles often exceed 1,000 words, which makes it hard for a model to focus on the key points and degrades the classification performance of traditional models. To address this problem, a long text classification model that uses a knowledge graph to perform two-level screening was proposed, reducing the interference of topic-irrelevant information and improving classification performance. First, a four-step method was used to construct a knowledge graph for popular science domains. Then, the knowledge graph served as a distant supervisor for training a sentence filter that removes irrelevant sentences. Finally, an attention mechanism further screened the information in the filtered sentence set, completing an attention-based topic classification model. Experimental results on the constructed Popular Science Classification Dataset (PSCD) show that the proposed model, enhanced with domain knowledge graph information, achieves a higher F1-Score: 2.88 percentage points above the TextCNN model and 1.88 percentage points above the BERT (Bidirectional Encoder Representations from Transformers) model. These results verify the effectiveness of the knowledge graph for screening information in long texts.

Key words: popular science text classification, knowledge graph, two-level screening, long text classification, attention
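
The two-level screening described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify the filter or attention architecture, so `TOY_KG`, `entity_hits`, `screen_sentences`, and `classify` are hypothetical names, plain entity matching stands in for the trained sentence filter, and a hit-count score stands in for learned attention parameters.

```python
import math
import re

# Hypothetical toy knowledge graph: domain -> entity surface forms.
# The real graph is built by the paper's four-step method, not shown here.
TOY_KG = {
    "astronomy": {"planet", "telescope", "orbit"},
    "health": {"vaccine", "virus", "immune"},
}
DOMAINS = sorted(TOY_KG)  # fixed order for vector positions


def entity_hits(sentence, kg=TOY_KG):
    """Count knowledge graph entity mentions per domain in one sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [sum(tok in kg[d] for tok in tokens) for d in DOMAINS]


def screen_sentences(sentences, kg=TOY_KG):
    """Level 1: keep sentences that mention at least one graph entity.

    Assumption: in the paper these weak (distantly supervised) labels train
    a sentence filter; here the labels are applied directly for brevity.
    """
    kept = [s for s in sentences if sum(entity_hits(s, kg)) > 0]
    return kept or list(sentences)  # fall back to the full text


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sentences, kg=TOY_KG):
    """Level 2: attention-weighted pooling over the kept sentences."""
    kept = screen_sentences(sentences, kg)
    vectors = [entity_hits(s, kg) for s in kept]
    # Untrained stand-in for a learned attention scorer: total hit count.
    weights = softmax([float(sum(v)) for v in vectors])
    pooled = [
        sum(w * v[i] for w, v in zip(weights, vectors))
        for i in range(len(DOMAINS))
    ]
    return DOMAINS[pooled.index(max(pooled))], kept, weights
```

Calling `classify` on a list of sentences returns the predicted domain, the sentences that survived the first screening level, and their attention weights. In a real system the filter and the attention scorer would both be learned from data rather than hand-set as above.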
