Journal of Computer Applications, 2024, Vol. 44, Issue 6: 1720-1726. DOI: 10.11772/j.issn.1001-9081.2023060845

• The 38th CCF National Conference of Computer Applications (CCF NCCA 2023) •

Keyword extraction method for scientific text based on improved TextRank

Dongju YANG (1, 2), Chengfu HU (1, 2)

  1. School of Information Science and Technology, North China University of Technology, Beijing 100144, China
  2. Beijing Key Laboratory on Integration and Analysis of Large-Scale Stream Data (North China University of Technology), Beijing 100144, China
  • Received: 2023-07-04 Revised: 2023-08-03 Accepted: 2023-08-07 Online: 2023-08-28 Published: 2024-06-10
  • Contact: Dongju YANG
  • About author: HU Chengfu, born in 1997 in Chenzhou, Hunan, M. S. candidate. His research interests include natural language processing.
  • Supported by:
    Guangzhou Science and Technology Plan Project (202206030009)

Abstract:

In keyword extraction from scientific text, words that occur rarely but express the main idea of the text well are extracted poorly. To address this problem, a keyword extraction method based on improved TextRank was proposed. Firstly, the Term Frequency-Inverse Document Frequency (TF-IDF) statistical features and positional features of words were used to optimize the transition probability matrix between words in the co-occurrence graph, and the initial scores of the words were obtained by iterative computation. Then, the K-Core decomposition algorithm was used to mine K-Core subgraphs and obtain the hierarchical features of the words, and the average information entropy feature was used to measure how well each word represents the topic. Finally, the hierarchical feature and the average information entropy feature were fused with the initial scores to determine the keywords. Experimental results show that, on the public dataset, the proposed method improves the mean F1 over different numbers of extracted keywords by 6.5 and 3.3 percentage points compared with the TextRank and OTextRank (Optimized TextRank) methods, respectively; on the science and technology service project dataset, the corresponding improvements are 7.4 and 3.2 percentage points, respectively. These results verify the effectiveness of the proposed method in extracting keywords that occur infrequently but express the main idea of the text well.

Key words: scientific text, keyword extraction, TextRank, K-Core graph, average information entropy
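The three stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so every function name, the co-occurrence window, the damping factor 0.85, the way feature weights bias the transition probabilities, and the fusion rule with its `alpha` and `beta` parameters are assumptions made for illustration. The per-word `weight` stands in for a combined TF-IDF and position feature, and the average information entropy is taken as a precomputed input because its definition is not stated here.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Link every pair of distinct words at most `window` positions apart."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return dict(graph)


def weighted_textrank(graph, weight, d=0.85, iters=50):
    """TextRank in which a word passes its score to each neighbour in
    proportion to the neighbour's feature weight (assumed: TF-IDF x position).
    With equal weights this reduces to plain unweighted TextRank."""
    out_total = {u: sum(weight[v] for v in graph[u]) for u in graph}
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        score = {
            w: (1 - d) + d * sum(score[u] * weight[w] / out_total[u]
                                 for u in graph[w])
            for w in graph
        }
    return score


def core_numbers(graph):
    """K-Core decomposition: repeatedly peel the node of minimum remaining
    degree; a node's core number is the largest such degree seen so far."""
    degree = {w: len(nbrs) for w, nbrs in graph.items()}
    remaining = set(graph)
    core, k = {}, 0
    while remaining:
        w = min(remaining, key=lambda x: (degree[x], x))
        k = max(k, degree[w])
        core[w] = k
        remaining.remove(w)
        for u in graph[w]:
            if u in remaining:
                degree[u] -= 1
    return core


def fuse(initial, core, entropy, alpha=0.5, beta=0.5):
    """Hypothetical fusion rule: boost the initial score by the normalised
    core number and by a caller-supplied entropy feature."""
    max_core = max(core.values()) or 1
    return {
        w: initial[w] * (1 + alpha * core[w] / max_core
                         + beta * entropy.get(w, 0.0))
        for w in initial
    }


if __name__ == "__main__":
    tokens = "keyword extraction graph keyword ranking graph entropy".split()
    graph = cooccurrence_graph(tokens)
    weight = {w: 1.0 for w in graph}
    weight["entropy"] = 3.0  # pretend this rare word has a high feature weight
    initial = weighted_textrank(graph, weight)
    final = fuse(initial, core_numbers(graph), entropy={})
    print(sorted(final, key=final.get, reverse=True)[:3])
```

In this sketch a rare word can still rank highly in two ways: a large feature weight draws more score from its neighbours during iteration, and membership in a dense core raises its fused score afterwards. The relative influence of each feature depends entirely on the assumed `alpha` and `beta`.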

CLC number: