Keyword extraction method for scientific text based on improved TextRank

doi:10.11772/j.issn.1001-9081.2023060845

Journal of Computer Applications, 2024, Vol. 44, Issue (6): 1720-1726. DOI: 10.11772/j.issn.1001-9081.2023060845

Special topic: The 38th CCF National Conference of Computer Applications (CCF NCCA 2023)


Dongju YANG¹,², Chengfu HU¹,²

¹ School of Information Science and Technology, North China University of Technology, Beijing 100144, China
² Beijing Key Laboratory on Integration and Analysis of Large-Scale Stream Data (North China University of Technology), Beijing 100144, China

Received: 2023-07-04 Revised: 2023-08-03 Accepted: 2023-08-07 Online: 2023-08-28 Published: 2024-06-10
Contact: Dongju YANG
About author: HU Chengfu, born in 1997 in Chenzhou, Hunan, M. S. candidate. His research interests include natural language processing.
Supported by:
Guangzhou Science and Technology Plan Project (202206030009)

Abstract

To address the poor extraction of words that occur infrequently yet express the main idea of a text well in keyword extraction for scientific text, a keyword extraction method based on improved TextRank was proposed. Firstly, the Term Frequency-Inverse Document Frequency (TF-IDF) statistical features and positional features of the words were used to optimize the transition probability matrix between words in the co-occurrence graph, and the initial scores of the words were obtained through iterative computation. Then, the K-Core decomposition algorithm was used to mine K-Core subgraphs and obtain the hierarchical features of the words, and an average information entropy feature was used to measure the topic representation ability of the words. Finally, the hierarchical feature and the average information entropy feature were fused with the initial word scores to determine the keywords. Experimental results show that, on the public dataset, compared with the TextRank method and the OTextRank (Optimized TextRank) method, the proposed method increased the average F1 by 6.5 and 3.3 percentage points respectively across experiments extracting different numbers of keywords; on the science and technology service project dataset, compared with the TextRank method and the OTextRank method, the proposed method increased the average F1 by 7.4 and 3.2 percentage points respectively. These results verify the effectiveness of the proposed method in extracting keywords that occur infrequently but express the main idea of the text well.

Key words: scientific text, keyword extraction, TextRank, K-Core graph, average information entropy
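The first two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the window size, the damping factor, the way a per-word weight biases the transition probabilities, and all function names (`build_cooccurrence`, `weighted_textrank`, `core_numbers`) are assumptions made for illustration. The per-word weight stands in for a combined TF-IDF and position feature; the average information entropy feature and the final fusion step are not sketched because the abstract does not define them.

```python
from collections import defaultdict


def build_cooccurrence(tokens, window=3):
    """Undirected co-occurrence graph: two distinct words are linked when they
    fall inside the same span of `window` consecutive tokens."""
    neighbors = defaultdict(set)
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            if v != w:
                neighbors[w].add(v)
                neighbors[v].add(w)
    return dict(neighbors)


def weighted_textrank(neighbors, node_weight, damping=0.85, iters=50):
    """TextRank in which the transition from u to a neighbor v is proportional
    to node_weight[v] (assumed: a positive TF-IDF x position feature) rather
    than uniform over u's neighbors."""
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        new = {}
        for v in neighbors:
            incoming = 0.0
            for u in neighbors[v]:
                total = sum(node_weight[x] for x in neighbors[u])
                incoming += node_weight[v] / total * score[u]
            new[v] = (1 - damping) + damping * incoming
        score = new
    return score


def core_numbers(neighbors):
    """K-Core decomposition by repeatedly peeling a minimum-degree node.
    The core number of a word can serve as its hierarchical (layer) feature."""
    degree = {w: len(ns) for w, ns in neighbors.items()}
    remaining = set(neighbors)
    core = {}
    k = 0
    while remaining:
        w = min(remaining, key=lambda x: degree[x])
        k = max(k, degree[w])
        core[w] = k
        remaining.remove(w)
        for v in neighbors[w]:
            if v in remaining:
                degree[v] -= 1
    return core


# Toy example: "entropy" occurs once, at the edge of the graph.
tokens = "keyword extraction graph keyword ranking graph entropy".split()
graph = build_cooccurrence(tokens, window=2)

uniform = weighted_textrank(graph, {w: 1.0 for w in graph})
# Hypothetical feature weights: give the rare word a larger weight.
weights = {w: 1.0 for w in graph}
weights["entropy"] = 5.0
boosted = weighted_textrank(graph, weights)

cores = core_numbers(graph)
```

With uniform weights the iteration reduces to ordinary TextRank on an unweighted graph. Raising the weight of a rare word redirects more of its neighbors' score toward it, which is the intuition behind using statistical and positional features to help low-frequency words.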

CLC Number:

TP391.1

Dongju YANG, Chengfu HU. Keyword extraction method for scientific text based on improved TextRank[J]. Journal of Computer Applications, 2024, 44(6): 1720-1726.

Table: Precision (P), recall (R) and F1 value, in %, of each method when extracting 3, 5, 7 and 10 keywords

Dataset	Method	P@3	R@3	F1@3	P@5	R@5	F1@5	P@7	R@7	F1@7	P@10	R@10	F1@10
Public dataset	TF-IDF	24.4	30.9	27.2	18.6	38.7	25.1	13.5	40.4	20.3	10.4	43.8	16.8
Public dataset	TextRank	22.3	28.4	25.0	16.9	35.3	22.9	13.7	39.6	20.4	10.7	43.9	17.3
Public dataset	OTextRank	28.9	32.1	30.4	19.9	40.6	27.0	14.8	43.6	22.1	11.9	48.2	19.2
Public dataset	MFTextRank	30.8	38.6	34.2	22.6	46.3	30.4	17.6	50.0	26.0	13.2	53.1	21.1
Science and technology service project dataset	TF-IDF	27.8	22.2	24.7	22.0	29.0	25.0	17.8	32.5	23.0	14.0	36.1	20.2
Science and technology service project dataset	TextRank	24.2	19.2	21.4	19.2	25.2	21.8	15.9	29.1	20.6	12.6	32.7	18.2
Science and technology service project dataset	OTextRank	28.9	26.9	27.9	21.6	32.4	25.9	18.7	35.1	24.4	14.2	36.5	20.4
Science and technology service project dataset	MFTextRank	32.9	26.4	29.3	28.2	37.2	32.1	21.3	39.1	27.6	15.6	40.4	22.5
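The average F1 gains quoted in the abstract can be checked against the table with a short calculation. The sketch below assumes that the MFTextRank row is the proposed method (the abstract does not name it) and that "F1 mean" is the plain average of the four F1 values for 3, 5, 7 and 10 extracted keywords; the dictionary keys and the helper `mean_f1_gain` are illustrative names.

```python
from statistics import mean

# F1 values (%) copied from the results table, for 3, 5, 7 and 10 keywords.
f1 = {
    "public": {
        "TextRank": [25.0, 22.9, 20.4, 17.3],
        "OTextRank": [30.4, 27.0, 22.1, 19.2],
        "MFTextRank": [34.2, 30.4, 26.0, 21.1],
    },
    "sci_tech_service": {
        "TextRank": [21.4, 21.8, 20.6, 18.2],
        "OTextRank": [27.9, 25.9, 24.4, 20.4],
        "MFTextRank": [29.3, 32.1, 27.6, 22.5],
    },
}


def mean_f1_gain(dataset, baseline, proposed="MFTextRank"):
    """Difference of average F1 (percentage points) between two methods."""
    return mean(f1[dataset][proposed]) - mean(f1[dataset][baseline])


for dataset in f1:
    for baseline in ("TextRank", "OTextRank"):
        gain = mean_f1_gain(dataset, baseline)
        print(f"{dataset}: MFTextRank vs {baseline}: {gain:+.3f} points")
```

Under these assumptions the differences come out at roughly 6.5 and 3.3 points on the public dataset and 7.4 and 3.2 points on the science and technology service project dataset, matching the figures reported in the abstract after rounding to one decimal place.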


