Journal of Computer Applications ›› 2025, Vol. 45 ›› Issue (6): 1888-1894. DOI: 10.11772/j.issn.1001-9081.2024060898

• Data Science and Technology •

Comparability assessment and comparative citation generation method for scientific papers

Xiangyu LI 1, Jingqiang CHEN 1,2

  1. School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210023, China
  2. Jiangsu Key Laboratory of Big Data Security and Intelligent Processing (Nanjing University of Posts and Telecommunications), Nanjing, Jiangsu 210023, China
  • Received: 2024-06-28  Revised: 2024-09-10  Accepted: 2024-09-12  Online: 2024-09-25  Published: 2025-06-10
  • Corresponding author: Jingqiang CHEN (cjq@njupt.edu.cn)
  • About the authors: LI Xiangyu, born in 2001 in Weifang, Shandong, M.S. candidate. His research interests include natural language processing and text generation.
    CHEN Jingqiang, born in 1983 in Wenzhou, Zhejiang, Ph.D., associate professor. His research interests include text summarization, natural language processing, and artificial intelligence.
  • Supported by:
    Young Scientists Fund of the National Natural Science Foundation of China (62102192)

Abstract:

To address the two major challenges in comparative citation generation, namely accurately determining the comparability between papers and generating comparative sentences, a Comparability Assessment (CA) and comparative citation generation method for scientific papers, named SciCACG (Scientific Comparability Assessment and Citation Generation), was proposed. Three core modules were constructed in the proposed method: a CA module, which was used to determine whether two papers were comparable; a Comparison object Extraction (CE) module, which was employed to extract specific comparison objects from the papers and references; and a comparative citation generation module, which was responsible for generating the corresponding comparative citation sentences. Firstly, the SciBERT (Scientific BERT) model was used to process the two input papers, and their comparability was assessed through the CA module. Then, for papers determined to be comparable, the CE module was used to identify and extract key comparison objects. Finally, the comparative citation generation module was utilized to generate comparative citations containing these objects. Experimental results show that in the CA stage, the proposed method achieves 0.532 in Mean Reciprocal Rank (MRR) and 0.731 in Recall@10 (R@10), outperforming the previous SciBERT-FNN (Scientific Bidirectional Encoder Representations from Transformers-Feedforward Neural Network) method on all the datasets; in comparative citation generation, compared with the second-best BART-Large (Bidirectional and Auto-Regressive Transformers-Large) method, the F1 scores of ROUGE (Recall-Oriented Understudy for Gisting Evaluation)-1, ROUGE-2 and ROUGE-L of the proposed method are improved by 1.90, 1.29 and 2.55 percentage points, respectively. Additionally, the results confirm that automated comparison and analysis techniques for scientific literature are of great significance to citation sentence generation tasks, and demonstrate substantial practical value in enhancing the traceability of comparative information and ensuring the comprehensiveness of citation sentences.
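The abstract describes a three-stage pipeline (comparability assessment, comparison object extraction, citation generation). Below is a minimal illustrative sketch of such a pipeline, assuming generic pretrained Hugging Face checkpoints (allenai/scibert_scivocab_uncased for the CA classifier and facebook/bart-large for the generator) that would still need fine-tuning on the paper's data; the input format, the pre-supplied comparison objects standing in for the CE module, and all hyperparameters are assumptions for illustration, not the authors' released implementation.

```python
# Illustrative sketch only: generic checkpoints stand in for the fine-tuned
# CA and generation models of SciCACG; input formatting, the keyword list that
# replaces the CE module, and all hyperparameters are assumptions.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          BartForConditionalGeneration, BartTokenizer)

# --- CA module: score whether two papers are comparable (head needs fine-tuning) ---
ca_tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
ca_model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=2)  # classification head is randomly initialized

def comparability_score(citing_abstract: str, reference_abstract: str) -> float:
    """Return an (untrained) estimate of P(comparable) for a citing/reference pair."""
    enc = ca_tok(citing_abstract, reference_abstract,
                 truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = ca_model(**enc).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

# --- Generation module: produce a comparative citation sentence ---
gen_tok = BartTokenizer.from_pretrained("facebook/bart-large")
gen_model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def generate_comparative_citation(citing_abstract: str, reference_abstract: str,
                                  comparison_objects: list[str]) -> str:
    """comparison_objects would come from the CE module; here they are supplied by hand."""
    source = (f"citing: {citing_abstract} reference: {reference_abstract} "
              f"objects: {'; '.join(comparison_objects)}")
    enc = gen_tok(source, truncation=True, max_length=1024, return_tensors="pt")
    with torch.no_grad():
        out = gen_model.generate(**enc, num_beams=4, max_length=64)
    return gen_tok.decode(out[0], skip_special_tokens=True)

if __name__ == "__main__":
    citing = "We propose a graph-based model for citation recommendation ..."
    candidates = {"ref1": "A transformer-based citation recommender ...",
                  "ref2": "A survey of protein folding methods ..."}
    # Rank candidate references by comparability score; this ranking setting is
    # also how MRR and R@10 would be computed for the CA stage.
    ranked = sorted(candidates,
                    key=lambda r: comparability_score(citing, candidates[r]),
                    reverse=True)
    best = ranked[0]
    print(generate_comparative_citation(citing, candidates[best],
                                        ["accuracy", "training cost"]))
```

Without fine-tuning, both models produce uninformative output; the sketch is meant only to show how the CA, CE and generation stages could be chained.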

Key words: comparative citation, Comparability Assessment (CA), citation generation, text generation, text classification, Comparison object Extraction (CE)
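As a reading aid, the ranking metrics reported for the CA stage are assumed here to follow their standard definitions: with Q the set of citing-paper queries, rank_i the position of the first truly comparable reference in the ranked candidate list for query i, R_i the set of comparable references for query i, and Top10_i the ten highest-ranked candidates,

```latex
\mathrm{MRR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{\mathrm{rank}_i},
\qquad
\mathrm{R@10} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{|R_i \cap \mathrm{Top10}_i|}{|R_i|}
```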
