Comparability assessment and comparative citation generation method for scientific papers

doi:10.11772/j.issn.1001-9081.2024060898

Journal of Computer Applications, 2025, Vol. 45, Issue (6): 1888-1894.

• Data science and technology •

Xiangyu LI¹, Jingqiang CHEN¹,²

¹ School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, Jiangsu 210023, China
² Jiangsu Key Laboratory of Big Data Security and Intelligent Processing (Nanjing University of Posts and Telecommunications), Nanjing, Jiangsu 210023, China

Received: 2024-06-28  Revised: 2024-09-10  Accepted: 2024-09-12  Online: 2024-09-25  Published: 2025-06-10
Contact: Jingqiang CHEN
About the authors: LI Xiangyu, born in 2001, M. S. candidate. His research interests include natural language processing and text generation.
CHEN Jingqiang, born in 1983, Ph. D., associate professor. His research interests include text summarization, natural language processing, and artificial intelligence.
Supported by: Young Scientists Fund of the National Natural Science Foundation of China (62102192)

Chinese title: 科研论文的可比性评估与比较性引文生成方法
Authors (Chinese): 李翔宇, 陈景强
Additional author details: LI Xiangyu is from Weifang, Shandong. CHEN Jingqiang is from Wenzhou, Zhejiang (cjq@njupt.edu.cn).

Abstract

To address the two major challenges in comparative citation generation, namely accurately determining whether two papers are comparable and generating comparative sentences, a Comparability Assessment (CA) and comparative citation generation method for scientific papers, named SciCACG (Scientific Comparability Assessment and Citation Generation), was proposed. The method consists of three core modules: a CA module, which determines whether two papers are comparable; a Comparison object Extraction (CE) module, which extracts specific comparison objects from the papers and references; and a comparative citation generation module, which generates the corresponding comparative citation sentences. Firstly, the SciBERT (Scientific BERT) model was used to process the two input papers, and their comparability was assessed by the CA module. Then, for papers determined to be comparable, the CE module was used to identify and extract key comparison objects. Finally, the comparative citation generation module was used to generate comparative citations containing these objects. Experimental results show that in the CA stage, the proposed method achieves 0.532 in Mean Reciprocal Rank (MRR) and 0.731 in Recall@10 (R@10), and outperforms the previous SciBERT-FNN (Scientific Bidirectional Encoder Representations from Transformers-Feedforward Neural Network) method on all the datasets; in comparative citation generation, compared with the second-best BART-Large (Bidirectional and Auto-Regressive Transformers-Large) method, the proposed method improves the F1 scores of ROUGE (Recall-Oriented Understudy for Gisting Evaluation)-1, ROUGE-2, and ROUGE-L by 1.90, 1.29, and 2.55 percentage points, respectively. Additionally, the results confirm that automated comparison and analysis of scientific literature is important for citation sentence generation; in particular, it improves the traceability of comparative information and the comprehensiveness of citation sentences, demonstrating substantial practical value.

Key words: comparative citation, Comparability Assessment (CA), citation generation, text generation, text classification, Comparison object Extraction (CE)
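
The three-stage flow described in the abstract (CA as a gate, then CE, then generation) can be sketched as below. This is only an illustrative skeleton, not the authors' implementation: the `Paper` type, the function name `comparative_citation_pipeline`, the callable parameters `assess`, `extract`, and `generate`, and the `threshold` value of 0.5 are all hypothetical names and choices introduced here, standing in for the SciBERT-based modules that this page does not detail.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Paper:
    # Hypothetical minimal representation of an input paper.
    title: str
    abstract: str


def comparative_citation_pipeline(
    citing: Paper,
    cited: Paper,
    assess: Callable[[Paper, Paper], float],
    extract: Callable[[Paper, Paper], List[str]],
    generate: Callable[[Paper, Paper, List[str]], str],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Optional[str]:
    """Run CA, then CE, then generation; return None for non-comparable pairs."""
    # Stage 1 (CA): score the pair and stop early if it is not comparable.
    if assess(citing, cited) < threshold:
        return None
    # Stage 2 (CE): collect the objects the two papers can be compared on.
    objects = extract(citing, cited)
    # Stage 3: generate a citation sentence conditioned on those objects.
    return generate(citing, cited, objects)
```

Passing the stages in as callables keeps the gating logic separate from any particular model, so trained components can replace toy stand-ins without changing the control flow.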

CLC Number: TP391.1

Xiangyu LI, Jingqiang CHEN. Comparability assessment and comparative citation generation method for scientific papers[J]. Journal of Computer Applications, 2025, 45(6): 1888-1894.


Experiment	Accuracy	Recall	F1 score
10-fold cross-validation	93.24	93.24	94.32
External test set	90.87	91.02	90.87

Label	Training samples	Validation samples	Test samples
Comparable	27,416	1,403	1,523
Not comparable	82,306	4,396	4,536

Method	ACL-200 MRR	ACL-200 R@10	FullTextPeerRead MRR	FullTextPeerRead R@10	CA MRR	CA R@10
DualCon	0.335	0.647
DualEnh	0.366	0.703
BERT-FNN	0.482	0.736	0.458	0.706	0.508	0.716
SciBERT-FNN	0.531	0.779	0.536	0.773	0.521	0.724
SciBERT-CNN	0.541	0.781	0.539	0.780	0.525	0.726
SciBERT-LSTM	0.545	0.785	0.542	0.781	0.525	0.728
SciCACG	0.552	0.787	0.545	0.783	0.532	0.731
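
For readers unfamiliar with the two ranking metrics in the table above, the following is a minimal sketch of how MRR and Recall@K are commonly computed. It assumes each query comes with a ranked candidate list and a non-empty set of gold items, and that Recall@K is averaged per query; the function names are illustrative, and the paper's exact evaluation protocol is not given on this page.

```python
from typing import Hashable, Sequence, Set


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    relevant: Sequence[Set[Hashable]],
) -> float:
    """Average of 1/rank of the first relevant item (0 if none is retrieved)."""
    total = 0.0
    for ranked, gold in zip(rankings, relevant):
        for position, item in enumerate(ranked, start=1):
            if item in gold:
                total += 1.0 / position
                break  # only the first relevant hit counts
    return total / len(rankings)


def recall_at_k(
    rankings: Sequence[Sequence[Hashable]],
    relevant: Sequence[Set[Hashable]],
    k: int = 10,
) -> float:
    """Per-query fraction of gold items found in the top k, averaged over queries."""
    total = 0.0
    for ranked, gold in zip(rankings, relevant):
        hits = len(set(ranked[:k]) & gold)
        total += hits / len(gold)  # assumes every gold set is non-empty
    return total / len(rankings)
```

Both functions return values in the range 0 to 1, which is the scale used for the MRR and R@10 columns.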

Method	ROUGE-1 F1	ROUGE-2 F1	ROUGE-L F1
EXT-Oracle	22.21	4.96	16.84
PTGEN	24.60	6.16	19.19
PTGEN-Cross	27.08	7.14	20.61
BART-Large	29.62	9.86	24.51
SciCACG	31.52	11.15	27.06
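
The ROUGE F1 scores above measure n-gram and longest-common-subsequence overlap between a generated citation sentence and the reference sentence. The sketch below is a simplified illustration under stated assumptions: lowercased whitespace tokenization, a single reference, and no stemming or stopword handling, so it will not reproduce the official ROUGE package's numbers exactly. The function names are illustrative. It returns values between 0 and 1, whereas the table reports scores multiplied by 100.

```python
from collections import Counter
from typing import List


def _ngrams(tokens: List[str], n: int) -> Counter:
    # Multiset of n-grams; empty if the text is shorter than n tokens.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap: int, cand_total: int, ref_total: int) -> float:
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """F1 over clipped n-gram matches between candidate and reference."""
    cand = _ngrams(candidate.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum counts
    return _f1(overlap, sum(cand.values()), sum(ref.values()))


def rouge_l_f1(candidate: str, reference: str) -> float:
    """F1 based on the longest common subsequence of tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    prev = [0] * (len(ref) + 1)  # rolling row of the LCS dynamic program
    for tok in cand:
        cur = [0]
        for j, ref_tok in enumerate(ref, start=1):
            cur.append(prev[j - 1] + 1 if tok == ref_tok else max(prev[j], cur[j - 1]))
        prev = cur
    return _f1(prev[-1], len(cand), len(ref))
```

For example, comparing "the cat sat on the mat" against "the cat is on the mat" gives five matching unigrams out of six on each side, so ROUGE-1 precision, recall, and F1 all equal 5/6.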

Model	ROUGE-1 F1	ROUGE-2 F1	ROUGE-L F1
SciCACG	31.52	11.15	27.06
-w/o CA	30.92	10.70	26.71
-w/o CE	30.79	11.09	26.68

Method	Fluency	Relevance	Coherence	Overall quality
Original citation sentence	4.76	4.55	4.82	4.69
BART-Large	3.70	3.37	2.80	3.09
SciCACG	3.68	3.46	2.94	3.14
