Journal of Computer Applications ›› 2024, Vol. 44 ›› Issue (12): 3815-3822. DOI: 10.11772/j.issn.1001-9081.2023121719
• Artificial intelligence •
Prompt learning method for ancient text sentence segmentation and punctuation based on span-extracted prototypical network
Yingjie GAO1, Min LIN1, Siriguleng2,3, Bin LI1, Shujun ZHANG1,2
Received: 2023-12-15
Revised: 2024-02-15
Accepted: 2024-02-26
Online: 2024-03-11
Published: 2024-12-10
Contact: Min LIN
About author: GAO Yingjie, born in 1999, M. S. candidate. Her research interests include natural language processing.
Yingjie GAO, Min LIN, Siriguleng, Bin LI, Shujun ZHANG. Prompt learning method for ancient text sentence segmentation and punctuation based on span-extracted prototypical network[J]. Journal of Computer Applications, 2024, 44(12): 3815-3822.
URL: https://www.joca.cn/EN/10.11772/j.issn.1001-9081.2023121719
| Method | Dataset size (characters)/10⁷ | Pre/% | Rec/% | F1/% |
| --- | --- | --- | --- | --- |
| BERT-BiGRU-CRF | 0.421 | 80.00 | 63.43 | 70.76 |
| BERT-FLAT-CRF | 0.900 | 87.11 | 74.95 | 80.57 |
| Siku-BERT | 2.600 | 87.86 | 87.92 | 87.86 |
| BERT + fine-tuning | 2.900 | 70.92 | 69.88 | 70.40 |
| BERT-LSTM-CRF | 10.200 | — | — | 90.84 |
| SpanProtoNet | 0.038 | 88.94 | 88.96 | 88.95 |
Tab. 1 Analysis of dataset size and indicators for different methods
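The Pre, Rec and F1 columns in these tables are percentage scores over the predicted punctuation marks. For reference, they follow the usual definitions (standard formulas, not quoted from the paper):

```latex
\mathrm{Pre}=\frac{TP}{TP+FP},\qquad
\mathrm{Rec}=\frac{TP}{TP+FN},\qquad
F_1=\frac{2\cdot\mathrm{Pre}\cdot\mathrm{Rec}}{\mathrm{Pre}+\mathrm{Rec}}
```

where TP counts punctuation marks predicted at the correct position with the correct type, FP counts spurious predictions, and FN counts gold marks that the model missed.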
| Method | Pre/% | Rec/% | F1/% |
| --- | --- | --- | --- |
| ERNIE3.0 | 67.83 | 64.16 | 64.72 |
| Siku-BERT | 84.01 | 84.71 | 84.36 |
| Siku-RoBERTa | 84.22 | 84.85 | 84.53 |
| Siku-BERT-BiLSTM-CRF | 86.14 | 85.22 | 85.65 |
| Siku-BERT-BiGRU-CRF | 86.46 | 86.52 | 86.48 |
| Xunzi-QWen-Chat | 88.92 | 96.43 | 92.52 |
| SpanProtoNet | 88.94 | 88.96 | 88.95 |
Tab. 2 Experimental results on Records of the Grand Historian dataset
| Model structure | Pre/% | Rec/% | F1/% |
| --- | --- | --- | --- |
| Siku-BERT | 84.01 | 84.71 | 84.36 |
| Siku-BERT-Span-Linear | 85.86 | 87.24 | 86.54 |
| SpanProtoNet | 88.94 | 88.96 | 88.95 |
Tab. 3 Results of ablation experiment
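The ablation contrasts a plain linear classification head over extracted spans (Siku-BERT-Span-Linear) with the prototype-based classifier of SpanProtoNet. The sketch below only illustrates the general prototypical-network idea in the sense of references [21] and [25]: each punctuation class gets a prototype averaged from its support-set span embeddings, and a query span is assigned to the nearest prototype. The embedding size, distance metric, and data layout are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Minimal sketch of prototype-based span classification in the spirit of
# prototypical networks [21] and SpanProto [25]; shapes and the Euclidean
# metric are illustrative assumptions, not the authors' code.

def build_prototypes(support_embs: dict) -> dict:
    """support_embs: {punctuation_class: array of shape (n_spans, dim)}."""
    return {cls: embs.mean(axis=0) for cls, embs in support_embs.items()}

def classify(query_emb: np.ndarray, prototypes: dict) -> str:
    """Assign the class whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda c: np.linalg.norm(query_emb - prototypes[c]))

rng = np.random.default_rng(0)
support = {"，": rng.normal(0, 1, (5, 768)), "。": rng.normal(3, 1, (5, 768))}
protos = build_prototypes(support)
query = rng.normal(3, 1, 768)   # embedding of a candidate span boundary
print(classify(query, protos))  # expected: "。"
```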
| Method | Pre/% | Rec/% | F1/% |
| --- | --- | --- | --- |
| ERNIE3.0 | 70.02 | 70.11 | 70.06 |
| Siku-BERT | 87.86 | 87.92 | 87.86 |
| Siku-BERT-BiLSTM-CRF | 91.77 | 91.84 | 91.80 |
| Siku-BERT-BiGRU-CRF | 92.63 | 92.43 | 92.52 |
| Xunzi-QWen-Chat | 91.35 | 96.77 | 93.98 |
| SpanProtoNet | 91.60 | 94.68 | 93.12 |
Tab. 4 Experimental results on CCLUE dataset
| Pre-trained model | Pre/% | Rec/% | F1/% |
| --- | --- | --- | --- |
| Siku-BERT | 99.51 | 99.21 | 99.36 |
| Siku-RoBERTa | 99.38 | 99.69 | 99.53 |
Tab. 5 Experimental results of position extractors
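Tab. 5 evaluates the first stage of the model, the position extractor, with two pre-trained encoders. One common way to frame that stage, shown below purely as an illustration (the encoder outputs, scoring head, and threshold are assumptions, not the paper's code), is per-character binary tagging: the encoder yields one vector per character of the unpunctuated text, and a small head scores whether a punctuation boundary follows that character.

```python
import numpy as np

# Hedged sketch of position extraction as per-character binary tagging.
# `char_vectors` stands in for encoder outputs (e.g. from Siku-BERT; the
# actual model loading is omitted); weights and threshold are illustrative.

def extract_boundaries(char_vectors: np.ndarray, w: np.ndarray, b: float,
                       threshold: float = 0.5) -> list:
    """char_vectors: (seq_len, dim) encoder outputs; returns boundary indices."""
    logits = char_vectors @ w + b           # one score per character
    probs = 1.0 / (1.0 + np.exp(-logits))   # sigmoid
    return [i for i, p in enumerate(probs) if p > threshold]

rng = np.random.default_rng(1)
hidden = rng.normal(size=(12, 768))         # 12 characters of unpunctuated text
w, b = rng.normal(size=768), 0.0
print(extract_boundaries(hidden, w, b))     # indices after which to punctuate
```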
| Original text | Predicted text | Position prediction F1/% | Category prediction F1/% |
| --- | --- | --- | --- |
| 杓端有两星:一内为矛,招摇;一外为盾,天锋。有句圜十五星,属杓,曰贱人之牢。其牢中星实则囚多,虚则开出。天一、枪、棓、矛、盾动摇,角大,兵起。 | 杓端有两星:一内为矛,招摇;一外为盾,天锋。有句圜十五星,属杓,曰贱人之牢;其牢中星实则囚多,虚则开出。天一、枪、棓、矛,盾,动摇,角大,兵起。 | 96.90 | 85.43 |
| 黄帝者,少典之子,姓公孙,名曰轩辕。生而神灵,弱而能言,幼而徇齐,长而敦敏,成而聪明。轩辕之时,神农氏世衰。诸侯相侵伐,暴虐百姓,而神农氏弗能征。 | 黄帝者,少典之子,姓公孙,名曰轩辕。生而神灵,弱而能言,幼而徇齐,长而敦敏,成而聪明;轩辕之时,神农氏世衰。诸侯相侵伐,暴虐百姓,而神农氏弗能征。 | 100.00 | 92.86 |
Tab. 6 Error case analysis
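In Tab. 6 the two F1 values per example separate the two stages: position prediction (is a boundary placed after the right character?) and category prediction (is the right mark chosen at that boundary?). One plausible way to score them separately, sketched below under that reading of the table (the paper's exact matching rules are not reproduced here), is to drop the mark type for the position score and keep it for the category score.

```python
# Hedged sketch: "position F1" counts a prediction as correct if the boundary
# index matches the gold annotation regardless of the mark, while
# "category F1" also requires the punctuation mark to match.

def f1(gold: set, pred: set) -> float:
    tp = len(gold & pred)
    if not gold or not pred or tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

gold = {(3, "，"), (7, "。"), (11, "；")}
pred = {(3, "，"), (7, "；"), (11, "；")}   # wrong mark at position 7

position_f1 = f1({i for i, _ in gold}, {i for i, _ in pred})   # 1.0
category_f1 = f1(gold, pred)                                   # 0.667
print(position_f1, category_f1)
```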
References
[1] WANG L L, ZHANG N. Knowledge correlation of Chinese ancient books from the perspective of digital humanity [J]. Journal of Library and Information Science in Agriculture, 2022, 34(9): 51-59.
[2] WANG B L, SHI X D, SU J S. A sentence segmentation method for ancient Chinese texts based on recurrent neural network [J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 2017, 53(2): 255-261.
[3] HAN X, WANG H, ZHANG S, et al. Sentence segmentation for classical Chinese based on LSTM with radical embedding [J]. The Journal of China Universities of Posts and Telecommunications, 2019, 26(2): 1-8.
[4] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 6000-6010.
[5] YU J S, WEI Y, ZHANG Y W. Automatic ancient Chinese texts segmentation based on BERT [J]. Journal of Chinese Information Processing, 2019, 33(11): 57-63.
[6] HONG T, CHENG R X, LIU S X, et al. An automatic punctuation technique based on Transformer model [J]. Digital Humanities, 2021(2): 111-122.
[7] WANG Q, WANG D B, LI B, et al. Deep learning based automatic sentence segmentation and punctuation model for massive classical Chinese literature [J]. Data Analysis and Knowledge Discovery, 2021, 5(3): 25-34.
[8] CHEN T Y, CHEN R, PAN L L, et al. Archaic Chinese punctuating sentences based on context n-gram model [J]. Computer Engineering, 2007, 33(3): 192-193, 196.
[9] HUANG J N, HOU H Q. On sentence segmentation and punctuation model for ancient books on agriculture [J]. Journal of Chinese Information Processing, 2008, 22(4): 31-38.
[10] ZHANG K X, XIA Y Q, YU H. CRF-based approach to sentence segmentation and punctuation for ancient Chinese prose [J]. Journal of Tsinghua University (Science and Technology), 2009, 49(10): 1733-1736.
[11] WANG B, SHI X, TAN Z, et al. A sentence segmentation method for ancient Chinese texts based on NNLM [C]// Proceedings of the 2016 Workshop on Chinese Lexical Semantics. Cham: Springer, 2016: 387-396.
[12] WANG D B, LIU C, ZHU Z H, et al. Construction and application of pre-trained models of Siku Quanshu in orientation to digital humanities [J]. Library Tribune, 2022, 42(6): 31-43.
[13] ZHAO L Z, ZHANG Y Q, LIU J F, et al. Study on automatic punctuation of ancient Chinese classics of pre-Qin and Han dynasties in the context of digital humanities: taking SikuBERT pre-training model for example [J]. Library Tribune, 2022, 42(12): 120-128, 137.
[14] WANG Y, GU L. Automatic punctuation of ancient books based on BERT+BiLSTM+CRF model and new preprocessing method [J]. Software Guide, 2022, 21(9): 7-13.
[15] CHENG N. Research on integrated processing technology of sentence segmentation and lexical analysis of ancient books based on deep learning [D]. Nanjing: Nanjing Normal University, 2020: 53-57.
[16] ZHUANG B C. Automatic segmentation and punctuation of ancient Chinese based on deep learning [D]. Wuhan: Wuhan Research Institute of Posts and Telecommunications, 2022: 41-50.
[17] HAN X. Research on sentence segmentation of classical Chinese based on Transformer-CRF: taking epitaph of Tang dynasty as an example [J]. Technology Intelligence Engineering, 2021, 7(5): 30-39.
[18] WANG Y, YAO Q, KWOK J T, et al. Generalizing from a few examples: a survey on few-shot learning [J]. ACM Computing Surveys, 2021, 53(3): No.63.
[19] HUISMAN M, VAN RIJN J N, PLAAT A. A survey of deep meta-learning [J]. Artificial Intelligence Review, 2021, 54: 4483-4541.
[20] KULKARNI V, MEHDAD Y, CHEVALIER T. Domain adaptation for named entity recognition in online media with word embeddings [EB/OL]. [2023-12-12].
[21] SNELL J, SWERSKY K, ZEMEL R. Prototypical networks for few-shot learning [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook: Curran Associates Inc., 2017: 4080-4090.
[22] WEI J, ZOU K. EDA: easy data augmentation techniques for boosting performance on text classification tasks [C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Stroudsburg: ACL, 2019: 6382-6388.
[23] GENG R, LI B, LI Y, et al. Dynamic memory induction networks for few-shot text classification [C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2020: 1087-1094.
[24] SHENG J, GUO S, CHEN Z, et al. Adaptive attentional network for few-shot knowledge graph completion [C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2020: 1681-1691.
[25] WANG J, WANG C, TAN C, et al. SpanProto: a two-stage span-based prototypical network for few-shot named entity recognition [C]// Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2022: 3466-3476.
[26] YUAN Z, CHEN Z, ZHOU Y, et al. Enhance prototypical network with hybrid loss for few-shot text classification [J]. Journal of Physics: Conference Series, 2023, 2555: No.012017.
[27] FANG Y, LIU Z, PAN S, et al. Few-shot relation extraction based on multi-level feature metric learning [C]// Proceedings of the SPIE 12610, 3rd International Conference on Artificial Intelligence and Computer Engineering. Bellingham, WA: SPIE, 2022: No.1261058.
[28] KAUFMANN B, BUSBY D, DAS C K, et al. Validation of a zero-shot learning natural language processing tool for data abstraction from unstructured healthcare data [EB/OL]. [2023-12-08].
[29] LIU P, YUAN W, FU J, et al. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing [J]. ACM Computing Surveys, 2023, 55(9): No.195.
[30] SCHICK T, SCHÜTZE H. Exploiting cloze questions for few-shot text classification and natural language inference [C]// Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Stroudsburg: ACL, 2021: 255-269.
[31] LI X L, LIANG P. Prefix-tuning: optimizing continuous prompts for generation [C]// Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Stroudsburg: ACL, 2021: 4582-4597.
[32] LIU X, ZHENG Y, DU Z, et al. GPT understands, too [EB/OL]. [2023-12-09].
[33] SUN Y, WANG S, FENG S, et al. ERNIE 3.0: large-scale knowledge enhanced pre-training for language understanding and generation [EB/OL]. [2023-12-12].